facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai

Memory efficient training #2047

Closed: rom1504 closed this issue 3 years ago

rom1504 commented 3 years ago

I would like to train an index on a large number of embeddings in a memory-constrained environment, and I'm wondering what the best ways to do that are.

Currently I am using `index.train(embeddings)`, which requires the embeddings to be fully in memory. When training a large index, this can mean holding tens of GBs in memory, which eventually hits the machine's limits.
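For context, this is roughly what I am doing today (the file name is a placeholder; the index string matches the 131072-centroid setup I mention below):

```python
import numpy as np
import faiss

d = 512  # embedding dimensionality

# The whole training set has to be materialized in RAM before training starts.
embeddings = np.load("train_embeddings.npy")  # float32, shape (n_train, d)

index = faiss.index_factory(d, "IVF131072,Flat")
index.train(embeddings)  # memory-hungry: all n_train vectors are resident
```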

Is there a way to avoid loading all the training embeddings into memory?

Here are the ideas I have at the moment:

I would be interested to know if you have any advice on the topic, thanks!

mdouze commented 3 years ago

First, it seems suspicious that the memory for the training vectors is the limiting factor, because if you maintain a reasonable ratio of

`# training vectors / # centroids`

(normally between 50 and 1000), then the CPU cost will almost certainly dominate.
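Concretely, a quick back-of-the-envelope check of that rule, using the 131072 centroids mentioned later in this thread:

```python
nlist = 131072                       # number of IVF centroids
lo, hi = 50 * nlist, 1000 * nlist    # recommended training-set size range
print(f"{lo:,} to {hi:,} training vectors")
# -> 6,553,600 to 131,072,000 training vectors
```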

Memory mapping would work, even from Python (load the vectors with `np.memmap`).
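For instance, a sketch assuming the training vectors are stored as a raw float32 file (the path, sizes, and index string are placeholders):

```python
import numpy as np
import faiss

d = 512
n_train = 50_000_000  # placeholder size

# Map the file into virtual memory: pages are read on demand while faiss
# scans the data, so only a fraction of it needs to be resident in RAM.
xt = np.memmap("train_embeddings.f32", dtype="float32", mode="r",
               shape=(n_train, d))

index = faiss.index_factory(d, "IVF131072,Flat")
# The Python wrapper passes the contiguous float32 buffer through without
# copying, so training reads straight from the mapped file.
index.train(xt)
```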

rom1504 commented 3 years ago

Thanks for your answer and advice!

About the CPU/memory cost:

If using, for example, 1M centroids with the 50× lower bound of the ratio above, that means 50M training vectors, so the memory needed would be 50·10⁶ × 512 × 4 bytes / 10⁹ ≈ 100 GB for embeddings of size 512.

The CPU cost of training with 1M centroids would definitely be high, but that is only a time constraint, whereas the memory constraint can be a hard blocker. GPUs do not have 100 GB of VRAM, so it would definitely be a blocker there too. (I also saw in your benchmarks that you trained with up to 4M centroids in the 1G-embeddings setup, https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors#1b-datasets, so I guess you found a solution for that case?)

I'm working with 400M embeddings (and soon more), so I think increasing the number of centroids would help. (I'm only using 131072 centroids for now.)

I will check whether memory mapping can work efficiently for this.

mdouze commented 3 years ago

At clustering time, the GPU does not need to store the training set, only the centroids, so that is not a blocker.
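A rough sketch of what this looks like with the Python API (sizes and paths are illustrative; the training data stays on the host, here behind a memmap, and the GPU only holds the centroid table used for the assignment step):

```python
import numpy as np
import faiss

d, nlist = 512, 1_000_000  # illustrative sizes from the discussion above

# Training vectors stay on the host (behind a memmap, never fully in RAM).
xt = np.memmap("train_embeddings.f32", dtype="float32", mode="r",
               shape=(50 * nlist, d))

res = faiss.StandardGpuResources()
clus = faiss.Clustering(d, nlist)
clus.niter = 10

# The assignment index holds only the nlist centroid vectors in GPU memory;
# the training vectors are fed through it in batches during each iteration.
assign_index = faiss.GpuIndexFlatL2(res, d)
clus.train(xt, assign_index)

centroids = faiss.vector_float_to_array(clus.centroids).reshape(nlist, d)
```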

hustnn commented 1 year ago

> At clustering time, the GPU does not need to store the training set, only the centroids, so that is not a blocker.

Hi @mdouze, may I know why the training set is not required? From https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization#clustering, the training set is the parameter of `kmeans.train(x)`. Thanks.