First, it seems suspicious that the memory for the training vectors is the limiting factor: if you maintain a reasonable ratio of # training vectors / # centroids (normally between 50 and 1000), then the CPU cost will almost certainly dominate.
Memory mapping would work, even from Python (load the vectors with np.memmap).
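For illustration, a minimal sketch combining both suggestions: memory-map the raw vectors and train on a subsample sized by the ratio rule (the file name, counts, and on-disk layout here are assumptions, not from the thread):

```python
import numpy as np
import faiss

d = 512                     # embedding dimension (assumed)
n_total = 400_000_000       # number of vectors in the file (assumed)
n_centroids = 131072
n_train = 50 * n_centroids  # ~50 training vectors per centroid

# Memory-map the raw float32 matrix; nothing is read into RAM yet.
x = np.memmap("embeddings.f32", dtype="float32", mode="r",
              shape=(n_total, d))

# Sample training rows (with replacement -- duplicates are harmless for
# k-means at this scale) and materialize only those rows in RAM.
rng = np.random.default_rng(0)
idx = np.sort(rng.integers(0, n_total, size=n_train))
x_train = np.ascontiguousarray(x[idx])

index = faiss.index_factory(d, "IVF131072,Flat")
index.train(x_train)  # only the subsample (~13 GB here) is in memory
```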
Thanks for your answer and advice!
About the CPU/memory cost: if using for example 1M centroids with a ratio of 50 training vectors per centroid, the memory needed would be 50 * 10^6 * 512 * 4 / 10^9 ≈ 100 GB for embeddings of size 512.
The CPU cost of training with 1M centroids would definitely be high, but that's only a time constraint, whereas the memory constraint can be a blocker. GPUs do not have 100 GB of VRAM, so it would definitely be a blocker there too. (I saw in your benchmarks that you trained with up to 4M centroids in the 1G-embeddings setup, https://github.com/facebookresearch/faiss/wiki/Indexing-1G-vectors#1b-datasets, so I guess you found a solution for this case?)
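For concreteness, that back-of-the-envelope estimate as a small helper (it just restates the arithmetic above; the default values are the numbers from this thread):

```python
def train_set_size_gb(n_centroids, ratio=50, dim=512, bytes_per_value=4):
    """Approximate training-set size: ratio * n_centroids float32 vectors."""
    return n_centroids * ratio * dim * bytes_per_value / 1e9

print(train_set_size_gb(1_000_000))  # ~102 GB: the 1M-centroid case above
print(train_set_size_gb(131072))     # ~13 GB: the current 131072-centroid setup
```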
I'm working with 400M embeddings (and soon more), so increasing the number of centroids would, I think, help. (I'm only using 131072 centroids for now.)
I will check if the memory mapping can work efficiently for this.
At clustering time, the GPU does not need to store the training set, only the centroids, so that is not a blocker.
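For illustration (a minimal sketch, not from the thread): with the Python faiss.Kmeans helper, the training vectors are passed as a regular CPU numpy array, and per the comment above only the centroids need to fit on the GPU:

```python
import numpy as np
import faiss

d, k = 512, 4096
x_train = np.random.rand(50 * k, d).astype("float32")  # stays in host RAM

# gpu=True runs the assignment step on the GPU; the training set itself
# is not resident in GPU memory.
km = faiss.Kmeans(d, k, niter=20, verbose=True, gpu=True)
km.train(x_train)
centroids = km.centroids  # (k, d) float32 centroids, back on the CPU
```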
> At clustering time, the GPU does not need to store the training set, only the centroids, so that is not a blocker.
Hi @mdouze, may I know why the training set is not required? From https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization#clustering, the training set is the parameter of kmeans.train(x). Thanks.
I would like to train an index with a large number of embeddings in a memory-constrained environment, and I'm wondering what the best ways to do it are.
Currently I am using index.train(embeddings), which requires the embeddings to be fully in memory. When training a large index, that can mean tens of GBs in memory, which eventually reaches the machine's limits.
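For reference, the pattern in question looks roughly like this (a minimal sketch; the index type and file name are assumptions):

```python
import numpy as np
import faiss

d, nlist = 512, 131072

# Load *all* embeddings into RAM at once -- this is the problematic step.
embeddings = np.load("embeddings.npy").astype("float32")

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(embeddings)  # training needs the whole array resident in memory
```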
Are there ways to avoid loading all the training embeddings in memory?
Those are the ideas I have at the moment:
- memory mapping the embeddings file (e.g. with np.memmap) instead of loading it fully
I would be interested to know if you have any advice on the topic, thanks!