criteo / autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0

Distributed training #116

Open fonspa opened 2 years ago

fonspa commented 2 years ago

Hi, thanks to all maintainers of this project — it's a great tool for streamlining the building and tuning of a Faiss index.

I have a quick, possibly dumb question about training an index in distributed mode. Am I correct that training is done on the host, i.e. non-distributed, and that only the adding/optimizing part is distributed? After a quick look at the code and docs, that seems to be the case. If so, would it be possible to train the index in a distributed fashion as well?

rom1504 commented 2 years ago

hey! glad you like autofaiss

Currently, autofaiss indeed trains the index on a single node. Usually this is not a problem because the number of points used for training is only a small fraction of the whole embedding set, typically up to 32x the number of clusters — so for example ~3M points, even for a billion-size index. Training therefore takes at most around an hour.
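For reference, here is a minimal faiss-only sketch of what that single-node training step amounts to; the dimensions, nlist value, and the 32x subsampling factor below are illustrative assumptions, not autofaiss internals:

```python
import numpy as np
import faiss

d = 128            # embedding dimensionality (illustrative)
nlist = 1024       # number of IVF clusters (e.g. 2**17 for a billion-scale index)
train_factor = 32  # ~32 training points per cluster

# Stand-in for the full embedding set; in reality this would be streamed from disk.
xb = np.random.rand(200_000, d).astype("float32")

# Only train_factor * nlist points are clustered, which is why training
# stays tractable on one node even for very large collections.
n_train = min(len(xb), train_factor * nlist)
xt = xb[np.random.choice(len(xb), size=n_train, replace=False)]

index = faiss.index_factory(d, f"IVF{nlist},Flat")
index.train(xt)  # single-node k-means over the subsample
index.add(xb)    # the add/optimize phase is the part that can be distributed
```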

However, distributing the training is technically possible. If you have a use case that requires it, I'd advise looking into the pointers I put in this issue: https://github.com/criteo/autofaiss/issues/101
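As a rough faiss-level sketch of the idea (not something autofaiss does today, and not code from that issue): the coarse centroids could be computed by a distributed k-means job and then injected into the IVF index, so only the clustering runs on the cluster. The injection trick below assumes an IVF,Flat index, where only the coarse quantizer needs training:

```python
import numpy as np
import faiss

d = 128
nlist = 1024

# Suppose `centroids` came out of a distributed k-means job (e.g. Spark MLlib),
# collected back to the driver as an (nlist, d) float32 array.
centroids = np.random.rand(nlist, d).astype("float32")  # placeholder

# Build a coarse quantizer that already contains the centroids,
# then mark the index as trained so .add() skips local clustering.
quantizer = faiss.IndexFlatL2(d)
quantizer.add(centroids)

index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.is_trained = True  # valid for IVF,Flat: only the quantizer needs training

xb = np.random.rand(100_000, d).astype("float32")
index.add(xb)
```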

What is your use case?

fonspa commented 2 years ago

Well, I usually train my indexes with a number of points on the higher end of the recommended range, about 128 to 256 * nCentroids.
For about half a billion base vectors, I generally try indexes with 2^17 or 2^18 centroids, which means roughly 20M to 80M points to cluster, hopefully giving the best possible coverage of the distribution of my points.
Training with this many points takes a while! Maybe that's being a bit too cautious and I could use far fewer training points.
Another factor is that I generally can't use all the cores on the host, as it's usually under medium to heavy load; I only get a fraction of those cores, while the cluster cores are plentiful and less often occupied.
Thanks a lot for your answer and the pointer to the distributed k-means, it looks very promising.
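In the meantime, a minimal sketch of one way to keep faiss from grabbing every core on a shared host during training (assuming the standard OpenMP-enabled faiss build; the thread budget is an arbitrary example):

```python
import faiss

# Cap the number of OpenMP threads faiss (including k-means training) will use,
# leaving the remaining host cores free for other workloads.
faiss.omp_set_num_threads(16)
print(faiss.omp_get_max_threads())
```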

rom1504 commented 2 years ago

Did you measure better knn recall by using that many training points? I used 2^17 centroids and 64x training points for a 5B-embedding index and it works well.

In the past we ran experiments varying the number of training points and didn't see a big impact from using many more.

fonspa commented 2 years ago

I did notice a better speed / recall@[1,K] tradeoff (for some value of K that I need) when using more training points, most notably for queries drawn from the database vectors themselves.
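A minimal sketch of how such a recall@K measurement can be set up against brute-force ground truth (all sizes and parameters below are illustrative):

```python
import numpy as np
import faiss

d, nb, nq, k = 128, 100_000, 1_000, 10
xb = np.random.rand(nb, d).astype("float32")
xq = np.random.rand(nq, d).astype("float32")

# Ground truth from an exact (brute-force) index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Approximate IVF index under test.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 16
_, approx = ivf.search(xq, k)

# recall@k: fraction of true top-k neighbors recovered by the IVF index.
recall_at_k = np.mean([len(set(gt[i]) & set(approx[i])) / k for i in range(nq)])
print(f"recall@{k}: {recall_at_k:.3f}")
```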
But I might have to lower the number of training points anyway, since I can't monopolize that many host cores for too long. That's why I was hoping to run the training on the cluster cores instead.

Thanks for your input! It's really useful to hear how others are approaching the problem.