How to apply faiss on a massive dataset?

facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.

https://faiss.ai

MIT License

How to apply faiss on a massive dataset? #126

deatherving closed this issue 7 years ago

deatherving commented 7 years ago

I have a very large dataset and I'm using PQ for similarity search. However, it is impossible to read the whole dataset into memory for training; instead I have to read one slice of the data and train on it, then the second slice, and so on. In the source code I found the Train_hot_start and Train_shared training types, but I don't know which one suits my situation, which is training on one data slice at a time. Could you please briefly show me how these two training types are used on a massive dataset?

mdouze commented 7 years ago

It does not make sense to train a ProductQuantizer object on more than about 256*1000 training vectors. It is based on a kmeans quantizer with a small number of centroids (256) so the effect of adding more training vectors is only to slow down training without any impact on the quantization error. Therefore, if you have a "massive dataset", just sample 256000 vectors out of it and train with that.
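When the data can only be read slice by slice, one way to draw that sample is reservoir sampling, which keeps a uniform sample of fixed size while streaming over the slices once. This is a swapped-in illustration, not something prescribed in this thread; the `reservoir_sample` and `make_slices` names and the toy data are hypothetical, and the sample size is shrunk so the sketch runs quickly.

```python
import random


def reservoir_sample(slices, k, seed=0):
    """Keep a uniform random sample of k vectors from an iterable of slices.

    Classic reservoir sampling (Algorithm R): only the k kept vectors are
    held in memory, never the whole dataset.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for chunk in slices:
        for vec in chunk:
            seen += 1
            if len(reservoir) < k:
                reservoir.append(vec)
            else:
                j = rng.randrange(seen)  # uniform integer in [0, seen)
                if j < k:
                    reservoir[j] = vec
    return reservoir


def make_slices(n_slices=10, slice_len=100, dim=4):
    """Hypothetical stand-in for reading the dataset one slice at a time."""
    for s in range(n_slices):
        yield [[float(s * slice_len + i)] * dim for i in range(slice_len)]


# In practice k would be about 256 * 1000; 50 keeps the toy run small.
sample = reservoir_sample(make_slices(), k=50)

# The sample would then be converted to a float32 array and passed to the
# index's train() method in a single call.
```

The design choice here is a single pass with bounded memory: the loop never holds more than `k` vectors plus the current slice.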

deatherving commented 7 years ago

Thank you very much.

sundl123 commented 7 years ago

Hi, mdouze

According to your answer, does it mean that the optimal number of training vectors = ncentroids * 1000 for a kmeans quantizer? How did you arrive at this conclusion? Is there a mathematical proof?

mdouze commented 7 years ago

kmeans is an inherently approximate algorithm for an NP-hard combinatorial optimization problem.

To see the effect of adding more training data, you can plot the quantization error for a held-out set and train with different training set sizes. The quantization error will be a bit jittery and will saturate soon (probably at a very safe margin below k*1000 vectors).
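A minimal sketch of that experiment, assuming a toy pure-Python k-means on synthetic Gaussian data rather than Faiss itself: it trains on growing prefixes of a pool and measures the mean squared quantization error on a fixed held-out set. The dimensions, `k`, the training sizes, and all function names are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd iterations; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            c = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            counts[c] += 1
            for d, x in enumerate(p):
                sums[c][d] += x
        for c in range(k):
            if counts[c]:  # keep the old centroid if a cluster went empty
                centroids[c] = [s / counts[c] for s in sums[c]]
    return centroids


def quantization_error(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


data_rng = random.Random(1)


def draw(n, dim=4):
    """Synthetic Gaussian vectors standing in for real data."""
    return [[data_rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


k = 8
held_out = draw(500)
pool = draw(4000)
sizes = [16, 64, 256, 1024, 4000]
errors = [quantization_error(held_out, kmeans(pool[:n], k)) for n in sizes]

for n, e in zip(sizes, errors):
    print(n, round(e, 3))
```

Plotting `errors` against `sizes` (for example on a log-scaled x axis) gives the curve described above; the exact values depend on the data and the seeds.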

sundl123 commented 7 years ago

Thanks for the reply!

adityapatadia commented 6 years ago

IndexPQ does not have a parameter that specifies the number of centroids. How should I decide the training sample size? https://rawgit.com/facebookresearch/faiss/master/docs/html/structfaiss_1_1IndexPQ.html#a4fa05430935a02b62b7c15c9840c42fe

mdouze commented 6 years ago

In this case the number of centroids is pq.ksub.

adityapatadia commented 6 years ago

Thanks for the quick response. I am sorry, but the API doc does not mention that parameter. How do I access it from Python? Sample code would be much appreciated.

mdouze commented 6 years ago

index.pq.ksub
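A small sketch of how that value could feed the sample-size rule from earlier in the thread. To my knowledge the per-sub-quantizer centroid count is exposed as `ksub` on the `ProductQuantizer` object in the Faiss Python bindings; the helper name `training_sample_size` and the stand-in object are hypothetical, used so the snippet runs without Faiss installed.

```python
from types import SimpleNamespace


def training_sample_size(index, vectors_per_centroid=1000):
    """Suggested number of training vectors: centroids per sub-quantizer
    times a per-centroid budget (1000 follows the rule of thumb above)."""
    return index.pq.ksub * vectors_per_centroid


# With Faiss installed, usage would look like this (not executed here):
#   import faiss
#   index = faiss.IndexPQ(64, 8, 8)  # d=64, M=8 sub-quantizers, 8 bits each
#   n_train = training_sample_size(index)

# Stand-in with the same attribute layout, assuming 8-bit codes (2**8 centroids).
fake_index = SimpleNamespace(pq=SimpleNamespace(ksub=2 ** 8))
n_train = training_sample_size(fake_index)
```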

adityapatadia commented 6 years ago

Okay. I will try.