Closed deatherving closed 7 years ago
Hi
It does not make sense to train a ProductQuantizer
object on more than about 256*1000 training vectors. It is based on a kmeans quantizer with a small number of centroids (256) so the effect of adding more training vectors is only to slow down training without any impact on the quantization error. Therefore, if you have a "massive dataset", just sample 256000 vectors out of it and train with that.
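That sampling step can be sketched in a few lines of NumPy. The array shapes and the RNG seed below are invented for illustration; in practice the vectors would come from your on-disk dataset rather than a random matrix:

```python
import numpy as np

rng = np.random.default_rng(123)

# Stand-in for a "massive dataset": 500,000 vectors of dimension 16.
# In practice this would be (a memory-mapped view of) your real data.
xb = rng.standard_normal((500_000, 16)).astype("float32")

# 256 centroids per sub-quantizer, ~1000 training points per centroid.
n_train = 256 * 1000

# Uniform sample without replacement; faiss expects float32, C-contiguous.
idx = rng.choice(xb.shape[0], size=n_train, replace=False)
xt = np.ascontiguousarray(xb[idx])

print(xt.shape)  # (256000, 16)
```

The resulting `xt` is what you would pass to `index.train(xt)` instead of the full dataset.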
Thank you very much.
hi, mdouze

According to your answer, does it mean that **optimal number of training vectors = ncentroids * 1000** for a kmeans quantizer? How did you arrive at this conclusion? Is there any mathematical proof?
Hi
kmeans is an inherently approximate algorithm to a NP-hard combinatorial optimization problem.
To see the effect of adding more training data, you can plot the quantization error on a held-out set after training with different training set sizes. The quantization error will be a bit jittery and will saturate soon (probably at a very safe margin below k*1000 vectors).
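That experiment can be sketched with a toy k-means in pure NumPy. In practice you would use faiss's own k-means; here k, the dimension, and the training sizes are scaled way down so the sketch runs in seconds, and the Gaussian data is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(x, k, n_iter=20):
    """Plain Lloyd's algorithm; returns the k centroids."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared L2).
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                centroids[j] = pts.mean(0)
    return centroids

def quantization_error(x, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return float(d.min(1).mean())

k, dim = 16, 8
held_out = rng.standard_normal((2000, dim)).astype("float32")

errs = []
for n_train in (100, 1000, 10000):
    xt = rng.standard_normal((n_train, dim)).astype("float32")
    errs.append(quantization_error(held_out, kmeans(xt, k)))
    print(n_train, errs[-1])
```

Plotting `errs` against the training sizes shows the saturation described above: past a certain size, the held-out error stops improving.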
Thanks for the reply!
This does not have any centroids specification. How do I decide the training sample size? https://rawgit.com/facebookresearch/faiss/master/docs/html/structfaiss_1_1IndexPQ.html#a4fa05430935a02b62b7c15c9840c42fe
In this case the number of centroids is pq.ksub.
Thanks for the quick response. I am sorry, but the API doc does not mention that parameter. How do I access it from Python? Sample code would be much appreciated.
index.pq.ksub
Okay. I will try.
I have a very large dataset and I'm using PQ for similarity search. However, it is impossible to read the whole dataset into memory for training; instead I have to read one slice of the data, train on it, then read the second slice, and so on. In the source code I found the Train_hot_start and Train_shared training types, but I don't know which one is suitable for my situation of training on one data slice at a time. Could you please briefly show how these two training types are used on a massive dataset?
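The earlier advice in this thread (sample ~256k vectors and train once on those) also works when the data only arrives slice by slice: keep a fixed-size uniform sample across all slices with reservoir sampling, then train on the reservoir at the end. A pure-NumPy sketch; the slice sizes, dimension, and capacity here are toy values (in practice the capacity would be ksub * 1000), and the faiss training call is only indicated in a comment:

```python
import numpy as np

rng = np.random.default_rng(42)

class Reservoir:
    """Uniform fixed-size sample over a stream of vectors (reservoir sampling)."""
    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.buf = np.empty((capacity, dim), dtype="float32")
        self.seen = 0  # total vectors offered so far

    def add(self, x):
        for v in x:
            if self.seen < self.capacity:
                self.buf[self.seen] = v
            else:
                # Keep v with probability capacity / (seen + 1),
                # replacing a uniformly chosen slot.
                j = rng.integers(0, self.seen + 1)
                if j < self.capacity:
                    self.buf[j] = v
            self.seen += 1

# Toy capacity; for a real PQ training set use e.g. 256 * 1000.
res = Reservoir(capacity=1000, dim=8)

# Simulate reading the dataset one slice at a time.
for _ in range(20):
    slice_ = rng.standard_normal((500, 8)).astype("float32")
    res.add(slice_)

train_set = res.buf  # pass this once to index.train(...)
print(res.seen, train_set.shape)  # 10000 (1000, 8)
```

This sidesteps the question of incremental training types entirely: the quantizer sees a uniform sample of the whole dataset even though no slice was ever fully resident alongside another.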