facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Question of IVFPQ #2535

Closed SubjectNoi closed 2 months ago

SubjectNoi commented 1 year ago

Summary

Hi all, I'm trying to study the details of IVFPQ. As I understand it, given N points and M clusters (DIM=128), I need to use a clustering algorithm to generate M clusters and M cluster centroids. Once a query arrives, I first compare the query with the M centroids and pick the closest one; then I only need to compare the query with the points within the picked cluster by brute force to select, say, 1000 candidates, and compute recall against the 100 true nearest neighbors.

However, I now find that very few points (<5) in the picked cluster lie in the true top 100. Currently I use k-means to generate the M clusters. Is that a proper choice (should I use k-means or some other clustering algorithm), or is there something I have misunderstood? Thanks.
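For reference, this is roughly how the procedure described above maps onto faiss's IndexIVFFlat (a minimal sketch; the sizes are arbitrary and it covers the IVF part only, without PQ). `train` runs k-means to build the centroids, and with `nprobe = 1` the search scans only the single closest inverted list:

```python
import numpy as np
import faiss

d, nlist = 128, 100                                  # vector dimension, number of clusters (M)
xb = np.random.rand(100_000, d).astype('float32')    # database points (N)
xq = np.random.rand(5, d).astype('float32')          # a few queries

quantizer = faiss.IndexFlatL2(d)                     # holds the M centroids after training
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                      # k-means over xb -> nlist centroids
index.add(xb)                                        # each point goes to its closest centroid's list

index.nprobe = 1                                     # compare query with centroids, keep the closest list
D, I = index.search(xq, 1000)                        # brute-force scan within the probed cluster only
```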

SubjectNoi commented 1 year ago

Currently, the dataset I use is generated by myself: 1048576 points, each with 128 dimensions, where each of the 128 components follows a uniform distribution on (0.0, 1.0). I found the example here (https://gist.github.com/mdouze/046c1960bc82801e6b40ed8ee677d33e); when I set nlist to 128 and nprobe to 1, it still gives me a recall greater than 60%. So is my issue caused by the choice of clustering algorithm or by the distribution of the dataset? Also, is there a way to obtain the trained clusters and centroids (first-level quantizer) of a given dataset through some Python API?
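Regarding the last question: the trained centroids of the first-level quantizer can be read back through the Python API. A minimal sketch, assuming the uniform dataset described above (k-means training on 1M points can take a while):

```python
import numpy as np
import faiss

d, nlist = 128, 128
xb = np.random.uniform(0.0, 1.0, (1_048_576, d)).astype('float32')

# Option 1: run k-means directly and read the centroids.
km = faiss.Kmeans(d, nlist, niter=20)
km.train(xb)
print(km.centroids.shape)                 # (nlist, d) array of trained centroids

# Cluster assignment of every database point (index of its nearest centroid).
_, assignments = km.index.search(xb, 1)

# Option 2: if an IndexIVFFlat / IndexIVFPQ has already been trained,
# the centroids live in its coarse quantizer and can be reconstructed.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)
centroids = index.quantizer.reconstruct_n(0, index.nlist)   # also (nlist, d)
```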

anubhav562 commented 1 year ago

@SubjectNoi

In your explanation you have completely missed the PQ part -> Product Quantization; what you describe only covers the IVF part. PQ is a lossy compression technique used to compress the vectors so that they can be stored in RAM. You can find a bunch of articles about it on Google.
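To make the distinction concrete: in faiss, the PQ part sits on top of the IVF structure. IndexIVFPQ keeps the same coarse quantizer, but instead of storing full vectors in each inverted list it stores each vector as m sub-codes of nbits bits each. A rough sketch (the parameter values here are arbitrary):

```python
import numpy as np
import faiss

d, nlist = 128, 128
m, nbits = 16, 8                          # 16 sub-quantizers x 8 bits = 16 bytes per vector
xb = np.random.rand(100_000, d).astype('float32')
xq = np.random.rand(5, d).astype('float32')

quantizer = faiss.IndexFlatL2(d)          # IVF part: coarse centroids, as before
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)   # PQ part: lossy vector compression
index.train(xb)                           # trains both the coarse quantizer and the PQ codebooks
index.add(xb)

index.nprobe = 8
D, I = index.search(xq, 10)               # distances are approximate because of PQ compression
```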