facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

IMI+OPQ recall lower than IMI alone #1870

Closed WenqiJiang closed 2 months ago

WenqiJiang commented 3 years ago

Dear Faiss team,

I have been trying IMI + OPQ on CPU, using PQ16. Surprisingly, the recall of IMI+OPQ is even lower than that of IMI alone on the SIFT100M dataset. Is this what you would expect, or is there anything I could be doing wrong?

The commands I used for training and querying:

python bench_polysemous_1bn.py SIFT100M IMI2x8,PQ16 nprobe=64 nprobe=128 nprobe=256 nprobe=512

python bench_polysemous_1bn.py SIFT100M OPQ16,IMI2x8,PQ16 nprobe=64 nprobe=128 nprobe=256 nprobe=512

Some recall numbers:

without OPQ:
loading ./trained_CPU_indexes/bench_cpu_SIFT100M_IMI2x12,PQ16/SIFT100M_IMI2x12,PQ16_populated.index
             R@1    R@10    
nprobe=64    0.3336 0.6466    
nprobe=128   0.3575 0.7211   
nprobe=256   0.3740 0.7784    
nprobe=512   0.3850 0.8201   

with OPQ:
loading ./trained_CPU_indexes/bench_cpu_SIFT100M_OPQ16,IMI2x12,PQ16/SIFT100M_OPQ16,IMI2x12,PQ16_populated.index
             R@1    R@10    
nprobe=64    0.3067 0.6196    
nprobe=128   0.3296 0.6875    
nprobe=256   0.3441 0.7415    
nprobe=512   0.3574 0.7834   

I show IMI2x12 here, but I also experienced the same situation from 2x8 to 2x14...
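
For reading the tables above, here is a minimal sketch of how an R@k figure can be computed. It assumes R@k means the fraction of queries whose ground-truth nearest neighbor appears among the first k returned ids; the benchmark script's exact definition is not shown in this thread, and the function name and toy data below are hypothetical.

```python
def recall_at_k(results, ground_truth, k):
    """Fraction of queries whose true nearest neighbor is in the top-k results.

    results:      list of ranked id lists, one per query (best match first)
    ground_truth: list with the id of the true nearest neighbor of each query
    """
    hits = sum(1 for ids, gt in zip(results, ground_truth) if gt in ids[:k])
    return hits / len(ground_truth)


# Toy data (made up for illustration): 4 queries, 3 returned ids each.
results = [[3, 7, 1], [5, 2, 9], [4, 0, 8], [6, 1, 2]]
ground_truth = [3, 9, 2, 6]

r_at_1 = recall_at_k(results, ground_truth, 1)  # queries 0 and 3 hit at rank 1
r_at_3 = recall_at_k(results, ground_truth, 3)  # query 1 also hits within 3
print(r_at_1, r_at_3)
```

Under this definition R@10 can never be lower than R@1 for the same run, which matches the pattern in both tables.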

mdouze commented 3 years ago

Thanks for this test. It can happen that OPQ is counter-productive with IVF and IMI because it is not directly optimizing the encoding of the PQ (it does not take into account the effect of the coarse quantizer). It is possible to train it properly, I'll make a script to show how.

WenqiJiang commented 3 years ago

Thanks for the kind reply! Is that script available yet? Or are there any similar scripts that I could use as a reference? I'm curious how to do the training properly :)

Best