Hi,
I see that the BLINK codebase supports dense FAISS indexes such as HNSW or a flat one; however, I was wondering whether anybody has experimented with indexes that employ a vector transform (PCA, OPQ) and/or quantization (PQ) and was able to get comparable recall on the retrieved entities.
For example, here are my two runs with dense and quantized indexes:
https://jsitor.com/4O8-J-aFf
and
https://jsitor.com/5LBMpVJwNa
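Roughly, the two setups look like this (a minimal sketch with placeholder data; the embedding dimension and the `nlist`/`M`/`nbits` values are illustrative, not necessarily the ones used in the linked runs):

```python
import numpy as np
import faiss

d = 1024  # bi-encoder embedding dimension (assumed; adjust to the model's output size)

# Dense (exact) index, as in the first run
flat_index = faiss.IndexFlatL2(d)

# Quantized index: OPQ rotation -> IVF coarse quantizer -> PQ codes
# (illustrative parameter values)
nlist, M, nbits = 1024, 64, 8
quant_index = faiss.index_factory(d, f"OPQ{M},IVF{nlist},PQ{M}x{nbits}")

# Stand-in for the candidate-encoder embeddings of all entities
xb = np.random.rand(100_000, d).astype("float32")

quant_index.train(xb)  # IVF/PQ require a training pass; the flat index does not
quant_index.add(xb)
flat_index.add(xb)
```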
I also tried the inner product metric (I made sure to normalize the vectors both when adding them to the index and when searching), but the results were still poor.
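Concretely, the inner-product attempt looked roughly like this (again a sketch with placeholder arrays; `faiss.normalize_L2` normalizes in place so that maximum inner product equals maximum cosine similarity):

```python
import numpy as np
import faiss

d = 1024  # assumed embedding dimension
xb = np.random.rand(100_000, d).astype("float32")  # candidate encodings (placeholder)
xq = np.random.rand(10, d).astype("float32")       # context encodings (placeholder)

# L2-normalize both sides before indexing / searching
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

ip_index = faiss.index_factory(d, "OPQ64,IVF1024,PQ64", faiss.METRIC_INNER_PRODUCT)
ip_index.train(xb)
ip_index.add(xb)
scores, ids = ip_index.search(xq, 100)
```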
You can see that I am able to find the entity `Aristotle` in first position using a dense index, but I am unable to find it in the top 500 positions using a quantized index. I experimented with various `nlist`, `M`, `nbits` and `nprobe` parameters, but to no avail.

I understand that the candidates are encoded with the candidate encoder while the mentions are encoded with the context encoder, and that these are two different models, so they may not produce exactly the same vector for the same mention-context (or title-description) pair. However, why does the dense index succeed in finding the candidate while the quantized one only returns it at the 366th position?
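For reference, this is roughly how I am checking where the gold entity lands while sweeping `nprobe` (a sketch; `gold_id` is a placeholder for Aristotle's FAISS id, and `quant_index`/`xq` come from the sketches above):

```python
import numpy as np
import faiss

# quant_index and xq are the quantized index and query vectors from the sketches above;
# gold_id is a hypothetical FAISS id for the expected candidate.
gold_id = 123
ivf = faiss.extract_index_ivf(quant_index)  # reach the IVF layer behind the OPQ pre-transform

for nprobe in (8, 32, 128, 512):
    ivf.nprobe = nprobe
    scores, ids = quant_index.search(xq[:1], 500)
    hits = np.where(ids[0] == gold_id)[0]
    print(nprobe, int(hits[0]) if len(hits) else "not in top 500")
```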
I cannot feasibly use such a high value of `top_k` and feed that many pairs to the crossencoder.

Thanks!