apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add refinement of quantized vector scores with fp distance calculations #13564

Open jmazanec15 opened 1 month ago

jmazanec15 commented 1 month ago

Description

As quantization techniques compress in-memory vectors further and further, enough information is lost that recall drops. However, with the current quantization support we already store the full-precision float vector values. With those, we could create a two-phase search process that oversamples candidates from the quantized index (i.e. r*k results), then lazily loads the full-precision vectors from disk for those r*k results and re-scores them. This has been discussed directly or indirectly in a couple of different places, but I figured it'd make sense to create a separate issue for it.
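
A minimal sketch of what that two-phase flow could look like (illustrative only, not a proposed API: the class and method names are made up, the oversampling factor r is just passed in by the caller, and it assumes the 9.x-style FloatVectorValues iterator access):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

/** Sketch: oversample r*k hits from the quantized index, then re-score with full precision. */
final class OversampleAndRescore {

  static ScoreDoc[] search(IndexSearcher searcher, String field, float[] query,
                           int k, int oversample, VectorSimilarityFunction sim) throws IOException {
    // Phase 1: approximate search over the quantized index, oversampled to r*k hits.
    int rk = k * oversample;
    TopDocs approx = searcher.search(new KnnFloatVectorQuery(field, query, rk), rk);

    // Phase 2: lazily load the full-precision vectors for the r*k hits and re-score them.
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    ScoreDoc[] rescored = new ScoreDoc[approx.scoreDocs.length];
    for (int i = 0; i < approx.scoreDocs.length; i++) {
      ScoreDoc hit = approx.scoreDocs[i];
      LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
      FloatVectorValues vectors = leaf.reader().getFloatVectorValues(field);
      // Position on the segment-local doc id and read the fp32 vector from disk.
      vectors.advance(hit.doc - leaf.docBase);
      float exactScore = sim.compare(query, vectors.vectorValue());
      rescored[i] = new ScoreDoc(hit.doc, exactScore);
    }

    // Reduce the r*k candidates back to the top k by the refined scores.
    Arrays.sort(rescored, Comparator.comparingDouble((ScoreDoc d) -> d.score).reversed());
    return Arrays.copyOfRange(rescored, 0, Math.min(k, rescored.length));
  }
}
```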

Re-scoring has been shown to work with binary compression techniques and some PQ in a few different places (https://medium.com/qdrant/hyperfast-vector-search-with-binary-quantization-865052c5c259, https://huggingface.co/blog/embedding-quantization#binary-rescoring). We have also run some experiments in OpenSearch with IVFPQ and re-scoring, and it has shown pretty strong promise (assuming that disk IOPS/throughput are strong enough - see https://github.com/opensearch-project/k-NN/issues/1779#user-content-appendix-b-baseline-rescore-experiments).

That being said, this approach can provide benefits similar to the DiskANN approach.

One challenge I foresee is that the approach requires the quantized ANN index to be resident in memory. However, I am unsure whether loading the full-precision vectors from disk will cause the page cache to evict the quantized ANN index, or whether the page cache will naturally adapt to the access pattern. If it is the former, we would probably need some way to pin the quantized ANN index and its vectors in memory.

In the future, it could probably be fine-tuned quite a bit with access-pattern optimizations and/or some kind of hierarchical quantization and re-scoring.

benwtrent commented 1 month ago

I think the API would be tricky, but I am all for this idea. The ability to "approximately score" and then do a second pass to "exact score" is useful for all sorts of levels of quantization.

Whatever the design, it would be most efficient to first gather the nearest vectors from ALL segments with an approximate score, and then do a second pass over all segments to refine the top k.

Rescoring per segment would be needlessly inefficient.

jmazanec15 commented 1 month ago

I think the API would be tricky, but I am all for this idea

Yes, agreed. I'll think on this a little bit. I'll start with a PoC and go from there.

Whatever the design, it would be most efficient to first gather the nearest vectors from ALL segments with an approximate score, and then do a second pass over all segments to refine the top k.

Rescoring per segment would be needlessly inefficient.

Yes, I agree. However, with this, I think we would need to refine not the top k but the top r*k, and then reduce to k. Otherwise, I don't think recall would actually improve - only the ordering might get better.

benwtrent commented 1 month ago

@jmazanec15

I think we would need to refine not the top k but the top r*k and then reduce to k

100%, I wasn't clear. Yes, over all segments, we gather approximate results that total r*k when collapsed together, and then refine & rescore that total r*k.
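
To make that ordering concrete, a sketch of the global refine pass could look like the following (purely illustrative: the Candidate/GlobalRefine names and the group-by-leaf strategy are assumptions, and it again uses the 9.x-style FloatVectorValues iterator; only the globally collapsed r*k candidates ever touch the fp32 vectors):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.ScoreDoc;

/** Sketch: refine a globally-collapsed candidate set, then reduce it to the top k. */
final class GlobalRefine {

  /** A phase-1 candidate: its leaf, segment-local doc id, and approximate score. */
  record Candidate(LeafReaderContext leaf, int segmentDoc, float approxScore) {}

  static List<ScoreDoc> refine(List<Candidate> mergedTopRk, String field, float[] query,
                               int k, VectorSimilarityFunction sim) throws IOException {
    // Group by leaf so each segment's fp32 vectors are read in one pass, in doc-id order.
    mergedTopRk.sort(Comparator
        .comparingInt((Candidate c) -> c.leaf().ord)
        .thenComparingInt(Candidate::segmentDoc));

    List<ScoreDoc> rescored = new ArrayList<>(mergedTopRk.size());
    LeafReaderContext currentLeaf = null;
    FloatVectorValues vectors = null;
    for (Candidate c : mergedTopRk) {
      if (c.leaf() != currentLeaf) {
        currentLeaf = c.leaf();
        vectors = currentLeaf.reader().getFloatVectorValues(field);
      }
      // Doc ids are increasing within a leaf after the sort, so a forward-only iterator works.
      vectors.advance(c.segmentDoc());
      float exact = sim.compare(query, vectors.vectorValue());
      rescored.add(new ScoreDoc(c.leaf().docBase + c.segmentDoc(), exact));
    }

    // Reduce the refined r*k back down to the top k.
    rescored.sort(Comparator.comparingDouble((ScoreDoc d) -> d.score).reversed());
    return rescored.subList(0, Math.min(k, rescored.size()));
  }
}
```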

jmazanec15 commented 1 month ago

Makes sense, thanks @benwtrent. I'm working on a PoC and some experiments. I didn't realize that the full-precision vectors for a quantized index are exposed via getFloatVectorValues. That should make things simpler.

I was also looking at this comment on DiskANN from @kevindrosendahl (https://github.com/apache/lucene/issues/12615#issuecomment-1868095892) and noticed that mlock'ing parts of the index in memory became really important for performance in low-memory environments. I don't think this is currently supported in Lucene, but I will take a look.
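
For reference, outside of anything Lucene exposes today, pinning a mapped file usually comes down to the POSIX mlock call. A rough, Linux-only sketch using the Java 22 foreign function API (purely illustrative, not an existing Lucene facility; the MlockExample name and error handling are made up) might look like:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: pin a memory-mapped file in RAM via POSIX mlock (Linux, Java 22 FFM API). */
final class MlockExample {

  // int mlock(const void *addr, size_t len);
  private static final MethodHandle MLOCK = Linker.nativeLinker().downcallHandle(
      Linker.nativeLinker().defaultLookup().find("mlock").orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));

  static void pin(Path indexFile) throws Throwable {
    try (Arena arena = Arena.ofConfined();
         FileChannel channel = FileChannel.open(indexFile, StandardOpenOption.READ)) {
      // Map the file and ask the kernel to keep its pages resident.
      MemorySegment mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
      int rc = (int) MLOCK.invokeExact(mapped, mapped.byteSize());
      if (rc != 0) {
        throw new IllegalStateException("mlock failed (check RLIMIT_MEMLOCK)");
      }
      // ... search while pinned; the pages are unlocked when the mapping is closed ...
    }
  }
}
```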