Re-explore the logic around when Vector search should be Exact

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.51k stars 995 forks source link

Re-explore the logic around when Vector search should be Exact #12505

Open benwtrent opened 12 months ago

benwtrent commented 12 months ago

Description

Lucene always does an approximate nearest neighbors search when no filter is provided.

This seems like unnecessary work. Some benchmarks would have to be done, but some ideas I had around options to explore:

Why not always do exact when maxDoc < k?
Should the "when to do exact" calculation consider byte vs float vectors?

It seems weird to go through all the work of going to the graph if there are only 10 documents in a segment.

benwtrent commented 9 months ago

One thing to consider is that we should test some various graphs to see how many vectors we actually visit.

I suspect its around Math.log(graphSize) * vectorsCollected. We could use some derivative of this equation to determine if we should use brute force or not.

For testing purposes, since we want to verify graph exploration, etc. all work without having to build huge indices, we might have an "override" to always do HNSW search first.

benwtrent commented 6 months ago

As pointed out in other issue conversations, Cassandra keeps track of the visited ratio over the lifetime of the index and its searches: https://github.com/apache/cassandra/blob/2f2bb70ccb2657a75abc5aa691cfa28924f98d10/src/java/org/apache/cassandra/index/sai/disk/v1/segment/VectorIndexSegmentSearcher.java#L288

I am not sure if this is a thing we want to do in Lucene directly or not.

Another option is to keep what we are doing (drop out of graph search if we see we are worse than brute-force).

benwtrent commented 6 months ago

Note, we no longer hit the graph when asking for more docs than are in the segment (even when not using a filter)

https://github.com/apache/lucene/pull/12806

But there is still the open question around improving our brute-force logic when filtering.