Open benwtrent opened 12 months ago
One thing to consider is that we should test some various graphs to see how many vectors we actually visit.
I suspect its around Math.log(graphSize) * vectorsCollected
. We could use some derivative of this equation to determine if we should use brute force or not.
For testing purposes, since we want to verify graph exploration, etc. all work without having to build huge indices, we might have an "override" to always do HNSW search first.
As pointed out in other issue conversations, Cassandra keeps track of the visited ratio over the lifetime of the index and its searches: https://github.com/apache/cassandra/blob/2f2bb70ccb2657a75abc5aa691cfa28924f98d10/src/java/org/apache/cassandra/index/sai/disk/v1/segment/VectorIndexSegmentSearcher.java#L288
I am not sure if this is a thing we want to do in Lucene directly or not.
Another option is to keep what we are doing (drop out of graph search if we see we are worse than brute-force).
Note, we no longer hit the graph when asking for more docs than are in the segment (even when not using a filter)
https://github.com/apache/lucene/pull/12806
But there is still the open question around improving our brute-force logic when filtering.
Description
Lucene always does an approximate nearest neighbors search when no filter is provided.
This seems like unnecessary work. Some benchmarks would have to be done, but some ideas I had around options to explore:
maxDoc < k
?byte
vsfloat
vectors?It seems weird to go through all the work of going to the graph if there are only 10 documents in a segment.