Open asfimport opened 2 years ago
Adrien Grand (@jpountz) (migrated from JIRA)
I mentioned a potential indexing speedup in the issue description because indexing is what I was looking at, but this could also help search a bit since search uses this NeighborQueue too, especially with large values of k.
Doing some minor digging into this. From what I can tell there are two things that use the heap logic:
A radixheap COULD work for the nearest neighbor tracking as it is monotonic. But some initial testing shows that our candidate tracking isn't monotonic. We will pop candidates and add candidates and their changes will not be monotonic. Once we start removing items from the min_heap
, it is only for smaller items, or when we are removing to get the top_k.
Given the comment by @benwtrent is this issue still relevant?
@angadp I think it's more relevant given recent refactors. A radix heap may be possible now.
Thanks! I was looking for an open issue and will check this out.
I see what you were saying earlier now, Ben! So when we add the candidate we add them using outOfOrder
method which is non-montonic and then try to find the least diverse candidate. Hence we don't use a priority queue for maintaining the candidates.
The nearest neighbor tracking is monotonic.
Beginner questions here:
What was the decision behind adding these candidates outOfOrder
?
Can we not try to modify the method to do it inOrder
and then try to find the least diverse candidate to remove?
What was the decision behind adding these candidates
outOfOrder
?
Speed, once we know things are sorted, we know they have been checked for diversity. But with any optimization, measurement is important.
If you can get the same graph results but faster by adjusting how we track diverse candidates & using a radix heap, I say go forth and benchmark :)
Sorry for the late reply! I think my initial thinking was wrong.
I'm currently implementing this and will post the code along with the benchmarks by next week.
Now that @jtibshirani improved merging via #11411, nightly benchmarks report the following two top CPU consumers:
Exploration at #9028 suggested that radix heaps perform better than binary heaps if the heap is large: queries with 16 clauses or more got faster while queries with fewer clauses would get slower. With a default beam width of 100, I'd be interested in seeing how HNSW indexing performs if we replace the current binary LongHeap with a radix heap.
Migrated from LUCENE-10383 by Adrien Grand (@jpountz), updated Jan 22 2022