apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.45k stars 973 forks source link

Add a double addressing vector scorer #13370

Open ChrisHegarty opened 1 month ago

ChrisHegarty commented 1 month ago

This commit adds a method to RandomVectorScorerSupplier that allows to score two vectors based their ordinals.

The existing model of this API first creates a scorer, that effectively binds the ordinal of the first vector, to then score the ordinal of a second vector agains the first. This results in a RandomVectorScorer instance being created each time we want to score against a different vector in the first position. Allowing to score against two given ordinals avoids the creation of a RandomVectorScorer instance, which is likely expensive during graph building.

The new API could seen as a convenience for scorer(ord).score(node), or vice versa, but in fact there can be very different implementation characteristics of these - most notably the avoidance of the creation of a RandomVectorScorer instance.

This PR just updates a few places in the graph building to use the new API. Further analysis can determine where else this API would be beneficial.

ChrisHegarty commented 1 month ago

One thing the Scorer object gave us is caching of the single vector that is used many times.

The underlying Offheap vector objects cache the vector on heap and prevents multiple reads.

Anything that calls the same random vectors twice with different ordinals, and those random vectors are read on heap will suffer significant performance issues as that on heap cache is thrashed on every comparison.

I didn't review fully, but wanted to ensure this performance was a consideration.

Yes, this has been considered.

Since the cache is inside the VectorValues, then this behaviour remains the same - scoring of the same ordinal against the same VectorValues will read the cached value. However, there are cases where this may not be what you want. Creating a separate scorer and binding the initial ordinal can, and in some cases does, create a separate copy of the values. In fact, maybe we should be consistent here - the supplier should always carry separate copies for both the first and second ordinals.

github-actions[bot] commented 3 weeks ago

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!