apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Implement KNN Query [LUCENE-9614] #10654

Closed asfimport closed 3 years ago

asfimport commented 3 years ago

Now we have a vector index format, and one vector indexing/KNN search implementation, but the interface is low-level: you can search across a single segment only. We would like to expose a Query implementation. Initially, we want to support a usage where the KnnVectorQuery selects the k-nearest neighbors without regard to any other constraints, and these can then be filtered as part of an enclosing Boolean or other query.

Later we will want to explore some kind of filtering while performing vector search, or a re-entrant search process that can yield further results. Because of the nature of knn search (all documents having any vector value match), it is more like a ranking than a filtering operation, and it doesn't really make sense to provide an iterator interface that can be merged in the usual way, in docid order, skipping ahead. It's not yet clear how to satisfy a query that is "k nearest neighbors satisfying some arbitrary Query", at least not without realizing a complete bitset for the Query. But this is for a later issue; this issue is just about performing the knn search in isolation, computing a set of (some given) K nearest neighbors, and providing an iterator over those.
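For illustration, the end state we have in mind might look something like the minimal sketch below (the exact KnnVectorQuery constructor shape, the field names, and k are placeholders here, not a settled API):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class KnnQueryUsageSketch {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);

      // Query vector; must have the same dimension as the indexed "vector" field.
      float[] target = {0.1f, 0.2f, 0.3f};

      // Select the k nearest neighbors without regard to any other constraints...
      Query knn = new KnnVectorQuery("vector", target, 10);

      // ...then filter those hits as part of an enclosing BooleanQuery.
      Query filtered = new BooleanQuery.Builder()
          .add(knn, BooleanClause.Occur.MUST)
          .add(new TermQuery(new Term("category", "books")), BooleanClause.Occur.FILTER)
          .build();

      TopDocs hits = searcher.search(filtered, 10);
      System.out.println("total hits: " + hits.totalHits);
    }
  }
}
```

Note that with this approach the filter is applied after the k neighbors are selected, so fewer than k documents may survive; filtering during the vector search itself is the follow-up work described above.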


Migrated from LUCENE-9614 by Michael Sokolov (@msokolov), 1 vote, resolved Aug 13 2021. Pull requests: https://github.com/apache/lucene/pull/235, https://github.com/apache/lucene/pull/244

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I wonder if we should use the Query API at all for nearest-neighbor search. Today the Query API assumes that you can figure out whether a document matches in isolation, regardless of other matches in the index/segment. Maybe we should have a new top-level API on IndexSearcher, something like IndexSearcher#nearestNeighbors(String field, float[] target), which we could later expand into IndexSearcher#nearestNeighbors(String field, float[] target, Query filter) to add support for filtering?
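Roughly, the shape I'm imagining is something like the following; names and signatures are purely illustrative, nothing like this exists today:

```java
import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Purely illustrative sketch of a possible top-level API; not an actual Lucene interface.
interface NearestNeighborSearch {

  /** The k nearest vectors to target in the given field, with no other constraints. */
  TopDocs nearestNeighbors(String field, float[] target, int k) throws IOException;

  /** Possible later extension: only consider documents matching the filter. */
  TopDocs nearestNeighbors(String field, float[] target, int k, Query filter) throws IOException;
}
```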

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Today the Query API assumes that you can figure out whether a document matches in isolation, regardless of other matches in the index/segment

I'm not sure what you mean there - BooleanQuery for example relies on matches from other queries. Do you mean because of Weight.matches? Something else? I guess we could just return MATCH_WITH_NO_TERMS in such cases?

There is a similar API in `FloatPointNearestNeighbor.nearest(IndexSearcher ...)`, although that does not accept a `Query`. Maybe that can be a model.

OTOH it's easy enough to write a query that captures the closest K hits up front and presents them as an iterator, so if someone wants a Query for convenience they can do something like this thing I posted for benchmarking purposes. I don't see the harm in offering such a thing?

We could equally well have a query like KnnVectorQuery(float[] target, int speedAccuracyTradeoff, Query filter)? I'm not sure what the pros and cons of this would be versus the APIs that accept (or are implemented by) IndexSearcher.

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

OK, thought about this a bit, and I guess I see the point a little better. This query is weird because if (say) we were to add some new vectors to the index, suddenly a vector that previously matched might no longer match. I guess I have been thinking of a Query as a convenience for plugging in to the typical scoring / execution framework provided by IndexSearcher. Let me sketch out the use case I have in mind, because I'm not sure how we would handle it in the non-Query implementation(s).

We'd like to be able to blend matches derived from postings (full text search) along with matches derived from vectors, using some kind of scoring function that balances vector scores and text relevance scores. Both kinds of matches also need to satisfy other constraints, embodied in a Query. If we present KNN matches as a Query, I think this can all be done by the Collectors in the usual way, but if we have a different API, say something on IndexSearcher, or a static method on a KNN class, then that blending will require its own custom implementation - I think?
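Concretely, the blending I have in mind might look like the sketch below, assuming a KnnVectorQuery-style query exists; the field names, boosts and k are made up:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class BlendedSearchSketch {

  // Blend "semantic" kNN matches with "keyword" term matches under a shared
  // constraint; the collector then combines the scores in the usual way.
  static TopDocs blendedSearch(IndexSearcher searcher, float[] queryVector) throws Exception {
    Query semantic = new BoostQuery(new KnnVectorQuery("vector", queryVector, 100), 2.0f);
    Query keyword = new TermQuery(new Term("body", "lucene"));
    Query constraint = new TermQuery(new Term("status", "published"));

    Query blended = new BooleanQuery.Builder()
        .add(semantic, BooleanClause.Occur.SHOULD)   // vector-similarity contribution to the score
        .add(keyword, BooleanClause.Occur.SHOULD)    // text-relevance contribution to the score
        .add(constraint, BooleanClause.Occur.FILTER) // must match, but does not affect the score
        .build();

    return searcher.search(blended, 10);
  }
}
```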

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I wonder if the Query could be just a map from N doc IDs to scores, and the KNN search would actually be run to construct the Query, not as part of running the Query. This way we could still blend scores via BooleanQuery or FeatureField, and even things like block-max WAND would still work.

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> I wonder if the Query could be just a map from N doc IDs to scores, and the KNN search would actually be run to construct the Query, not as part of running the Query

I think that captures basically how such a Query would work - one pass over the index to gather documents, then participate in "normal" query processing. I did write such a GlobalDocIdQuery to support this at one point during experimentation. One drawback there is you need a special query-formation process; it wouldn't be enough to run a query parser over an incoming serialized query. We could also consider performing the knnSearch during Query.rewrite() somewhat analogous to how MultiTermQueries work. I guess I'm not sure why we need to be concerned about the issue that the Query's matches are dependent on the index content. Is that an invariant that we rely on anywhere? Caching maybe?

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Yes, caching relies on it, though we always use rewritten queries as cache keys, so I might be over-zealous: what you are considering doesn't look like it would be worse than the rewrite methods we already have that select terms to keep based on global index statistics.

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Doing nn vector search during rewrite has one significant drawback, which is that rewrite() cannot make use of IndexSearcher's executor to perform concurrent searches across segments (or slices), whereas an implementation that does the search in createWeight will naturally get executed concurrently when IndexSearcher is configured for that.

To fix that would require some substantial change to pass an executor to Query.rewrite, which seems kind of overkill at this point. Instead, perhaps we can implement the createWeight version that supports concurrency and define equals(Object) and hashCode() to use object identity in order to prevent spurious caching.
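A minimal sketch of the identity idea (in the real query these would override Query's abstract equals/hashCode; not necessarily what gets committed):

```java
// Sketch only: each instance is its own cache key, so the query cache never treats
// two kNN queries as interchangeable; their results depend on the current index
// content, not just on the query parameters.
final class IdentityKeyedKnnQuerySketch {

  @Override
  public boolean equals(Object other) {
    return this == other;
  }

  @Override
  public int hashCode() {
    return System.identityHashCode(this);
  }
}
```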

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Thinking about how to make the scores commensurate across different indexes for the same query ... in the case of dot product there's no issue: since we assume all vectors are unit length (otherwise the dot-product similarity makes no sense), scores are always between 0 and 1 and there is no need for inversion or normalization. For the Euclidean distance, because we invert the scores to negative in order to sort descending, we need some way to normalize to make them non-negative.

And – it's not really clear at all how to control the range of scores from this query given the typical use case of a boolean query disjunctively combining "semantic" matches from HNSW with "keyword" matches from term queries. Ideally we'd return scores in a fixed range (0 - 1) and let the query writer control the balance between keyword and semantic queries with the boost.

Possibly for these L2-normed queries, we can use something like

score(q, d) = 1 - |q - d| / (|q| + |d|)

Then as d -> 0 or d -> ∞, the score approaches 0, and score = 1 when q = d.

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

UPDATE: we went with

score(q,d) = 1 / (1 + |q - d|)

to normalize "reversed" scores into [0, 1]: thanks for the suggestion, @jtibshirani
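Worked through as plain Java, just to illustrate the formula (a standalone illustration of the expression above, not the Lucene similarity code itself):

```java
final class EuclideanScoreSketch {

  /** score(q, d) = 1 / (1 + |q - d|): distances in [0, +inf) map into (0, 1]. */
  static float score(float[] q, float[] d) {
    double sumOfSquares = 0;
    for (int i = 0; i < q.length; i++) {
      double diff = q[i] - d[i];
      sumOfSquares += diff * diff;
    }
    return (float) (1.0 / (1.0 + Math.sqrt(sumOfSquares))); // exactly 1 when q == d
  }
}
```

A distance of 0 gives a score of 1 and larger distances decay toward 0, so Euclidean scores land on the same fixed scale as the unit-length dot-product scores discussed above.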

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 624560a3d7060a1ae9f9b647b229590f28c4aa24 in lucene's branch refs/heads/main from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene.git;h=624560a

LUCENE-9614: add KnnVectorQuery implementation

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit e48be684b227dd483cc8e6e1a90535dc729f6473 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=e48be68

LUCENE-9614: Prevent TestKnnVectorQuery from using simple text codec (#244)

The simple text codec doesn't support kNN searches, so the test will fail when we randomly chose to use it.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit 29ed3908eae910de51ecba5f9ad9237d1c03e748 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=29ed390

LUCENE-9614: Small fixes to KnnVectorQuery hashCode and toString

asfimport commented 3 years ago

Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

This test reproducibly fails on the main branch:

./gradlew test --tests TestKnnVectorQuery.testRandom -Dtests.seed=3B6CE0E105431E18 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-NF -Dtests.timezone=Europe/Andorra -Dtests.asserts=true -Dtests.file.encoding=UTF-8

 

java.lang.AssertionError: expected:<100> but was:<94> ... at org.apache.lucene.search.TestKnnVectorQuery.testRandom(TestKnnVectorQuery.java:298)

asfimport commented 3 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

@mayya-sharipova I opened #11107 to discuss the issue.

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit abd5ec4ff0b56b1abfc2883e47e75871e60d3cad in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=abd5ec4

LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0 (#413)

When the reader has no live docs, KnnVectorQuery can error out. This happens because IndexReader#numDocs is 0, and we end up passing an illegal value of k = 0 to the search method.

This commit removes the problematic optimization in KnnVectorQuery and replaces it with a lower-level optimization based on the total number of vectors in the segment.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 18f9d31608654f88b2f048c4fd6fbdfedb3b6f95 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=18f9d31

LUCENE-9614: Fix rare TestKnnVectorQuery failures

Some of our checks relied on doc IDs corresponding to the order in which docs were passed to IndexWriter. This is fragile and sometimes resulted in failures. Now we check against an "id" field instead.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit e924d48b6a87ca0e52e66d49cade46371393972a in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=e924d48

LUCENE-9614: Fix rare TestKnnVectorQuery failures

Some of our checks relied on doc IDs corresponding to the order in which docs were passed to IndexWriter. This is fragile and sometimes resulted in failures. Now we check against an "id" field instead.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 22a9e45f096f87bf3f3dbaf866c34b472e1f8da3 in lucene's branch refs/heads/branch_9_1 from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=22a9e45

LUCENE-9614: Fix rare TestKnnVectorQuery failures

Some of our checks relied on doc IDs corresponding to the order in which docs were passed to IndexWriter. This is fragile and sometimes resulted in failures. Now we check against an "id" field instead.