apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Allow KnnVectorQuery to operate over a subset of liveDocs [LUCENE-10382] #11418

Closed: asfimport closed this issue 2 years ago

asfimport commented 2 years ago

Currently the KnnVectorQuery selects the top K vectors from all live docs. This ticket will change the interface to make it possible for the top K vectors to be selected from a subset of the live docs.


Migrated from LUCENE-10382 by Joel Bernstein (@joel-bernstein), resolved Mar 22 2022.

Pull requests: https://github.com/apache/lucene/pull/618, https://github.com/apache/lucene/pull/656

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

How would this look? An easy first step is to add a filter parameter to KnnVectorQuery

  public KnnVectorQuery(String field, float[] target, int k, Bits filter)

then it can call LeafReader.searchNearestVectors with liveDocs.intersect(filter) instead of liveDocs.
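For reference, Bits in Lucene is just a get(int)/length() interface, so an intersect method like the one above would be new API. Here is a minimal standalone sketch of that intersection; the Bits interface below is a local stand-in mirroring org.apache.lucene.util.Bits, and intersect(...) is hypothetical, not existing API:

```java
// Standalone sketch: combining liveDocs with a filter, as in the suggested
// liveDocs.intersect(filter). Bits is a local stand-in for
// org.apache.lucene.util.Bits; intersect(...) is hypothetical API.
public class BitsIntersection {

    /** Minimal stand-in for org.apache.lucene.util.Bits. */
    public interface Bits {
        boolean get(int index);
        int length();
    }

    /** A doc is accepted only if it is live AND matches the filter. */
    public static Bits intersect(Bits liveDocs, Bits filter) {
        return new Bits() {
            @Override
            public boolean get(int index) {
                return liveDocs.get(index) && filter.get(index);
            }

            @Override
            public int length() {
                return Math.min(liveDocs.length(), filter.length());
            }
        };
    }

    /** Convenience wrapper for demonstration: Bits over a boolean array. */
    public static Bits of(boolean[] raw) {
        return new Bits() {
            @Override
            public boolean get(int index) {
                return raw[index];
            }

            @Override
            public int length() {
                return raw.length;
            }
        };
    }
}
```

Note the view is lazy: no combined bit set is materialized, each get(...) consults both sources, which matters once the filter's cardinality is needed up front.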

@jtibshirani shared on the list a link to a paper showing how the search degenerates for highly selective filters. The authors' approach was to fall back to "brute force" KNN when selectivity passes a fixed threshold. We could do that too, and it makes sense to me, but I guess the question is: where should this fallback happen in the API?

The implementation of full (non-approximate) KNN (with a filter) only needs the VectorValues iterator, which the KnnVectorsReader already provides. It could be implemented as part of KnnVectorQuery. Is there a better place?

Hmm, previously @jpountz had suggested this API:

   searchNearestNeighbors(String field, float[] target, Query filter)

taking a Query rather than a Bits as a filter. I guess that is more high-level, but in my mind a typical use case would be to use this with a precomputed filter (as a pre-filter) and then post-filter by embedding this KnnVectorQuery in another BooleanQuery. And given that we have to compute a full bitset in advance anyway, exposing a Query interface seems a bit like over-promising. It's clearer to just provide Bits, I think?

Another open question is how to determine the threshold for cutting over to full KNN, and whether that will be user configurable at all. Ideally we can just pick a percent coverage and make it fixed.

Hmm I just realized another possible concern is that the vectors themselves may not be dense in the documents, and that will impact the coverage of the filter bits. So to get an accurate coverage number, we'd in theory have to fully intersect the KnnVector bits (which docs have vectors in the graph) with the filter and the liveDocs, and compare the cardinality with that of the graph. Although maybe an approximation here is enough - intersect a subset of the bits to estimate the total coverage?
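The sampling approximation floated at the end there can be sketched in isolation. In this sketch, boolean arrays stand in for the three bit sets (docs with vectors, the filter, liveDocs), and the sample count and seed are illustrative choices, not anything from the actual implementation:

```java
import java.util.Random;

// Sketch of the approximation above: estimate the filter's coverage of docs
// that actually have vectors by sampling positions, instead of fully
// intersecting the bit sets. Boolean arrays stand in for the bit sets.
public class CoverageEstimate {

    public static double estimateCoverage(
            boolean[] hasVector, boolean[] filter, boolean[] liveDocs, int samples, long seed) {
        Random rnd = new Random(seed);
        int withVector = 0;
        int covered = 0;
        for (int i = 0; i < samples; i++) {
            int doc = rnd.nextInt(hasVector.length);
            if (hasVector[doc]) { // only docs that are in the graph count
                withVector++;
                if (filter[doc] && liveDocs[doc]) {
                    covered++;
                }
            }
        }
        // Fraction of sampled graph docs that survive both the filter and liveDocs.
        return withVector == 0 ? 0.0 : (double) covered / withVector;
    }
}
```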

asfimport commented 2 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

I thought about passing in Bits, but I don't think it will work because searchLeaf operates at the segment level, so we'd need segment-level Bits.

We could pass in a Query and collect the segment-level bits. It would be easy to implement as well.

KnnVectorQuery(String field, float[] target, int k, Query query)
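A standalone sketch of the per-segment side of such a constructor: materialize the filter's matches for one leaf into a bit set, also recording the cardinality (the analogue of Scorer#cost, which matters for the fallback decision discussed below). DocIdIterator here is a stand-in for Lucene's DocIdSetIterator obtained via Weight#scorer, and boolean[] stands in for FixedBitSet:

```java
// Standalone sketch: materializing a filter Query's matches for one segment.
// DocIdIterator stands in for Lucene's DocIdSetIterator (from Weight#scorer
// on the leaf); boolean[] stands in for FixedBitSet.
public class SegmentFilterBits {

    /** Stand-in for a scorer's doc id iterator: increasing ids, -1 when exhausted. */
    public interface DocIdIterator {
        int nextDoc();
    }

    /** Per-segment accept bits plus their cardinality. */
    public static final class FilterBits {
        public final boolean[] bits;
        public final int cardinality;

        FilterBits(boolean[] bits, int cardinality) {
            this.bits = bits;
            this.cardinality = cardinality;
        }
    }

    /** Collect all matching docs of one leaf into a bit set, counting as we go. */
    public static FilterBits collect(DocIdIterator scorer, int maxDoc) {
        boolean[] bits = new boolean[maxDoc];
        int count = 0;
        for (int doc = scorer.nextDoc(); doc != -1; doc = scorer.nextDoc()) {
            bits[doc] = true;
            count++;
        }
        return new FilterBits(bits, count);
    }
}
```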

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> I thought about passing in Bits but I don't think it will work because searchLeaf is at the segment level so we'd need segment level Bits.

Oh, good point!

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

We have queries like ParentChildrenBlockJoinQuery that take a BitSetProducer so that the query doesn't have to care about producing bit sets from queries, it's not its responsibility. In this case though, I think the decision should happen in Lucene, since it hopefully has more data to make the right decision than users have (e.g. Scorer#cost). If users knew in advance what filters they would like to apply, then they should split their indexes based on these filters instead of passing filters to Lucene.

I wonder if we could develop a cost model of both approaches assuming a random distribution of deletions, filter matches and vector values, so that the query could compute the cost in both cases for specific values of k, Scorer#cost and LeafReader#numDeletedDocs (and maybe some HNSW-specific parameters like beamWidth?), and pick the approach that has the lesser cost.

LRUQueryCache already happens to cache dense filters (cost > maxdoc / 100) as bit sets, which helps with conjunctions for instance, so maybe we would be able to reuse it as a way to avoid recomputing bitsets over and over again for popular filters.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

If we go with a Query-based filter, I guess it would still be possible to create a query wrapping a BitSetProducer (like TPBJQ), so it's not as if it's a hard decision preventing a customer providing a precomputed bitset - I think?

Also, relying on the cache makes sense to me, but I have some reservations. One issue we've found is that because it caches entire Query results, it can often miss significant caching opportunities, say when a complex BooleanQuery has a subset of clauses that can profitably be cached. Maybe the Query-writer can structure their queries to be more cache-friendly by nesting BQs? But then again they get rewritten and may be flattened prior to the cache seeing them.

Anyway, maybe we can enhance the cache, but this is a separate issue; +1 to move ahead using Query.

asfimport commented 2 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

I think Query makes sense as well. I'm a little fuzzy on the cost computation being discussed. Is this about the decision to do the ANN or fully materialized KNN? Or is there another cost being discussed that deals with the query being passed in as a filter?

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> I'm a little fuzzy on the cost computation being discussed. Is this about the decision to do the ANN or fully materialized KNN?

Yes. I wouldn't worry about that at first though. Maybe we can do it in three steps, something like this:

  1. implement Query-based filter, always using HNSW search that we have today. It would have to be marked with some serious caveats about potential performance risk, but we should make progress somehow without insisting on the full implementation at once. Perhaps we can just document the risk, mark as experimental in javadoc?
  2. implement full KNN fallback with a fixed cutoff (based on Query cost?)
  3. implement an adaptive cost computation

Also, maybe we're overthinking step 3 and it's simpler than we think, or not really needed?

asfimport commented 2 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

I put up a PR with a Query as the filter. No tests yet, just a first look at a possible impl.

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

@msokolov your multi-step approach makes sense to me. Maybe we could move forward with @joel-bernstein's PR as a way to get the query API down. I think we should introduce the fallback before releasing the change though. Otherwise KnnVectorQuery could easily degrade to brute force performance (or worse!), which could really catch users by surprise.

I've been pondering how to automatically choose when to fall back. It indeed seems best to compare the cost of HNSW vs. an exact scan, instead of choosing an arbitrary cut-off like "filter matches less than 10% of docs". Since the cost trade-off depends on data size and other factors, it doesn't seem possible to find a constant cutoff that works well across situations. To make it practical we might need to expose it as a parameter, which is not very user-friendly. So thinking about @jpountz's suggestion for a cost model...

I wonder if we could develop a cost model of both approaches assuming a random distribution of deletions, filter matches and vector values, so that the query could compute the cost in both cases for specific values of k, Scorer#cost and LeafReader#numDeletedDocs (and maybe some HNSW-specific parameters like beamWidth?), and pick the approach that has the lesser cost.

I played around with this, looking at HNSW latencies for various filter selectivities. It roughly scales with log(n) * 1/p, where p is the filter selectivity. (This sort of makes sense: HNSW is roughly logarithmic, but it needs to search more and more of the graph as nodes are filtered away.) But to compare to brute force, we need a pretty good handle on the scaling constants. These are really hard to pin down – they depend on the HNSW build parameters, the search parameter k, even properties of the dataset. Looking at the HNSW paper, it gives big-O complexity (under some theoretical assumptions) and doesn't show the impact of beamWidth or maxConn.
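A toy version of that comparison, purely to show the shape of the trade-off. The hnswConstant parameter bundles exactly the hard-to-calibrate factors mentioned above (beamWidth, maxConn, k, dataset properties); any value given to it is an assumption, not a measured number:

```java
// Toy cost model for the discussion above: HNSW latency scales roughly with
// log(n) / p (p = filter selectivity), brute force with p * n comparisons.
// hnswConstant is an assumed, uncalibrated scaling factor.
public class CostModel {

    /** p = filter selectivity in (0, 1]; n = number of indexed vectors. */
    public static boolean preferHnsw(int n, double p, double hnswConstant) {
        double hnswCost = hnswConstant * Math.log(n) / p; // ~ log(n) / p graph visits
        double bruteCost = p * n;                         // exact scan over filter matches
        return hnswCost < bruteCost;
    }
}
```

With n = 1M and a generous constant, the model prefers HNSW at 50% selectivity but flips to brute force at 0.1%, which matches the intuition that very selective filters should bypass the graph.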

Instead of developing a static cost model, I wonder if we could try to make the choice dynamically:

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 on figuring out a better execution path before the release; it's going to look bad if setting a filter could make the query perform many times slower than a linear scan.

I like the adaptive idea that would bound the overall cost at 2x the cost of a linear scan.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> +1 on figuring out a better execution path before the release, it's going to look bad if setting a filter could make the query perform many times slower than a linear scan

This is fair, but by "release" do we mean – commit to lucene/main? I think so, because a 9.1 or later release could be cut from that at any time. So ... let's do the work on a feature branch to enable iterating to get it nailed down.

asfimport commented 2 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

I'll keep working on the patch, add some tests, and address the suggested changes. We can add the execution plan logic as it becomes clearer.

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

What do you think about breaking it into two steps? These seem okay to ship on their own.

  1. Joel's PR, adding in a very simple fallback strategy. In the query we could check if the bit set would exclude more than 85% of documents, and if so, use an exact scan instead. Based on my experiments with random filters, 85% is conservative, and we're unlikely to see a bad degradation at that point. In the worst case, we do an exact scan when we didn't need to and check 15% of documents. We could document caveats like Mike mentions.
  2. Switch from a static check to a more robust one (maybe adaptive). I have some ideas here I'm excited to try out :)
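Step 1's static check is tiny. A sketch, using the 85% threshold from the discussion (the class and method names are illustrative):

```java
// Sketch of the static fallback check proposed for step 1: if the filter would
// exclude more than 85% of documents, skip HNSW and run the exact scan.
public class StaticCutoff {

    /** Fraction of docs the filter may exclude before falling back (85% per the discussion). */
    public static final double MAX_EXCLUDED = 0.85;

    public static boolean useExactScan(int filterCardinality, int numDocs) {
        double excluded = 1.0 - (double) filterCardinality / numDocs;
        return excluded > MAX_EXCLUDED;
    }
}
```
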

asfimport commented 2 years ago

Joel Bernstein (@joel-bernstein) (migrated from JIRA)

I'm going to start the brute force implementation inside of KnnVectorQuery soon. My plan is to advance through the VectorValues using the bitsFilter and score the vectors with the KnnVectorField.vectorSimilarityFunction. If there is code available that does this already let me know.

Adding a brute force flag to KnnVectorQuery might be a good idea as well.
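The plan above (advance through the vectors, skip docs the filter rejects, keep the k best by similarity) can be sketched standalone. In this sketch, plain arrays replace VectorValues, and dot product stands in for the field's VectorSimilarityFunction; none of it is the actual implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Standalone sketch of the exact (brute force) scan: iterate all vectors,
// skip docs the filter rejects, keep the k best by similarity. Arrays replace
// VectorValues; dot product stands in for VectorSimilarityFunction.
public class BruteForceKnn {

    static float dot(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Returns the doc ids of the top-k accepted vectors, best first. */
    public static List<Integer> search(float[][] vectors, boolean[] acceptBits, float[] target, int k) {
        // Min-heap of {score, doc}: the worst of the current top-k sits at the head.
        PriorityQueue<float[]> heap = new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[0]));
        for (int doc = 0; doc < vectors.length; doc++) {
            if (!acceptBits[doc]) {
                continue; // filtered out, never scored
            }
            float score = dot(vectors[doc], target);
            if (heap.size() < k) {
                heap.offer(new float[] {score, doc});
            } else if (score > heap.peek()[0]) {
                heap.poll();
                heap.offer(new float[] {score, doc});
            }
        }
        List<Integer> topDocs = new ArrayList<>();
        while (!heap.isEmpty()) {
            topDocs.add(0, (int) heap.poll()[1]); // drain ascending, prepend for best-first
        }
        return topDocs;
    }
}
```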

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I had some time to try out the dynamic check I mentioned, and it seems to work. I opened a PR here that builds off Joel's change: https://github.com/apache/lucene/pull/656. It's a draft because there are still some big open API questions. Looking forward to hearing your feedback!

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 8ca372573dba0f4755b982b0c36a2b87aaf4705b in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=8ca3725

LUCENE-10382: Support filtering in KnnVectorQuery (#656)

This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein <jbernste@apache.org>
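The fallback described in this commit message reduces to a small control flow: give HNSW a visit budget equal to the filter's cardinality, and run the exact scan only if that budget is exhausted. A sketch, where the HnswSearch interface is a stand-in rather than Lucene's API:

```java
// Sketch of the fallback control flow from the commit message: either HNSW
// finishes within filterCardinality visited docs, or we pay filterCardinality
// again for the exact scan, for roughly a 2x bound overall.
// HnswSearch is a stand-in, not Lucene's API.
public class FilteredKnnStrategy {

    /** Returns true if the graph search completed within the given visit budget. */
    public interface HnswSearch {
        boolean searchWithinLimit(int visitLimit);
    }

    public static String run(HnswSearch hnsw, int filterCardinality) {
        if (hnsw.searchWithinLimit(filterCardinality)) {
            return "hnsw";
        }
        return "exact"; // exact scan visits exactly the filter's matches
    }
}
```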

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit af40b448227e07e93d12c62f9dcf083b92f6eb51 in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=af40b44

LUCENE-10382: Support filtering in KnnVectorQuery (#656)

This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better.

Co-authored-by: Joel Bernstein <jbernste@apache.org>

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit b40a750aa8c0cc05291d8d8673d9d068d078d2de in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=b40a750

LUCENE-10382: Ensure kNN filtering works with other codecs (#700)

The original PR that added kNN filtering support overlooked non-default codecs. This follow-up ensures that other codecs work with the new filtering logic.

This PR also clarifies the limit checking logic for Lucene91HnswVectorsReader. Now we always check the limit before visiting a new node, whereas before we only checked it in an outer loop.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 29d4adfe60368c0159cd0accd53efba77ca11771 in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=29d4adf

LUCENE-10382: Ensure kNN filtering works with other codecs (#700)

The original PR that added kNN filtering support overlooked non-default codecs. This follow-up ensures that other codecs work with the new filtering logic.

This PR also clarifies the limit checking logic for Lucene91HnswVectorsReader. Now we always check the limit before visiting a new node, whereas before we only checked it in an outer loop.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit d9c2e46824c8b5be8f471da6ce291e908cc58955 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=d9c2e46

LUCENE-10382: Fix testSearchWithVisitedLimit failures

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit a3b136573fcb2a1e61dd70519708a5ef36d20eb8 in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=a3b1365

LUCENE-10382: Fix testSearchWithVisitedLimit failures

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit d47ff38d703c6b5da1ef9c774ccda201fd682b8d in lucene's branch refs/heads/main from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=d47ff38

LUCENE-10382: Use IndexReaderContext#id to check reader identity. (#702)

KnnVectorQuery currently uses the index reader's hashcode to make sure that the query it builds runs on the right reader. We had added IndexReaderContext#id a while back for a similar purpose with TermStates, let's reuse it?

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit d952b3a58114ce5a929211bca7a9b0e822658f35 in lucene's branch refs/heads/branch_9x from Adrien Grand https://gitbox.apache.org/repos/asf?p=lucene.git;h=d952b3a

LUCENE-10382: Use IndexReaderContext#id to check reader identity. (#702)

KnnVectorQuery currently uses the index reader's hashcode to make sure that the query it builds runs on the right reader. We had added IndexReaderContext#id a while back for a similar purpose with TermStates, let's reuse it?

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Close after 9.1.0 release.