Handle deletions in nearest vector search [LUCENE-10040]

asfimport commented 2 years ago

Currently nearest vector search doesn't account for deleted documents. Even if a document is not in LeafReader#getLiveDocs, it could still be returned from LeafReader#searchNearestVectors. This seems like it'd be surprising + difficult for users, since other search APIs account for deleted docs. We've discussed extending the search logic to take a parameter like Bits liveDocs. This issue discusses options around adding support.

One approach is to just filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than k docs might come back from your KNN search!

Alternatively, LeafReader#searchNearestVectors could always return the k nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note deleted docs would still be visited/ traversed). The hnswlib library contains an implementation like this, where you can mark documents as deleted and they're skipped during search.

This approach seems reasonable to me, but there are some challenges:

Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors.
As far as I know, there hasn't been academic research or any testing into how well this performs in terms of recall. I have a vague intuition it could be harder to achieve high recall as the algorithm traverses areas further from the "natural" entry points. The HNSW paper doesn't mention deletions/ filtering, and I haven't seen community benchmarks around it.

Background links:

Thoughts on deletions from the author of the HNSW paper: https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892
Blog from Vespa team which mentions combining KNN and search filters (very similar to applying deleted docs): https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/. The "Exact vs Approximate" section shows good performance even when a large percentage of documents are filtered out. The team mentioned to me they didn't have the chance to measure recall, only latency.

Migrated from LUCENE-10040 by Julie Tibshirani (@jtibshirani), resolved Dec 08 2021 Pull requests: https://github.com/apache/lucene/pull/239, https://github.com/apache/lucene/pull/251

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Performance can be unpredictable. If deletions are random, it shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs, and they all happen to be near the query vector. HNSW would need to traverse through around half the entire graph to collect neighbors.

FWIW this is a general problem with Lucene. For instance if you run a term query, we'll use impacts to know which blocks may contain competitive documents, but we could hit a worst-case scenario where the documents that make the block competitive are deleted, and all the non-deleted documents of the block are not competitive.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

This approach (filter out deletions while traversing the graph) makes sense to me. The alternatives I can think of are worse, performance-wise anyway. This will also set us up to filter based on any other criteria (find the K nearest vectors subject to a constraint) that we can express as a Bits covering the docid space. I'm starting to work on a Query interface, and will punt on deletions there, assuming it will get incorporated into the searchNearestVectors

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I opened https://github.com/apache/lucene/pull/239 to show the approach. I ran against some datasets ann-benchmarks (with no deletions) and it showed minimal performance impact. I haven't run performance tests with deletions, but plan to do that too.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 6993fb9a9985372b0f0984b8bdd7434aaa33ad26 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=6993fb9

LUCENE-10040: Handle deletions in nearest vector search (#239)

This PR extends VectorReader#search to take a parameter specifying the live docs. LeafReader#searchNearestVectors then always returns the k nearest undeleted docs.

To implement this, the HNSW algorithm will only add a candidate to the result set if it is a live doc. The graph search still visits and traverses deleted docs as it gathers candidates.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 782c3cca3a55d75673da2650f69d73fdecc88d4d in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=782c3cc

LUCENE-10040: Relax TestKnnVectorQuery#testDeletes assertion (#251)

TestKnnVectorQuery#testDeletes assumes that if there are n total documents, we can perform a kNN search with k=n and retrieve all documents. This isn't true with our implementation – due to randomization we may select less than n entry points and never visit some vectors.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit bc161e6dcc0d200c115992eda43cd7d06074184e in lucene's branch refs/heads/main from Mayya Sharipova https://gitbox.apache.org/repos/asf?p=lucene.git;h=bc161e6

LUCENE-10040 Correct TestHnswGraph.testSearchWithAcceptOrds (#277)

If we set numSeed = 10, this test fails sometimes because it may mark expected results docs (from 0 to 9) as deleted which don't end up being retrieved, resulting in a low recall

set numSeed to 10 to ensure 10 results are returned
add startIndex paramenter to createRandomAcceptOrds that allows documents before startIndex to be NOT deleted
use startIndex equal to 10 for createRandomAcceptOrds

Relates to #239

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I ran some local tests to sanity check the behavior with deletions. First, I ran the glove-100-angular benchmark (described in https://issues.apache.org/jira/browse/LUCENE-9937), with efConst=100, M=32, force-merged to one segment. I randomly deleted 20% of docs and checked the performance. Note: recall is measured against the true nearest neighbors from the non-deleted set (I made sure to recompute these true neighbors). The performance looks good:

NumCands  Recall   DistComps  QPS
**No deletions**
10        0.459     505       4985.346
50        0.724    1445       2396.820
80        0.776    2053       1758.134
100       0.797    2437       1506.489
500       0.912    9232        418.167
800       0.934   13862        277.354

**20% deletions**
10        0.488     576      4587.963
50        0.741    1695      1966.009
80        0.789    2424      1220.389
100       0.810    2883       337.447
500       0.919   11088       221.782
800       0.941   16643       179.779

I also played around with challenging cases, where the deleted documents happen to be the vectors closest to the query. In the datasets I tried, HNSW seems pretty robust in this situation and maintains good recall. Here's a simple test that exercises it: https://github.com/apache/lucene/pull/527

I plan to close this out. It'd be great to add benchmarks with deletions to luceneutil, but that can be tracked separately.

asfimport commented 2 years ago

Xi Caoqiu (migrated from JIRA)

Weaviate have done similar method to implement pre-filter, and the recall from their experiments looks very promising. https://www.semi.technology/developers/weaviate/current/architecture/prefiltering.html#post-filtering-vs-pre-filtering

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Thanks for posting, I found Weaviate's blog helpful as I was thinking through this issue!

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5d39bca87a44f51e5d556bb0a7e8c28df3f539fa in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=5d39bca

LUCENE-10040: Add test for vector search with skewed deletions (#527)

This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 394472d4b8e40504f0521df340df446089a7afff in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=394472d

LUCENE-10040: Add test for vector search with skewed deletions (#527)

This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 9b6d417d1cae01688d73a67070b3760bf893a8e5 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=9b6d417

LUCENE-10040: Update HnswGraph javadoc related to deletions

Previously it claimed the search method did not handle deletions.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 00a7d5f1708a6ce43960d1b8b66f2aebd9568fc6 in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=00a7d5f

LUCENE-10040: Update HnswGraph javadoc related to deletions

Previously it claimed the search method did not handle deletions.

apache / lucene

Handle deletions in nearest vector search [LUCENE-10040] #11078