Closed: asfimport closed 2 years ago
Adrien Grand (@jpountz) (migrated from JIRA)
Performance can be unpredictable. If deletions are random, they shouldn't have a huge effect. But in the worst case, a segment could have 50% deleted docs that all happen to be near the query vector, and HNSW would need to traverse around half of the entire graph to collect enough live neighbors.
FWIW this is a general problem with Lucene. For instance if you run a term query, we'll use impacts to know which blocks may contain competitive documents, but we could hit a worst-case scenario where the documents that make the block competitive are deleted, and all the non-deleted documents of the block are not competitive.
Michael Sokolov (@msokolov) (migrated from JIRA)
This approach (filtering out deletions while traversing the graph) makes sense to me. The alternatives I can think of are worse, at least performance-wise. This will also set us up to filter on any other criterion (find the K nearest vectors subject to a constraint) that we can express as a Bits covering the docid space. I'm starting to work on a Query interface and will punt on deletions there, assuming they will get incorporated into searchNearestVectors.
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
I opened https://github.com/apache/lucene/pull/239 to show the approach. I ran it against some ann-benchmarks datasets (with no deletions) and it showed minimal performance impact. I haven't run performance tests with deletions yet, but plan to do that too.
ASF subversion and git services (migrated from JIRA)
Commit 6993fb9a9985372b0f0984b8bdd7434aaa33ad26 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=6993fb9
LUCENE-10040: Handle deletions in nearest vector search (#239)
This PR extends VectorReader#search to take a parameter specifying the live docs. LeafReader#searchNearestVectors then always returns the k nearest undeleted docs.
To implement this, the HNSW algorithm will only add a candidate to the result set if it is a live doc. The graph search still visits and traverses deleted docs as it gathers candidates.
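The commit above describes the key idea: deleted docs are still visited and expanded so the search can reach their neighbors, but only live docs may enter the result set. Here is a minimal sketch of that strategy over a toy 1-D proximity graph; it is a hypothetical illustration, not Lucene's actual HNSW code, and `FilteredGraphSearch` and its fields are invented names.

```java
import java.util.*;

// Hypothetical sketch of "traverse deleted docs, but never return them".
// Not Lucene's implementation; a toy best-first search over 1-D "vectors".
public class FilteredGraphSearch {
    // node -> neighbors in a small proximity graph
    static int[][] neighbors = {
        {1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}
    };
    static double[] vectors = {0.0, 1.0, 2.0, 3.0, 4.0};

    public static List<Integer> search(double query, int k, BitSet liveDocs, int entryPoint) {
        // candidates ordered by distance to the query
        PriorityQueue<Integer> candidates = new PriorityQueue<>(
            Comparator.comparingDouble(n -> Math.abs(vectors[n] - query)));
        List<Integer> results = new ArrayList<>();
        boolean[] visited = new boolean[vectors.length];
        candidates.add(entryPoint);
        visited[entryPoint] = true;
        while (!candidates.isEmpty() && results.size() < k) {
            int node = candidates.poll();
            // deleted docs are visited and expanded, but never returned
            if (liveDocs.get(node)) {
                results.add(node);
            }
            for (int nb : neighbors[node]) {
                if (!visited[nb]) {
                    visited[nb] = true;
                    candidates.add(nb);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 5);
        live.clear(1); // "delete" node 1, the closest doc to the query
        // node 1 is skipped in the results, but we still reach node 3 through it
        System.out.println(search(1.1, 2, live, 0));
    }
}
```

The important property is that deleting a node never disconnects the graph for search purposes: node 1 is excluded from the results but still serves as a bridge to its neighbors.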
ASF subversion and git services (migrated from JIRA)
Commit 782c3cca3a55d75673da2650f69d73fdecc88d4d in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=782c3cc
LUCENE-10040: Relax TestKnnVectorQuery#testDeletes assertion (#251)
TestKnnVectorQuery#testDeletes assumes that if there are n total documents, we can perform a kNN search with k=n and retrieve all n documents. This isn't true with our implementation: due to randomization we may select fewer than n entry points and never visit some vectors.
ASF subversion and git services (migrated from JIRA)
Commit bc161e6dcc0d200c115992eda43cd7d06074184e in lucene's branch refs/heads/main from Mayya Sharipova https://gitbox.apache.org/repos/asf?p=lucene.git;h=bc161e6
LUCENE-10040 Correct TestHnswGraph.testSearchWithAcceptOrds (#277)
If we set numSeed = 10, this test sometimes fails because it may mark expected result docs (0 through 9) as deleted; those docs then aren't retrieved, resulting in low recall.
Relates to #239
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
I ran some local tests to sanity check the behavior with deletions. First, I ran the glove-100-angular benchmark (described in https://issues.apache.org/jira/browse/LUCENE-9937) with efConst=100 and M=32, force-merged to one segment. I randomly deleted 20% of the docs and checked the performance. Note: recall is measured against the true nearest neighbors from the non-deleted set (I made sure to recompute these true neighbors). The performance looks good:
**No deletions**

| NumCands | Recall | DistComps | QPS |
|---------:|-------:|----------:|---------:|
| 10 | 0.459 | 505 | 4985.346 |
| 50 | 0.724 | 1445 | 2396.820 |
| 80 | 0.776 | 2053 | 1758.134 |
| 100 | 0.797 | 2437 | 1506.489 |
| 500 | 0.912 | 9232 | 418.167 |
| 800 | 0.934 | 13862 | 277.354 |

**20% deletions**

| NumCands | Recall | DistComps | QPS |
|---------:|-------:|----------:|---------:|
| 10 | 0.488 | 576 | 4587.963 |
| 50 | 0.741 | 1695 | 1966.009 |
| 80 | 0.789 | 2424 | 1220.389 |
| 100 | 0.810 | 2883 | 337.447 |
| 500 | 0.919 | 11088 | 221.782 |
| 800 | 0.941 | 16643 | 179.779 |
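As noted above, recall here is measured against the true nearest neighbors recomputed over the non-deleted set, not the original ground truth. A small sketch of that metric (hypothetical helper names, not the benchmark's actual code):

```java
import java.util.*;

// Recall@k: the fraction of the true k nearest *live* neighbors
// that the approximate search actually returned.
public class RecallDemo {
    public static double recall(List<Integer> retrieved, List<Integer> trueLiveNeighbors) {
        Set<Integer> truth = new HashSet<>(trueLiveNeighbors);
        int hits = 0;
        for (int id : retrieved) {
            if (truth.contains(id)) hits++;
        }
        return (double) hits / trueLiveNeighbors.size();
    }

    public static void main(String[] args) {
        // e.g. search returned {3, 7, 9, 12} but the true live neighbors were {3, 7, 9, 15}
        System.out.println(recall(List.of(3, 7, 9, 12), List.of(3, 7, 9, 15))); // 0.75
    }
}
```

Recomputing the truth set matters: measuring against the original (pre-deletion) neighbors would unfairly penalize the search for correctly skipping deleted docs.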
I also played around with challenging cases, where the deleted documents happen to be the vectors closest to the query. In the datasets I tried, HNSW seems pretty robust in this situation and maintains good recall. Here's a simple test that exercises it: https://github.com/apache/lucene/pull/527
I plan to close this out. It'd be great to add benchmarks with deletions to luceneutil, but that can be tracked separately.
Xi Caoqiu (migrated from JIRA)
Weaviate has implemented a similar method for pre-filtering, and the recall from their experiments looks very promising: https://www.semi.technology/developers/weaviate/current/architecture/prefiltering.html#post-filtering-vs-pre-filtering
Julie Tibshirani (@jtibshirani) (migrated from JIRA)
Thanks for posting, I found Weaviate's blog helpful as I was thinking through this issue!
ASF subversion and git services (migrated from JIRA)
Commit 5d39bca87a44f51e5d556bb0a7e8c28df3f539fa in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=5d39bca
LUCENE-10040: Add test for vector search with skewed deletions (#527)
This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.
ASF subversion and git services (migrated from JIRA)
Commit 394472d4b8e40504f0521df340df446089a7afff in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=394472d
LUCENE-10040: Add test for vector search with skewed deletions (#527)
This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.
ASF subversion and git services (migrated from JIRA)
Commit 9b6d417d1cae01688d73a67070b3760bf893a8e5 in lucene's branch refs/heads/main from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=9b6d417
LUCENE-10040: Update HnswGraph javadoc related to deletions
Previously it claimed the search method did not handle deletions.
ASF subversion and git services (migrated from JIRA)
Commit 00a7d5f1708a6ce43960d1b8b66f2aebd9568fc6 in lucene's branch refs/heads/branch_9x from Julie Tibshirani https://gitbox.apache.org/repos/asf?p=lucene.git;h=00a7d5f
LUCENE-10040: Update HnswGraph javadoc related to deletions
Previously it claimed the search method did not handle deletions.
Currently nearest vector search doesn't account for deleted documents. Even if a document is not in LeafReader#getLiveDocs, it could still be returned from LeafReader#searchNearestVectors. This seems like it'd be surprising + difficult for users, since other search APIs account for deleted docs. We've discussed extending the search logic to take a parameter like Bits liveDocs. This issue discusses options around adding support.

One approach is to just filter out deleted docs after running the KNN search. This behavior seems hard to work with as a user: fewer than k docs might come back from your KNN search!

Alternatively, LeafReader#searchNearestVectors could always return the k nearest undeleted docs. To implement this, HNSW could omit deleted docs while assembling its candidate list. It would traverse further into the graph, visiting more nodes to ensure it gathers the required candidates. (Note deleted docs would still be visited/traversed.) The hnswlib library contains an implementation like this, where you can mark documents as deleted and they're skipped during search. This approach seems reasonable to me, but there are some challenges:
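The drawback of the first option (post-filtering) is easy to demonstrate: when the docs nearest the query happen to be deleted, filtering after the search leaves fewer than k results. A toy sketch, with invented names (`PostFilterDemo`, a naive exact kNN standing in for the vector search):

```java
import java.util.*;

// Post-filtering sketch: run kNN ignoring deletions, then drop deleted hits.
// When deletions cluster near the query, fewer than k docs survive.
public class PostFilterDemo {
    static double[] vectors = {0.0, 1.0, 1.1, 1.2, 5.0};

    // naive exact kNN over 1-D "vectors", ignoring deletions
    public static List<Integer> knn(double query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble(i -> Math.abs(vectors[i] - query)));
        return ids.subList(0, k);
    }

    public static List<Integer> postFilter(List<Integer> hits, BitSet liveDocs) {
        List<Integer> out = new ArrayList<>();
        for (int id : hits) {
            if (liveDocs.get(id)) out.add(id);
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 5);
        live.clear(2);
        live.clear(3); // two of the three docs nearest the query are deleted
        List<Integer> hits = postFilter(knn(1.05, 3), live);
        System.out.println(hits); // only one doc survives: fewer than k=3
    }
}
```

The second option avoids this by keeping the search running until k live docs have been gathered, at the cost of visiting more of the graph.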
Background links:
Migrated from LUCENE-10040 by Julie Tibshirani (@jtibshirani), resolved Dec 08 2021 Pull requests: https://github.com/apache/lucene/pull/239, https://github.com/apache/lucene/pull/251