apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Init HNSW merge with graph containing deleted documents #12533

Open zhaih opened 1 year ago

zhaih commented 1 year ago

Description

Currently, when merging HNSW graphs we are able to start from an existing graph rather than inserting all nodes from scratch, thanks to #12050. But we have a constraint that the init graph cannot be one that has deletions, since we do not have a good way to handle deletions within a graph without reconstructing it. This makes the nice optimization a bit wasteful, because in reality the bigger the segment, the less likely it is to have no deletions, especially in a hybrid text-embedding index.

Opening this issue to discuss how we could solve this problem or work around it. My very naive ideas:

  1. We could remove the deleted node from the graph, fully connect all of its neighbors to each other, and then run a diversity check on those neighbors to remove the extra links. In case the deleted node is an entry node, we can insert the deleted node's closest neighbor up to the level where the deleted node existed.
  2. We could tolerate a certain amount of deletions (say 10% ~ 20%) inside the HNSW graph and just use the deleted nodes as connections.

I personally like 1 better, since 2 requires changing the search behavior accordingly and sounds a bit fragile. Any other ideas are welcome!
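For concreteness, here is a minimal, self-contained sketch of what option 1 could look like on a toy single-layer graph. This is not Lucene's HnswGraph API; the adjacency map, `maxConn`, and the `diversityPrune` heuristic are illustrative assumptions, and a real implementation would operate per level and reuse Lucene's existing neighbor-selection code.

```java
import java.util.*;

// A minimal sketch of option 1 on a toy single-layer adjacency map. This is
// not Lucene's HnswGraph API; the field names, removeAndPatch(), and
// diversityPrune() are illustrative assumptions only.
class DeletePatchSketch {
  final Map<Integer, List<Integer>> neighbors; // node -> mutable neighbor list
  final float[][] vectors;                     // node -> vector
  final int maxConn;                           // max links to keep per node

  DeletePatchSketch(Map<Integer, List<Integer>> neighbors, float[][] vectors, int maxConn) {
    this.neighbors = neighbors;
    this.vectors = vectors;
    this.maxConn = maxConn;
  }

  /** Remove a deleted node and patch its former neighborhood. */
  void removeAndPatch(int deleted) {
    List<Integer> hood = neighbors.remove(deleted);
    if (hood == null) return;
    for (int a : hood) {
      List<Integer> adj = new ArrayList<>(neighbors.get(a));
      adj.remove(Integer.valueOf(deleted));
      // "Fully connect" the deleted node's neighborhood: offer each former
      // neighbor a link to all of the others...
      for (int b : hood) {
        if (b != a && !adj.contains(b)) adj.add(b);
      }
      // ...then run a diversity check to trim back down to maxConn links.
      neighbors.put(a, diversityPrune(a, adj));
    }
  }

  // The usual HNSW-style heuristic: keep a candidate only if it is closer to
  // the base node than to every neighbor kept so far.
  private List<Integer> diversityPrune(int base, List<Integer> candidates) {
    candidates.sort(Comparator.comparingDouble(c -> distance(base, c)));
    List<Integer> kept = new ArrayList<>();
    for (int c : candidates) {
      boolean diverse = true;
      for (int k : kept) {
        if (distance(c, k) < distance(base, c)) {
          diverse = false;
          break;
        }
      }
      if (diverse) {
        kept.add(c);
        if (kept.size() >= maxConn) break;
      }
    }
    return kept;
  }

  private double distance(int a, int b) {
    double sum = 0;
    for (int i = 0; i < vectors[a].length; i++) {
      double d = vectors[a][i] - vectors[b][i];
      sum += d * d;
    }
    return sum;
  }
}
```

A real version would need to repeat this patching on every level the deleted node appears in, and handle the entry-node case described above separately.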

mbrette commented 1 year ago

Approach (2) may not be that much of a change. Let's say you mark those nodes as "tombstones" so that we know they are deleted in the graph itself. Exploring the graph with those tombstones would be the same as what we do today with the acceptOrds filter: excluding those nodes from the results, but keeping them in the candidate list so we can still explore their neighbors.

The problem with that approach is that if you have a large ratio of deleted documents, the graph becomes bloated. It could be part of a hybrid strategy where we keep those tombstones until we reach a threshold, after which we need to remove them - if we can - or rebuild the graph on the next merge.
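To make that traversal contract concrete, here is a sketch of a tombstone-aware single-layer search, assuming a toy adjacency-map graph rather than Lucene's HnswGraphSearcher; the `isLive` predicate stands in for whatever liveness/tombstone check a real implementation would use.

```java
import java.util.*;
import java.util.function.IntPredicate;

// Sketch of tombstone-aware search on one layer of a toy graph: tombstoned
// nodes are still traversed so their neighbors stay reachable, but they never
// enter the result set -- the same contract the acceptOrds filter gives us
// today. The Map-based graph and the isLive predicate are illustrative, not
// Lucene's API.
class TombstoneSearchSketch {
  static List<Integer> searchLayer(
      float[] query,
      int entryPoint,
      int topK,
      Map<Integer, List<Integer>> neighbors,
      float[][] vectors,
      IntPredicate isLive) {                 // returns false for tombstoned nodes
    Comparator<Integer> byDist = Comparator.comparingDouble(n -> dist(query, vectors[n]));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDist);         // closest first
    PriorityQueue<Integer> results = new PriorityQueue<>(byDist.reversed()); // farthest first
    Set<Integer> visited = new HashSet<>();
    visited.add(entryPoint);
    candidates.add(entryPoint);
    if (isLive.test(entryPoint)) results.add(entryPoint);
    while (!candidates.isEmpty()) {
      int cand = candidates.poll();
      if (results.size() >= topK
          && dist(query, vectors[cand]) > dist(query, vectors[results.peek()])) {
        break; // every remaining candidate is farther than the current top-K
      }
      for (int n : neighbors.getOrDefault(cand, List.of())) {
        if (visited.add(n)) {
          candidates.add(n);                 // always explore, even tombstones...
          if (isLive.test(n)) {              // ...but only live nodes become results
            results.add(n);
            if (results.size() > topK) results.poll();
          }
        }
      }
    }
    List<Integer> out = new ArrayList<>(results);
    out.sort(byDist); // closest first
    return out;
  }

  private static double dist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

The only behavioral difference from today's filtered search is that liveness comes from the graph's own tombstone marks instead of an external acceptOrds bitset.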

jmazanec15 commented 1 year ago
  1. We could remove the deleted node from the graph, fully connect all of its neighbors to each other, and then run a diversity check on those neighbors to remove the extra links. In case the deleted node is an entry node, we can insert the deleted node's closest neighbor up to the level where the deleted node existed.

What do you mean by fully connect its neighbors? Would this mean basically figuring out the to-be-deleted node's in-edges and reinserting them into the graph using the normal edge-selection strategy, excluding the deleted nodes, to "patch" the broken connections? We looked into this a little bit recently, but the number of reinserts grows pretty fast. It might be promising, though, to start finding replacement neighbors from the neighbor that is being removed (as opposed to starting from the global entry point). I think with this approach we would need to figure out a way to avoid quality drift after the graph has been manipulated like this over several generations - the edge-selection strategy is different from the one used while building the graph. For instance, refinement over time may mean that the long-distance hop neighbors added early on would start to disappear. Would the diversity check help in this case? Also, I think at a certain point it will be better to just rebuild the graph from scratch, suggesting a threshold might need to be selected.
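A minimal sketch of the "start from the local neighborhood" idea, assuming the same toy single-layer representation as above; the one-hop expansion and `maxCandidates` cutoff are illustrative choices, not anything in Lucene:

```java
import java.util.*;

// Sketch: when node `patched` loses an edge to a deleted node, look for
// replacement neighbors among the deleted node's other neighbors and their
// one-hop neighborhood, instead of re-running a search from the global entry
// point. All names and the Map-based graph are illustrative.
class LocalRepairSketch {
  static List<Integer> findReplacements(
      int patched,
      List<Integer> seedHood,                  // the deleted node's other neighbors
      Map<Integer, List<Integer>> neighbors,
      float[][] vectors,
      Set<Integer> deleted,
      int maxCandidates) {
    // Expand one hop out from the seed neighborhood to build a candidate pool.
    Set<Integer> pool = new HashSet<>(seedHood);
    for (int s : seedHood) {
      pool.addAll(neighbors.getOrDefault(s, List.of()));
    }
    pool.remove(patched);
    pool.removeAll(deleted);                   // never link to deleted nodes
    // Rank candidates by distance to the node being patched and keep the best.
    List<Integer> ranked = new ArrayList<>(pool);
    ranked.sort(Comparator.comparingDouble(c -> dist(vectors[patched], vectors[c])));
    return new ArrayList<>(ranked.subList(0, Math.min(maxCandidates, ranked.size())));
  }

  private static double dist(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```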

  2. We could tolerate a certain amount of deletions (say 10% ~ 20%) inside the HNSW graph and just use the deleted nodes as connections.

There was some discussion around this in hnswlib: https://github.com/nmslib/hnswlib/issues/4#issuecomment-678315156. In practice, this probably would work well - but I'm not really sure how to choose the right deletion threshold. I agree with @mbrette, though - it might be good to take a hybrid approach.
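A hybrid policy could be as simple as a merge-time check like the following; the 20% cutoff and all of the names are made up for illustration and are not an existing Lucene API:

```java
// Sketch of the hybrid policy discussed above: tolerate tombstones in the init
// graph up to some ratio, and fall back to rebuilding from scratch past it.
// The threshold value and names are illustrative only.
class HybridMergePolicySketch {
  static final double MAX_TOMBSTONE_RATIO = 0.20; // assumed cutoff, needs tuning

  /** @return true if the candidate segment's graph may seed the merged graph. */
  static boolean canInitFromGraph(int liveDocs, int deletedDocs) {
    int total = liveDocs + deletedDocs;
    if (total == 0) return false;
    double deletedRatio = (double) deletedDocs / total;
    return deletedRatio <= MAX_TOMBSTONE_RATIO;
  }
}
```

Below the threshold the deleted nodes would remain in the graph as tombstones (approach 2); above it the merge would fall back to today's from-scratch build.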

jmazanec15 commented 1 year ago

Additionally, the FreshDiskANN paper did some work in this space. They ran a test on NSG where they iteratively repeated the following process for a certain number of cycles and tracked the recall:

  1. delete 5% of the index
  2. patch the incident nodes that were impacted, via their local neighborhoods (similar to @zhaih's (1))
  3. reinsert the deleted nodes
  4. measure recall

They ran a similar test for HNSW where they did not patch the edges. In both cases, they saw some degradation:

[Screenshot: recall degradation plots from the FreshDiskANN paper]

Their intuition for this is that the graphs become sparser as this process repeats, leading to reduced navigability. The graphs become sparser because the pruning policy is more strict.

In their system, they do employ an algorithm similar to @zhaih's (1), where they connect the incident edges and prune based on some criteria, which shows promise.