apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

HnswGraph creates disconnected components #12627

Closed nitirajrathore closed 1 month ago

nitirajrathore commented 1 year ago

Description

I work on Amazon Retail Product Search, and we use Lucene KNN for semantic search of products. We recently noticed that the generated HNSW graphs are not always strongly connected, and in the worst case some products may be undiscoverable. Connectedness of a hierarchical graph can be complicated to reason about, so my experiment details are below.

Experiment:

I took Lucene indexes from our production servers and ran the following test for each segment (HNSW graph). For the graph at each level I started from the same entry point, the entry point of the HNSW graph, and checked how many nodes are reachable from it. Note that connectedness at each level was checked independently of the other levels. Sample code is attached. My observations are below.
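
For reference, here is a minimal sketch of such a per-level reachability check, written against the accessors of org.apache.lucene.util.hnsw.HnswGraph. The class and method names are illustrative only; this is a simplified stand-in for the attached sample code, not the exact code used for the numbers below.

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.BitSet;
    import java.util.Deque;

    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.hnsw.HnswGraph;

    public class CheckConnectednessSketch {
      /** For each level, returns the fraction of that level's nodes reachable from the entry node. */
      public static double[] reachableFraction(HnswGraph graph) throws IOException {
        double[] fractions = new double[graph.numLevels()];
        for (int level = 0; level < graph.numLevels(); level++) {
          int levelSize = graph.getNodesOnLevel(level).size();
          BitSet visited = new BitSet(graph.size());
          Deque<Integer> stack = new ArrayDeque<>();
          int entry = graph.entryNode(); // the entry node is present on every level
          stack.push(entry);
          visited.set(entry);
          int reachable = 0;
          while (!stack.isEmpty()) {
            int node = stack.pop();
            reachable++;
            graph.seek(level, node); // position on this node's neighbour list at this level
            for (int nbr = graph.nextNeighbor();
                nbr != DocIdSetIterator.NO_MORE_DOCS;
                nbr = graph.nextNeighbor()) {
              if (!visited.get(nbr)) {
                visited.set(nbr);
                stack.push(nbr);
              }
            }
          }
          fractions[level] = (double) reachable / levelSize;
        }
        return fractions;
      }
    }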

Reproducibility

Although it is fairly easy to reproduce with Amazon's internal data, I found it really hard to reproduce with random vectors. I will attach a simple main class for my test along with the parameters that I tried; I only got some disconnectedness with the parameters listed in the table below. For higher numbers of vectors, execution becomes painfully slow because of HNSW graph creation. I will continue to look for parameters that reproduce the issue and also explore open-source datasets for the same.

DIS-L0 means the percentage of disconnected nodes in the graph at level 0 of the HNSW graph.

| randomness seed | dimension | no. of doc vectors | max-conn | beam-width | dis-l0 | dis-l1 | dis-l2 | dis-l3 | dis-l4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| -219343918 | 256 | 100_000 | 16 | 100 | 0 | 0.0158 (1 node disconnected) | 0 | 0 | 0 |
| -245556271 | 256 | 100_000 | 16 | 100 | 0 | 0.0158 (1 node disconnected) | 0 | 0 | 0 |
| -766887769 | 256 | 1_000_000 | 16 | 100 | 0.0003 | 0.04166 | 0 | 0 | 0 |

No direct solution comes to mind, but as per my discussion in this email thread, I will look into the part where some connections are removed to maintain the max-conn property (the diversity check, etc.).

Version and environment details

Lucene version : 9.7

nitirajrathore commented 11 months ago

I was able to run tests on the wiki dataset using the luceneutil package. The results show that even with a single-segment index and no updates, around 1% of nodes get disconnected for about 1M vectors. It would be great if someone else could have a look at CheckHNSWConnectedness for correctness.

This may or may not be an issue for different systems, given that this is 'approximate' nearest neighbour search. But in my opinion it is worth exploring further and, if possible, fixing.

Next I will try to reproduce it with multiple segments and try to find the cause and a fix for it.

msokolov commented 11 months ago

Do you have any results on how connectivity varies with maxConn? I found this article that talks about connectivity when filtering; in any case it shows a suggestive graph that made me think there may be non-linear behavior and we should try to fit maxConn to the dataset better. I also believe this GloVe data is somewhat pathological: it doesn't have the best numerical properties. Maybe try re-running your test with some of the other wikipedia-based datasets we generated for luceneutil, like the one using the minilm model?

Also - can you think of some way to preserve global connectivity while removing links? It seems somewhat intractable to me, but maybe there is some way?

msokolov commented 11 months ago

I'm thinking of something like this: first build a fully-connected graph (no link removal, no enforcement of maxconns). Then apply a global pruning algorithm of some sort that attempts to preserve connectivity. For example, you could iterate over all the vectors in the graph, searching for each one using a consistent entry point (node 0 in the graph say, since it will probably be maximally-connected), and "coloring" the visited nodes and edges during the searches. Stop once every node has been visited. Some of the edges will not have been visited, and we could prune those from nodes with large numbers of edges.
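
A rough structural sketch of this two-phase idea, with the per-node HNSW search abstracted behind a callback; this operates on a plain adjacency-list graph rather than Lucene's HnswGraph, and all type and method names are hypothetical.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;
    import java.util.function.BiConsumer;

    public class GlobalPruneSketch {
      /** Runs one search for a target node and reports each traversed edge (from, to). */
      interface Searcher {
        void search(int target, int entryPoint, BiConsumer<Integer, Integer> onEdge);
      }

      public static void prune(List<List<Integer>> neighbors, Searcher searcher, int entryPoint, int maxConn) {
        int n = neighbors.size();
        List<Set<Integer>> colored = new ArrayList<>();
        for (int i = 0; i < n; i++) colored.add(new HashSet<>());
        boolean[] nodeVisited = new boolean[n];
        nodeVisited[entryPoint] = true;

        // Search for every node from the same entry point, coloring visited nodes and edges.
        for (int target = 0; target < n; target++) {
          if (allVisited(nodeVisited)) break; // stop once every node has been touched
          searcher.search(target, entryPoint, (from, to) -> {
            colored.get(from).add(to); // color the edge from -> to
            nodeVisited[from] = true;
            nodeVisited[to] = true;
          });
        }

        // Prune uncolored edges, but only from nodes that exceed the maxConn budget.
        for (int u = 0; u < n; u++) {
          List<Integer> nbrs = neighbors.get(u);
          if (nbrs.size() <= maxConn) continue;
          Iterator<Integer> it = nbrs.iterator();
          while (it.hasNext() && nbrs.size() > maxConn) {
            if (!colored.get(u).contains(it.next())) {
              it.remove();
            }
          }
        }
      }

      private static boolean allVisited(boolean[] visited) {
        for (boolean v : visited) if (!v) return false;
        return true;
      }
    }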

nitirajrathore commented 11 months ago

Thanks @msokolov, these are really good suggestions. I will try to incorporate these ideas into the solution. I think in the end there can be multiple ways to allow more connectivity.

I was also worried about how the connections are made and pruned. I checked these lines in our algorithm. Here we connect a node to the neighbour list only if it is not closer to any of the current neighbours. Think of a situation where, for some node (say a), the first neighbour connected (say b) has a good score, but the rest of the candidate nodes score higher with b than with a. Then the rest of the candidates will NOT get connected to a AT ALL. Later, if the non-diverse connection b -> a is removed because of some other connection of b, then a becomes completely disconnected from the graph. Basically, we should make more connections at the beginning so that if some edges are removed later we do not run the danger of a disconnected graph.

To check this I added a lot of code to collect events (like add, connect, disconnect) and was able to reproduce it with a graph of just 10K nodes and max-conn 16. Example below. The same is not reproduced with max-conn 32 or max-conn 64. Although increasing max-conn helped in this case, the basic algorithm for making connections does not seem very good to me.

To fix this, I found that the original paper proposes the concept of keepPrunedConnections. I implemented that locally and ran tests, but something is not right; I will keep checking whether I did something wrong in the implementation. I will submit a "draft" PR soon for comments.

I think there is another issue in the current implementation. Look at this code where we finally connect the edge b ---> a: if b now has more connections than max-conn, we remove one of the non-diverse connections of b, say b ---> c. So we remove this half of the undirected connection, but I don't think we remove the other half, c ---> b, anywhere. This leaves an inconsistent graph. I tried fixing this as well, but I am getting some errors; I will try to fix them and submit the draft PR.

Note that adding the concept of keepPrunedConnections will not just affect the disconnected nodes; in general the number of connections of every node will increase and approach max-conn. I think this is a good property without much additional cost at graph-construction time. Search time may increase a bit, since there will be more connections to explore for the same max-conn, but I think this will also increase recall against exact KNN. I will run the performance tests once the code is complete to get those numbers.


msokolov commented 11 months ago

So we are removing this half of the undirected connection but I don't think we are removing the other half c ---> b anywhere. This will leave inconsistent Graph

This is by design; the resulting graph is not expected to be symmetric (i.e. undirected/bidirectional). It is allowed to have a->b with no corresponding b->a.

nitirajrathore commented 11 months ago

Hi @msokolov! Thanks for clarifying. But I think it can help to remove the 'less important' edge from both sides, since that frees up a degree on the "other" node to accept a new connection without triggering the diversity check. Most of the time, the problem I saw with the diversity check is that the most recently added connection to a node itself turns out to be non-diverse and gets removed immediately. Because of this, the new incoming node does not even get enough connections in the graph to begin with. Even in the case where I removed the diversity check while connecting the incoming node (a) to its neighbours (say b, c) and allowed the full max-conn number of connections, most of the connections from the neighbours back to the node (b->a) are removed because of the diversity check, leaving mostly half connections (a->b). Now when a new node, say x, tries to connect to a, even that is rejected by a because of the diversity check. So keeping half connections sometimes acts against the connectedness that HNSW relies on. I have come up with a new heuristic which seems to work well against disconnectedness, but I have yet to optimize it and run performance tests on it. I will update the algorithm here and post numbers later.

nitirajrathore commented 11 months ago

I was able to conduct some perf tests as well. I would like to propose the following changes, using the example of adding node a to neighbours b, c, d, whose current adjacency is:

| node | neighbours |
| --- | --- |
| b | c, d, e |
| c | b, d |
| e | b, x, y, z |

Proposed heuristic

  1. At the time of adding a node, always add all (up to maxConn) edges to the highest-scoring neighbours, disregarding the diversity check, e.g. adding edges a-->b, a-->c, etc. Currently we also apply the diversity check at this point.
  2. At the time of adding the reverse edges from the neighbours to the node, if a given neighbour ends up with more than maxConn connections, use a new diversity-check formula to choose the edge to remove. For example, at the time of adding edge b --> a, if we see that b has more than maxConn connections, we apply the new heuristic and try to remove one edge.
  3. Check for diversity only among the common neighbours of the candidate node and the node. For example, if b has more than maxConn connections, we iterate over b's neighbours (which now include a), considering each one as a possible candidate for disconnection. When deciding whether to disconnect an edge, say b-->c, we consider only the common neighbours of c and b for the diversity check, so we compare score(c,b) only with score(c,d) and not with score(c,e), since only d is a common neighbour of c and b. If we find a non-diverse node we return it; otherwise
  4. we return the node that has the highest number of common neighbours with b; otherwise
  5. we return the node with the maximum number of edges (e.g. e in the example above).
  6. If none of the above conditions holds, we return -1, meaning we should not remove any connection.
  7. If we receive a valid node for removal, we remove that edge and the other half of the connection, i.e. we remove both directions. So if c is returned as the non-diverse node, then we remove both b ---> c and c ---> b. A rough sketch of this selection logic is shown right after this list.
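
The sketch below illustrates steps 2-7 on a plain adjacency-set representation; the ScoreFunction interface and all class/method names are illustrative assumptions, not the actual patch or Lucene's API.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class RemovalHeuristicSketch {
      /** Higher score means closer. */
      interface ScoreFunction {
        float score(int a, int b);
      }

      /**
       * Applied when adding the reverse edge b -> a pushes b over maxConn (step 2): picks one of b's
       * neighbours to disconnect per steps 3-6, then removes both halves of that edge (step 7).
       */
      static void enforceMaxConn(List<Set<Integer>> neighbors, int b, int maxConn, ScoreFunction score) {
        Set<Integer> bNbrs = neighbors.get(b);
        if (bNbrs.size() <= maxConn) {
          return;
        }
        int nonDiverse = -1;    // step 3: non-diverse w.r.t. a common neighbour
        int mostCommon = -1;    // step 4: most common neighbours with b
        int mostCommonCount = 0;
        int maxDegreeNode = -1; // step 5: highest degree
        int maxDegree = 0;
        for (int c : bNbrs) {
          Set<Integer> common = new HashSet<>(neighbors.get(c));
          common.retainAll(bNbrs); // common neighbours of b and c
          for (int d : common) {
            if (score.score(c, d) > score.score(c, b)) {
              nonDiverse = c; // c is closer to a shared neighbour d than to b
              break;
            }
          }
          if (nonDiverse != -1) {
            break;
          }
          if (common.size() > mostCommonCount) {
            mostCommonCount = common.size();
            mostCommon = c;
          }
          if (neighbors.get(c).size() > maxDegree) {
            maxDegree = neighbors.get(c).size();
            maxDegreeNode = c;
          }
        }
        int victim = nonDiverse != -1 ? nonDiverse : (mostCommon != -1 ? mostCommon : maxDegreeNode);
        if (victim == -1) {
          return;                        // step 6: no safe removal found; keep the extra edge
        }
        bNbrs.remove(victim);            // step 7: remove b -> victim ...
        neighbors.get(victim).remove(b); // ... and the other half, victim -> b
      }
    }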
Performance numbers:

| Branch | recall | avgCpuTime | numDocs | reindexTimeMsec | fanout | maxConn | beamWidth | totalVisited | selectivity | prefilter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.451 | 17.78 | 1000000 | 370916 | 0 | 16 | 100 | 10 | 1.00 | post-filter |
| Candidate | 0.599 (33% increase) | 21.63 (22% increase) | 1000000 | 624797 (68% increase) | 0 | 16 | 100 | 10 | 1.00 | post-filter |

Since we are now increasing the number of connections per node, we also see a good increase in recall for the same max-conn. For the same reason, we do expect avgCpuTime to increase at query time. I can work a bit on improving the indexing time, as my current implementation is very naive: it finds common nodes with two for loops and in some places recalculates scores where not needed. But overall the indexing time will certainly increase.

Another piece of data I will post next is a histogram of the number of edges each node has. I expect it to increase from lower values toward max-conn on average.

I will upload my (rough) draft PR now.

How I reached this heuristic:

Before coming up with this heuristic I tried quite a few simpler ones, but in those cases I was either getting far more disconnections than mainline, or very few (sometimes just one), but some disconnection was still happening. I am open to removing or adding any other condition considered for edge removal.

Initially, I tried just (1) and left the rest of the algorithm the same as mainline. This "increased" the disconnections by a large factor instead of decreasing them. The issue was that if, say, nodes 0 to 32 are added in a star-like formation and then node 33 comes in, it forms connections with nodes 1-32, and each of them may find 0 to be the non-diverse node, so each one removes 0, leaving it disconnected from the graph. Other than this extreme case there were others too. That is where (3) came in handy: we don't check the diversity of an edge unless we are sure there is an alternate path to the node being disconnected via a neighbouring node. This, coupled with (7), reduced the disconnectedness from a few thousand nodes to just a couple of nodes at level 1 and only an occasional disconnected node at level 0. Then, to achieve no disconnections at all, I initially tried (1, 2, 3) with (6), to remove even the slightest chance of creating disconnectedness, but with this we end up not removing any connection a large number of times, and the number of connections per node increased from a max-conn of 32 at level 0 to upwards of 250 for more than 5% of the nodes. To control the number of connections I added (4) and (5), mostly to remove at least some connection without affecting the graph geometry much. One might wonder whether (7) is actually required, so I will run some tests without (7). But I think it plays an important role by freeing up some edges and allowing those nodes to accept unconditional connections from other incoming nodes.

benwtrent commented 11 months ago

A 68% increase in index time is untenable; I would be a hard no on a change that slows things down this much. Maybe we can find something better.

@nitirajrathore i know https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc has various heuristics, have you explored those to see how they can be modified or added here?

nitirajrathore commented 11 months ago

@benwtrent: I have added a draft PR. The code is not at all optimized for performance right now; I am hoping to fix some obvious things and will post perf results here.

have you explored those to see how they can be modified or added here?

No, I haven't checked those. Thanks for pointing them out. I will check and report back.

benwtrent commented 10 months ago

@nitirajrathore @msokolov I had an idea around this; it would cost an extra 4 bytes per node on each layer it is a member of (maybe we only need this on the bottom layer...).

What if we added an "incoming connection" count for every node? Then when disconnecting non-diverse neighbors, we enforce that we can never disconnect the last connection to a node.
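
A minimal sketch of that bookkeeping, assuming an extra int counter per node maintained by the graph builder; the class and method names are hypothetical.

    public class IncomingCountGuard {
      private final int[] incomingCount; // one counter per node in the graph

      public IncomingCountGuard(int graphSize) {
        this.incomingCount = new int[graphSize];
      }

      public void onEdgeAdded(int from, int to) {
        incomingCount[to]++;
      }

      public void onEdgeRemoved(int from, int to) {
        incomingCount[to]--;
      }

      /** Returns true only if removing the edge from -> to would not leave 'to' without incoming edges. */
      public boolean canRemove(int from, int to) {
        return incomingCount[to] > 1;
      }
    }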

msokolov commented 10 months ago

Is the problem primarily to do with single isolated nodes or do we also see disconnected subgraphs containing multiple nodes? I think this idea would prevent the isolated nodes, but not fix the other case.

benwtrent commented 10 months ago

@msokolov good point.

It seems to me we would only fully disconnect a sub-graph if it's very clustered. Is there a way to detect this in the diversity selection?

One other thing I just found that confuses me: https://github.com/apache/lucene/pull/12235

This slightly changed the diversity check in an attempt to improve performance. The key issue is here: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L386-L407

If we have "checked the indices", we always just remove the furthest one, which doesn't seem correct to me. We should check for diversity starting at the furthest one, not always remove it?

@nitirajrathore for your connection checker, are you testing on a Lucene > 9.7.0?

msokolov commented 10 months ago

My memory of the way this diversity criterion has evolved is kind of hazy, but I believe in the very first implementation we would not impose any diversity check until the neighbor array was full? It seems as if that would have tended to preserve more links, but then we (I think?) decided that wasn't the approach advocated by the academic papers we were emulating, and implemented the diversity check to every added neighbor. Now we have this checked/unchecked distinction. I seem to remember that the checked ones have had a diversity check applied -- so we know that as a group, they are diverse (none of them is closer to any of the others than it is to the target node). So this would seem to support the idea that we can add unchecked nodes and then only check when the list gets full?? So, what are we doing?! Sorry, I've lost track.

benwtrent commented 10 months ago

@nitirajrathore could you add something to KnnGraphTester that is a test for connectedness?

If you can't get to it, I can try. It would be good to have it alongside all our other performance testing as another metric to record and optimize for.

nitirajrathore commented 10 months ago

What if we added an "incoming connection" count for every node?

&

I think this idea would prevent the isolated nodes, but not fix the other case.

I will write some code to check the type of disconnectedness: clustered vs. single nodes.

but then we (I think?) decided that wasn't the approach advocated by the academic papers we were emulating, and implemented the diversity check to every added neighbor

The paper proposes 3 different heuristics:

  1. Check diversity before adding any node -- the standard approach in the paper (getNeighborsByHeuristic2() in the code).
  2. Check diversity before adding any node, but if fewer than max-conn nodes were kept, back-fill from the pruned candidates up to max-conn nodes -- keepPrunedConnections in the paper (getNeighborsByHeuristic1() in the code; a sketch of this variant follows this list).
  3. Add all the neighbours of neighbours to the candidate list before selecting diverse nodes as in (1), and optionally apply keepPrunedConnections as well -- called extendCandidates in the paper. (Not sure, but I guess this is getNeighborsByHeuristic3() in the code; I need to understand it a bit more.)
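
For reference, an illustrative sketch of heuristic (2) over plain lists; the candidate ordering, the ScoreFunction interface, and all names are assumptions for illustration, not the code in Lucene or nmslib.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class KeepPrunedConnectionsSketch {
      /** Higher score means closer. */
      interface ScoreFunction {
        float score(int a, int b);
      }

      /**
       * Heuristic (2): select diverse neighbours for q, then back-fill with the best pruned
       * candidates until maxConn are kept. Candidates are assumed sorted closest-first.
       */
      static List<Integer> selectNeighbors(int q, List<Integer> candidates, int maxConn, ScoreFunction score) {
        List<Integer> result = new ArrayList<>();
        Deque<Integer> pruned = new ArrayDeque<>(); // keeps pruned candidates in score order
        for (int c : candidates) {
          if (result.size() >= maxConn) {
            break;
          }
          boolean diverse = true;
          for (int r : result) {
            if (score.score(c, r) > score.score(c, q)) {
              diverse = false; // c is closer to an already selected neighbour than to q
              break;
            }
          }
          if (diverse) {
            result.add(c);
          } else {
            pruned.addLast(c);
          }
        }
        // keepPrunedConnections: top up with the closest pruned candidates
        while (result.size() < maxConn && !pruned.isEmpty()) {
          result.add(pruned.pollFirst());
        }
        return result;
      }
    }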

Our current implementation is like (1), with some optimizations for performance. My proposed approach is similar to (2), with stricter conditions: checking diversity only for nodes which are surely connected via their neighbours. I earlier tried to implement (2) just by keeping the pruned connections up to max-conn. As I mentioned, this increased the number of disconnected nodes by a factor instead of decreasing it. The reason, as I understand it, is that there is now more competition between nodes, since every node has max-conn connections all the time, triggering findWorstNonDiverse() for every neighbour every time, and most of the time the just-formed edge is found non-diverse and dropped. I think there might be something wrong in the checked/unchecked logic in findWorstNonDiverse(). I cannot fully get my head around that code, but I think it incorrectly gives a green signal to all the current neighbours as diverse, and checking just the 'unchecked' connections triggers the diversity check only for the newly added edge, which it then finds non-diverse and removes.

I think there is a huge difference in the number of connections between heuristics (1) and (2), and by inference between the current implementation and my proposed implementation, so I will post numbers around that. But I don't think that is necessarily bad: if the average number of connections increases, recall against exact KNN also increases, meaning we can get the same recall with a smaller max-conn value. I will also try to get some numbers targeting the same recall as the current implementation but with a lower max-conn, and then check the search latency. This may also reduce the indexing time; let's see by how much.

@nitirajrathore could you add something to KnnGraphTester that is a test for connectedness?

I have separate scripts for those; I will post the links here and try to get the code into luceneutil, although compilation is an issue in that package (which we are trying to fix separately). We could also add that as an optional check in KnnGraphTester.

nitirajrathore commented 10 months ago

meaning that we can get same recall for a smaller max-conn value now.

I ran some tests with max-conn = 16 and max-conn = 8, and it seems that with my proposal even max-conn = 8 is better than max-conn = 16 on mainline. I will also add more stats.

I diverged from the main branch at commit f64bb19697708bfd91e05ff4314976c991f60cbc (15 Oct) and haven't merged from it since. I think this is Lucene > 9.7.0, but I will confirm.


main: commit f64bb19697708bfd91e05ff4314976c991f60cbc with max-conn = 16

| recall | avgCpuTime | numDocs | fanout | maxConn | beamWidth | totalVisited | reindexTimeMsec | selectivity | prefilter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.451 | 17.22 | 1000000 | 0 | 16 | 100 | 10 | 406166 | 1.00 | post-filter |

candidate with max-conn = 16

| recall | avgCpuTime | numDocs | fanout | maxConn | beamWidth | totalVisited | reindexTimeMsec | selectivity | prefilter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.595 (+32%) | 24.19 (+40%) | 1000000 | 0 | 16 | 100 | 10 | 581090 (+43%) | 1.00 | post-filter |

candidate with max-conn = 8

| recall | avgCpuTime | numDocs | fanout | maxConn | beamWidth | totalVisited | reindexTimeMsec | selectivity | prefilter |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.465 (+3%) | 16.35 (-5%) | 1000000 | 0 | 8 | 100 | 10 | 325321 (-20%) | 1.00 | post-filter |

Interesting fact: a simple implementation using two for loops to find the common neighbours works better than using a HashSet or IntIntHashMap(). I think the major contribution to the indexing time is the increased number of connections. But as shown above, the indexing time decreases drastically with a lower max-conn, while still maintaining or slightly improving recall and search avgCpuTime.

I will follow up with more scripts, info/stats, and some code improvements next. Also, I am thinking I should do a one-to-one comparison of the 3 heuristics mentioned in the paper, with the level of disconnectedness in each approach.

benwtrent commented 10 months ago

@nitirajrathore very interesting findings.

This makes me wonder if the heuristic should take a middle ground and, instead of keeping all pruned connections, keep half.

I am interested in seeing the connectedness comparison between the heuristics.

msokolov commented 10 months ago

Yes, this is a promising avenue to explore! One note of caution: we should avoid drawing strong inferences from a single dataset. I'm especially wary of GloVe because I've noticed it seems to have poor numerical properties. We especially should not be testing with random vectors. Ideally we would try several datasets, but if I had to pick one I'd recommend the minilm (384-dim) vectors we computed from wikipedia, or some internal Amazon dataset, or I know the Elastic folks have been testing with a Cohere dataset. You can download the minilm data via sftp from home.apache.org (cd /home/sokolov/public_html) if you have an Apache login. You can also regenerate it using infer_vectors.py in luceneutil, but that takes a little while.

angadp commented 9 months ago

I am +1 on this issue and would like to try the gains in the Amazon codebase. I think that with a lowered max-conn this can give some latency gains.

I also feel there is a use case for indexes with a small number of updates, where the disconnected nodes might never get connected again. Looking forward to the results from the latest benchmark run.

I'm also wondering if there is any way to make this configurable as an indexing flag, to preserve backwards compatibility.

benwtrent commented 8 months ago

OK, I added https://github.com/mikemccand/luceneutil/pull/253

Doing some local benchmarking. It seems that the more merges occur, the worse we can get.

Sometimes I get good graphs like this:

Leaf 5 has 5 layers
Leaf 5 has 137196 documents
Graph level=4 size=2, Fanout min=1, mean=1.00, max=1
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1   1   1   1   1   1
Graph level=3 size=34, Fanout min=4, mean=7.82, max=12
%   0  10  20  30  40  50  60  70  80  90 100
    0   5   6   7   7   8   8   9   9  10  12
Graph level=2 size=545, Fanout min=13, mean=15.98, max=16
%   0  10  20  30  40  50  60  70  80  90 100
    0  16  16  16  16  16  16  16  16  16  16
Graph level=1 size=8693, Fanout min=16, mean=16.00, max=16
%   0  10  20  30  40  50  60  70  80  90 100
    0  16  16  16  16  16  16  16  16  16  16
Graph level=0 size=137196, Fanout min=3, mean=24.57, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0  14  17  19  22  26  31  32  32  32  32
Graph level=4 size=2, connectedness=1.00
Graph level=3 size=34, connectedness=1.00
Graph level=2 size=545, connectedness=1.00
Graph level=1 size=8693, connectedness=1.00
Graph level=0 size=137196, connectedness=1.00

Other times I get graphs that are pretty abysmal:

Leaf 7 has 4 layers
Leaf 7 has 39628 documents
Graph level=3 size=8, Fanout min=1, mean=1.75, max=7
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1   7   7
Graph level=2 size=153, Fanout min=1, mean=1.10, max=16
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  16
Graph level=1 size=2503, Fanout min=1, mean=1.01, max=16
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  16
Graph level=0 size=39628, Fanout min=1, mean=1.00, max=32
%   0  10  20  30  40  50  60  70  80  90 100
    0   1   1   1   1   1   1   1   1   1  32
Graph level=3 size=8, connectedness=1.00
Graph level=2 size=153, connectedness=0.12
Graph level=1 size=2503, connectedness=0.01
Graph level=0 size=39628, connectedness=0.00

The numbers are so bad, I almost think this is a bug in my measurements, but it isn't clear to me where it would be.

I am going to validate older versions of Lucene to see if this changes.

benwtrent commented 8 months ago

I have run this test back on Lucene 9.4, and we still end up, every once in a while, with a graph where the mean number of connections hovers around 1 and whose connectedness is very low. It may be that these particular docs are tightly clustered, but even then it seems weird that it should end up being so bad.

benwtrent commented 8 months ago

OK, I did some more digging, and it seems like my data is garbage, or at least how I am reading it in. I looked at one of these extremely disconnected graphs and found that all the float32 vectors were exactly the same. So, this appears to be a bug in my code.

nitirajrathore commented 8 months ago

Hi @benwtrent,

I have left Amazon, but I was able to run some tests with an open dataset, and also with the Amazon dataset before leaving. I cannot share a whole lot of detail about the Amazon dataset, but some aggregated results are shown. Rest assured, I am picking up this issue with priority now.

I was able to run tests trying out different heuristics. Basically, I tried the Lucene default diversity check, which produces disconnected nodes even with 50K nodes of the minilm dataset. I then tried different variations as suggested in the HNSW paper. The extendCandidates approach reduced the disconnectedness but did not eliminate it completely, and it increased the indexing time many times over. Next, the keepPruned candidates approach, keeping up to max-conn connections, increased the number of disconnected nodes. So I tried keepPruned candidates keeping pruned connections only up to max-conn/2; this also did not show any improvement.

Next, I tried the newly proposed heuristic but without removing the two-way connections, i.e. the new heuristic removing only one side of the duplex connection in the graph. Interestingly, this also did not change the disconnectedness, which was a bit surprising to me. Then I tried the new heuristic with remove-otherhalf, i.e. removing the two-way connection completely. This eliminated disconnectedness entirely: the number of disconnected nodes at all levels was zero. Unfortunately, this means the number of edges at some nodes can grow beyond max-conn. I did not get a chance to measure the distribution of per-node connection counts beyond max-conn, but I assume this is something we may not want in Lucene. So I thought maybe removing duplex edges (the remove-otherhalf part) is the key to decreasing disconnectedness, and I tried the default algorithm and keepPruned, both with remove-otherhalf. Interestingly, those two experiments also did not decrease the number of disconnected nodes.

That left the new heuristic with remove-otherhalf as the best option. To make sure the total connections per node do not grow beyond max-conn, I modified the algorithm to remove some random node when it is not able to find the best node to remove (new heuristic with remove-otherhalf and honour max-conn). This did keep the overall disconnectedness at zero, but it still showed some disconnected nodes at level 1 (see comments) for the minilm dataset. I tried max-conn=8, max-conn=16, num_docs=50K and num_docs=100K. All gave zero overall disconnectedness and zero disconnected nodes at level 0, but there were some disconnected nodes in the level-1 graph.

Anyway, I also ran the experiment on the Amazon dataset with the new heuristic with remove-otherhalf and honour max-conn. I had to redo the code changes to adapt them to the Lucene 9.7 version. I saw similar results: here too the default algorithm gave a lot of disconnected nodes, but the new heuristic gave zero disconnected nodes at level 0 and zero overall disconnected nodes. However, at level 1, and sometimes at level 2, there were still some disconnected nodes. I am more confident of the new heuristic now, but I wouldn't mind running more tests and perf tests.

PR with all heuristics : https://github.com/apache/lucene/pull/12783/commits

max-conn=16 & num_docs = 50K (unless specified)

| Algo | Disconnected nodes at level 0 | % overall disconnected (nodes) | Comments | Index time |
| --- | --- | --- | --- | --- |
| Baseline equivalent | 90 | 0.1740 (87 nodes) | Same as baseline but with some extra parameters to allow more experiments | 33s |
| Baseline | 90 | 0.1740 (87) | Exactly the code as in production | |
| Extend candidates | 39 | 0.0760 (38) | | 280s |
| Keep-pruned till max-conn | 154 | 0.2880 (144) | disconnected at level1=4 and at level2=1 | 36s |
| Keep-pruned till half of max-conn | 97 | 0.1860 (93) | | 34s |
| New heuristic without remove-otherhalf | 119 | 0.2240 (112) | | 35s |
| New heuristic with remove-otherhalf | 0 | 0 | fully connected at both 50K and 100K docs, but there were errors at max-conn=8 since the neighbour array cannot grow and this algo allows more than max-conn connections | 33s |
| Baseline with remove-otherhalf | 91 | 0.1720 (86) | remove-otherhalf does not give zero disconnectedness with the mainline algo | |
| Keep-half-pruned with remove-otherhalf | 90 | 0.1740 (87) | no effect of remove-otherhalf with keep-half-pruned | 33s |
| New heuristic with remove-otherhalf and honour max-conn | 0 | 0 | but for max-conn=8 and docs=100K I saw 35 disconnected nodes at level1; there were no disconnected nodes at level0 even with max-conn=8 | 36s |
| Amazon dataset, baseline (lucene-9.7 release) | 75 - 338 in each segment | 0.26 - 0.48% | docs=7.5M, segments=13 | |
| Amazon dataset, new heuristic with remove-otherhalf and honour max-conn | 77 in 3 out of 10 segments | 0% overall | docs=7.5M, segments=10; indexing time +4%, red-line queries/sec +6%, 26% decrease in hnsw nodes visited but 8% increase in avg query latency | |
benwtrent commented 8 months ago

@nitirajrathore very interesting results. This sort of indicates to me that no matter the heuristic, we just need a second pass over the graph to ensure connectedness and fix it up :/

Even though the graph is different, the idea is the same here: https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java (see: reconnectOrphanedNodes())

I will see about testing your "new heuristic with remove-otherhalf and honour max-conn"; if I am reading this correctly, it seems the most promising. But I suspect that, even then, in extreme scenarios we will still need a second pass over the graph to ensure graph connectedness.

msokolov commented 6 months ago

I want to revive this discussion about disconnectedness. I think the two-pass idea is where we would have to go in order to ensure a connected graph, and to implement that we need to incorporate a segmentation or labeling step. Before we try to implement that in a "production" indexing setup, though, I would like to incorporate some of these tools into our test framework. This will enable us to provide test assertions that are aware of graph connectedness. Today we occasionally see random failures due to disconnected graphs, and this has led us to use looser test assertions than we might otherwise be able to. As a first step we can create (or re-use -- did we already create them here?) some tools for counting and sizing connected components (a rough sketch is below), and then also tools for patching them together. Could we relax maxConn when patching in order to avoid re-disconnecting?
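
A hedged sketch of such a component-counting tool over one graph level, treating edges as undirected (i.e. weakly connected components); it assumes the HnswGraph accessors, and everything else (class and method names) is illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.hnsw.HnswGraph;
    import org.apache.lucene.util.hnsw.HnswGraph.NodesIterator;

    public class ComponentCounterSketch {
      /** Returns the sizes of the weakly connected components on the given level, largest first. */
      static List<Integer> componentSizes(HnswGraph graph, int level) throws IOException {
        int[] parent = new int[graph.size()];
        Arrays.fill(parent, -1);
        NodesIterator nodes = graph.getNodesOnLevel(level);
        List<Integer> onLevel = new ArrayList<>();
        while (nodes.hasNext()) {
          int node = nodes.nextInt();
          onLevel.add(node);
          parent[node] = node; // make-set
        }
        for (int node : onLevel) {
          graph.seek(level, node);
          for (int nbr = graph.nextNeighbor();
              nbr != DocIdSetIterator.NO_MORE_DOCS;
              nbr = graph.nextNeighbor()) {
            union(parent, node, nbr); // merge endpoints, ignoring edge direction
          }
        }
        Map<Integer, Integer> sizes = new HashMap<>();
        for (int node : onLevel) {
          sizes.merge(find(parent, node), 1, Integer::sum);
        }
        List<Integer> result = new ArrayList<>(sizes.values());
        result.sort(Collections.reverseOrder());
        return result;
      }

      static int find(int[] parent, int x) {
        while (parent[x] != x) {
          parent[x] = parent[parent[x]]; // path halving
          x = parent[x];
        }
        return x;
      }

      static void union(int[] parent, int a, int b) {
        int ra = find(parent, a);
        int rb = find(parent, b);
        if (ra != rb) {
          parent[ra] = rb;
        }
      }
    }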

msokolov commented 6 months ago

Oh! I see we added some tooling in https://github.com/mikemccand/luceneutil/pull/253 as part of KnnGraphTester. Maybe we can migrate some of this to Lucene's test-framework.

benwtrent commented 6 months ago

@msokolov I think adding a "reachable" test to Lucene would be nice. The main goal of such a test would be ensuring that every node is eventually reachable on every layer. The tricky part is I am sure such a test would be noisy until we have a "connectedness check", especially for random vectors.

I do like having something in Lucene-util though as I have found the larger statistics useful when debugging why a particular graph or change breaks.

So, something in both places would be good.

msokolov commented 6 months ago

I was thinking of a baby step: count the number of nodes that are reachable and then use that in assertions like https://github.com/apache/lucene/blob/bf193a712535e416edbc854fb10e788ea79d397b/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L144

and this one https://github.com/apache/lucene/blob/bf193a712535e416edbc854fb10e788ea79d397b/lucene/core/src/test/org/apache/lucene/search/BaseVectorSimilarityQueryTestCase.java#L310

which failed randomly on Policeman Jenkins the other day with an off-by-one. I couldn't prove it in the time I had, but I suspect it was due to disconnectedness.

msokolov commented 6 months ago

I struggle to make this work, though, since the changes to make everything more typesafe have also made the interesting bits inaccessible. E.g. I thought of adding something like this:

        for (LeafReaderContext ctx : reader.leaves()) {
          CodecReader codecReader = (CodecReader) ctx.reader();
          HnswGraph graph = ((HnswGraphProvider) codecReader.getVectorReader()).getGraph(vectorField);
          if (HnswTestUtil.isFullyConnected(graph) == false) {
            // for now just bail out.
            return;
          }
        }

to BaseVector...QueryTest; however, the VectorReader returned from that hides its HnswGraph behind a few layers of abstraction:

class org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader cannot be cast to class org.apache.lucene.codecs.HnswGraphProvider (org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader and org.apache.lucene.codecs.HnswGraphProvider are in unnamed module of loader 'app')

not sure what the approved way to go about this would be

benwtrent commented 6 months ago

@msokolov in the HNSW codec, we do something like this already when gathering the underlying graph.

I would do something like:

 if (FilterLeafReader.unwrap(ctx.reader()) instanceof CodecReader codecReader) {
   KnnVectorsReader currKnnVectorsReader = codecReader.getVectorReader();
   if (currKnnVectorsReader instanceof PerFieldKnnVectorsFormat.FieldsReader perFields) {
     currKnnVectorsReader = perFields.get(vectorField);
   }
   if (currKnnVectorsReader instanceof HnswGraphProvider graphProvider) {
     if (HnswTestUtil.isFullyConnected(graphProvider.getGraph(vectorField)) == false) {
        // for now just bail out.
        return;
     }
   }
 }
msokolov commented 6 months ago

Thanks @benwtrent I opened https://github.com/apache/lucene/pull/13260

CloudMarc commented 5 months ago

Is there a clean separation between:

  1. HNSW ANN (e.g. clean room implementation) is known to have issues with unreachable nodes for certain configurations. and
  2. The Lucene implementation of HNSW ANN has additional issues that other implementations do not have ?

If this is mostly about the first, then it's a known issue with known mitigations.

If it's mostly about the second, doesn't that point to some specific defect in the implementation, a defect which could/should be fixed (in contrast, e.g., to two-pass approaches that may address both 1 and 2) ?

msokolov commented 5 months ago

I'm not aware of any Lucene-specific issue here. We see this mostly in unit tests, but it has also been replicated with production data. One can speculate that this might be more of an issue in Lucene because of its segmented multi-graph structure, but I'm not sure, since I don't have all that much experience with HNSW in other settings. I would be interested to hear about the known mitigations you mentioned.

benwtrent commented 5 months ago

@msokolov I am not sure what particular mitigations @CloudMarc is talking about, but I know of a couple of options outlined here: https://github.com/apache/lucene/issues/12627#issuecomment-1814046690

Though I would prefer not having yet another configuration item for HNSW. I think we should do a second "clean up pass" over HNSW (like JVector does for Vamana), to ensure we don't have orphaned nodes or clusters.

A second pass over the layers wouldn't be too bad performance-wise (we wouldn't have to do many, if any, vector comparisons in the second pass), and it would mostly consist of adding a couple of backlinks that were originally removed.

CloudMarc commented 5 months ago

Re "known mitigations" - I simply meant common HNSW/ANN configuration changes like increasing MaxConnections and BeamWidth.

Thanks for mentioning precedence for the two-pass approach. This seems useful: https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw#vamana-implementation-details

I'm +1 for the 2-pass approach, fwiw.

CloudMarc commented 5 months ago

One might hypothesize that this issue is due to adapting HNSW to Lucene's approach to segmentation.

This seems undercut by these observations: https://github.com/apache/lucene/issues/12627#issuecomment-1767662289

Key question: Does single-segment Lucene have the same rate of disconnectedness as a stock or reference HNSW implementation, or more? If it's significantly more, that may indicate an implementation defect.

xwt1 commented 4 months ago

(quoting nitirajrathore's heuristic results comment from 8 months ago, reproduced above, in full)

Hello @nitirajrathore! I came across your discussion regarding the various tests you conducted with the minilm dataset and found it particularly interesting. I'm currently engaged in similar research and am keen to explore this further. Could you please share how I might be able to access the minilm dataset, or point me to any resources where it might be available? It seems to be an open dataset. Any guidance you could provide would be greatly appreciated.

msokolov commented 2 months ago

I'd like to take a stab at the "second pass" idea for patching up disconnected graph components. As a first step, I think we ought to add state to HnswGraphBuilder to clearly indicate that all nodes have been added and we are now finalizing the graph. My plan is to add a getCompletedHnswGraph() method that can be called when flushing, leaving the existing getHnswGraph() method for callers that observe the graph during construction.

msokolov commented 1 month ago

also backported to 9x