apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Use hash set for visited nodes in HNSW search? [LUCENE-10404] #11440

Open asfimport opened 2 years ago

asfimport commented 2 years ago

While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance.
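
For reference, here's a minimal sketch of the swap being discussed (not the actual HnswGraphSearcher code; the IntIntHashMap is the HPPC-derived class copied into Lucene, and the package and helper names here are only illustrative):

```java
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.hppc.IntIntHashMap;

class VisitedTracking {

  // Current approach: one bit per graph node, sized (and cleared) for the whole
  // graph regardless of how many nodes a search actually touches.
  static boolean markVisited(FixedBitSet visited, int node) {
    if (visited.get(node)) {
      return false; // already seen
    }
    visited.set(node);
    return true; // first visit
  }

  // Candidate approach: a hash keyed by node id, whose size scales with the
  // number of nodes actually visited rather than with the total graph size.
  static boolean markVisited(IntIntHashMap visited, int node) {
    if (visited.containsKey(node)) {
      return false; // already seen
    }
    visited.put(node, 1); // the value is unused; the key's presence marks the node as visited
    return true; // first visit
  }
}
```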

Baseline: 760896 msec to write vectors
Using IntIntHashMap: 733017 msec to write vectors

I noticed search performance actually got a little bit worse with the change – that is something to look into.

For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000-15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.


Migrated from LUCENE-10404 by Julie Tibshirani (@jtibshirani), updated Aug 07 2022

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000-15,000 docs depending on the recall.

So if some searches at index time need to visit ~15k docs, then likely we'd need a hash set with a backing array that has 32,768 entries (the smallest power of two that is 2x greater than the number of items in the hash set).

Since we'd never shrink the backing array, this means that even when we only visit 1,000 nodes, we would then have to clear the 32k entries of the backing array, so the hashset#clear calls might not be free (though still cheaper than on FixedBitSet).

This makes me wonder if we should consider using a specialized hash set implementation for this use-case, e.g. by using a FixedBitSet to track used buckets instead of sentinel values (which I believe is what HPPC's IntHashSet is doing). This would make IntHashSet#add calls a bit more expensive (maybe?), but then clearing the Set would only consist of clearing the BitSet that tracks used slots instead of the backing array, so we'd need to clear 32,768 bits = 4kB of data instead of 32,768*4=128kB.
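
To make that concrete, here's a rough sketch of such a specialized set (purely illustrative, not an existing Lucene class): occupied buckets are tracked in a FixedBitSet, so clear() zeroes one bit per slot rather than one int per slot. Growth and load-factor handling are omitted for brevity.

```java
import org.apache.lucene.util.FixedBitSet;

// Open-addressing int set over a power-of-two table. A FixedBitSet marks which
// buckets are occupied, so no sentinel key is needed and clear() is cheap.
final class SparseClearIntSet {
  private final int[] keys;        // backing array (e.g. 32,768 entries = 128kB)
  private final FixedBitSet used;  // one bit per bucket (e.g. 32,768 bits = 4kB)
  private final int mask;

  SparseClearIntSet(int capacity) { // capacity must be a power of two, larger than the max size
    keys = new int[capacity];
    used = new FixedBitSet(capacity);
    mask = capacity - 1;
  }

  /** Returns true if {@code value} was not already in the set. */
  boolean add(int value) {
    int slot = mix(value) & mask;
    while (used.get(slot)) {        // probe only occupied buckets
      if (keys[slot] == value) {
        return false;               // already visited
      }
      slot = (slot + 1) & mask;     // linear probing
    }
    keys[slot] = value;
    used.set(slot);
    return true;
  }

  /** Clears only the bucket-tracking bits; the int[] is left as garbage. */
  void clear() {
    used.clear(0, used.length());
  }

  private static int mix(int v) {   // cheap bit mixer to spread the node ids
    v *= 0x9E3779B9;
    return v ^ (v >>> 16);
  }
}
```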

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

As a note, #11628 changes the index strategy so that we build the graph as each document is added, instead of waiting until 'flush'. In the PR, graph building still shares a single FixedBitSet to track the 'visited' set, but it's continuously resized since we don't know the full number of docs up-front. So maybe switching to a hash set could help even more after that change is merged.
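
As a rough illustration of that interaction (a hypothetical helper, not the PR's code): the shared bitset has to grow with the graph and is cleared in proportion to the graph's current size, even when a search only visits a small fraction of the nodes, which is where a hash set could win.

```java
import org.apache.lucene.util.FixedBitSet;

// Grows a shared "visited" bitset as documents are added to the graph.
class GrowableVisitedSet {
  private FixedBitSet visited = new FixedBitSet(64);

  /** Returns a cleared bitset large enough for a graph that now has {@code nodeCount} nodes. */
  FixedBitSet getAndClear(int nodeCount) {
    // Reallocates only when the graph has outgrown the current bitset.
    visited = FixedBitSet.ensureCapacity(visited, nodeCount);
    // Clearing touches every word of the bitset: cost scales with graph size,
    // not with the number of nodes the upcoming search will actually visit.
    visited.clear(0, visited.length());
    return visited;
  }
}
```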

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed seem to be a small speedup. I haven't had a chance to run luceneutil or look at profiler output, but here are some numbers from KnnGraphTester for an internal dataset. The numbers can be a bit noisy, but they are consistently better for the hash map version.

IntIntHashMap

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.935   0.37     10000    0       16       32         100      1566
0.965   0.49     10000    50      16       32         150      0
0.962   0.41     10000    0       16       64         100      2655
0.982   0.57     10000    50      16       64         150      0
0.941   0.38     10000    0       32       32         100      1473
0.969   0.51     10000    50      32       32         150      0
0.966   0.45     10000    0       32       64         100      2611
0.985   0.59     10000    50      32       64         150      0
0.907   0.52     100000   0       16       32         100      19850
0.940   0.72     100000   50      16       32         150      0
0.941   0.60     100000   0       16       64         100      38614
0.966   0.84     100000   50      16       64         150      0
0.916   0.55     100000   0       32       32         100      19243
0.949   0.75     100000   50      32       32         150      0
0.952   0.66     100000   0       32       64         100      38205
0.973   0.93     100000   50      32       64         150      0
0.859   0.66     1000000  0       16       32         100      273112
0.897   0.92     1000000  50      16       32         150      0
0.917   0.85     1000000  0       16       64         100      523325
0.946   1.06     1000000  50      16       64         150      0
0.874   0.80     1000000  0       32       32         100      274816
0.913   1.05     1000000  50      32       32         150      0
0.929   0.98     1000000  0       32       64         100      564762

baseline

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.935   0.38     10000    0       16       32         100      1614
0.965   0.50     10000    50      16       32         150      0
0.962   0.45     10000    0       16       64         100      2687
0.982   0.57     10000    50      16       64         150      0
0.941   0.40     10000    0       32       32         100      1504
0.969   0.51     10000    50      32       32         150      0
0.966   0.44     10000    0       32       64         100      2652
0.985   0.58     10000    50      32       64         150      0
0.907   0.54     100000   0       16       32         100      21449
0.940   0.74     100000   50      16       32         150      0
0.941   0.64     100000   0       16       64         100      39962
0.966   0.88     100000   50      16       64         150      0
0.916   0.59     100000   0       32       32         100      20554
0.949   0.80     100000   50      32       32         150      0
0.952   0.72     100000   0       32       64         100      40980
0.973   1.04     100000   50      32       64         150      0
0.859   0.75     1000000  0       16       32         100      300514
0.897   0.96     1000000  50      16       32         150      0
0.917   0.84     1000000  0       16       64         100      563259
0.946   1.12     1000000  50      16       64         150      0
0.874   0.86     1000000  0       32       32         100      303186
0.913   1.09     1000000  50      32       32         150      0
0.929   1.04     1000000  0       32       64         100      580725
0.958   1.38     1000000  50      32       64         150      0


asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Those numbers look good! Is my understanding right that these experiments use k=10, and fanout = 0 and 50? Maybe we could also try with a high fanout (like 100 or 500) to double-check the case when we need to visit a larger number of nodes.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

The default topK in KnnGraphTester is 100, so these test runs are maintaining results queues of 100 or 150 (when searching). During indexing this is driven by beamWidth, and 32/64 is lower than is typical, I think. Still I think it's encouraging that we see gains in both searching (when the queue size is 100-150) and indexing, when it is 32-64.

I won't be able to run more tests for a few days, but I agree it would be interesting to see how the gains correlate with the queue sizes. I was motivated to get a quick look! Will run some more exhaustive tests next week.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings; we can see that here the IntIntHashMap is adding cost.

baseline

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.991   0.92     10000    50      64       500        150      12068
0.996   1.11     10000    100     64       500        200      0
0.999   1.45     10000    200     64       500        300      0
1.000   1.94     10000    400     64       500        500      0
0.955   2.53     100000   50      64       500        150      463142
0.973   3.03     100000   100     64       500        200      0
0.988   4.44     100000   200     64       500        300      0
0.997   6.57     100000   400     64       500        500      0
0.895   3.44     1000000  50      64       500        150      9811483
0.920   4.33     1000000  100     64       500        200      0
0.950   6.20     1000000  200     64       500        300      0
0.974   9.53     1000000  400     64       500        500      0

IntIntHashMap

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.991   1.03     10000    50      64       500        150      13274
0.996   1.24     10000    100     64       500        200      0
0.999   1.62     10000    200     64       500        300      0
1.000   2.09     10000    400     64       500        500      0
0.955   2.47     100000   50      64       500        150      485131
0.973   3.31     100000   100     64       500        200      0
0.988   4.66     100000   200     64       500        300      0
0.997   7.26     100000   400     64       500        500      0
0.895   3.58     1000000  50      64       500        150      10173818
0.920   4.49     1000000  100     64       500        200      0
0.950   6.45     1000000  200     64       500        300      0
0.974   9.91     1000000  400     64       500        500      0

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Checking that I understand the numbers: in addition to indexing slowing down, it looks like search latency is a bit worse too. The main difference is that the graph is better connected (higher maxConn) and we explore more nodes during indexing (beamWidth). The slowdown is not huge but significant (~5% slower for both index + search).


This would support the theory that for larger numbers of 'visited' nodes, the IntIntHashMap solution doesn't perform as well. Maybe we could consider a more custom approach, as Adrien suggests?

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Yes, I think your understanding is correct. Another difference is the dimension of the vectors, which was 256 in the first case and 100 in the second. I think this second one will tend to emphasize the cost of the graph traversal, since dot-product costs will be a smaller proportion of the overall time spent. Indeed, I think some specialization would help there.

benwtrent commented 11 months ago

While exploring other things, I noticed this: https://github.com/apache/lucene/pull/12789

There is definitely room for improvement here. We should test: