apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Use hash set for visited nodes in HNSW search? [LUCENE-10404] #11440

Open asfimport opened 2 years ago

asfimport commented 2 years ago

While searching each layer, HNSW tracks the nodes it has already visited using a BitSet. We could look into using something like IntHashSet instead. I tried out the idea quickly by switching to IntIntHashMap (which has already been copied from hppc) and saw an improvement in index performance.
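
For reference, here's a minimal sketch of the swap being discussed (not the actual HnswGraphSearcher code; the IntIntHashMap is the HPPC-derived class copied into Lucene, and the package and helper names here are only illustrative):

```java
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.hppc.IntIntHashMap;

class VisitedTracking {

  // Current approach: one bit per graph node, sized (and cleared) for the whole
  // graph regardless of how many nodes a search actually touches.
  static boolean markVisited(FixedBitSet visited, int node) {
    if (visited.get(node)) {
      return false; // already seen
    }
    visited.set(node);
    return true; // first visit
  }

  // Candidate approach: a hash keyed by node id, whose size scales with the
  // number of nodes actually visited rather than with the total graph size.
  static boolean markVisited(IntIntHashMap visited, int node) {
    if (visited.containsKey(node)) {
      return false; // already seen
    }
    visited.put(node, 1); // the value is unused; the key's presence marks the node as visited
    return true; // first visit
  }
}
```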

Baseline: 760896 msec to write vectors
Using IntIntHashMap: 733017 msec to write vectors

I noticed search performance actually got a little bit worse with the change – that is something to look into.

For background, it's good to be aware that HNSW can visit a lot of nodes. For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000-15,000 docs depending on the recall. This number can increase when searching with deleted docs, especially if you hit a "pathological" case where the deleted docs happen to be closest to the query vector.


Migrated from LUCENE-10404 by Julie Tibshirani (@jtibshirani), updated Aug 07 2022

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> For example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search visits ~1,000-15,000 docs depending on the recall.

So if some searches at index time need to visit ~15k docs, then likely we'd need a hash set with a backing array that has 32,768 entries (the smallest power of two that is 2x greater than the number of items in the hash set).

Since we'd never shrink the backing array, this means that even when we only visit 1,000 nodes, we would then have to clear the 32k entries of the backing array, so the hashset#clear calls might not be free (though still cheaper than on FixedBitSet).

This makes me wonder if we should consider using a specialized hash set implementation for this use-case, e.g. by using a FixedBitSet to track used buckets instead of sentinel values (which I believe is what HPPC's IntHashSet is doing). This would make IntHashSet#add calls a bit more expensive (maybe?), but then clearing the Set would only consist of clearing the BitSet that tracks used slots instead of the backing array, so we'd need to clear 32,768 bits = 4kB of data instead of 32,768*4=128kB.
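
To make that concrete, here's a rough sketch of such a specialized set (purely illustrative, not an existing Lucene class): occupied buckets are tracked in a FixedBitSet, so clear() zeroes one bit per slot rather than one int per slot. Growth and load-factor handling are omitted for brevity.

```java
import org.apache.lucene.util.FixedBitSet;

// Open-addressing int set over a power-of-two table. A FixedBitSet marks which
// buckets are occupied, so no sentinel key is needed and clear() is cheap.
final class SparseClearIntSet {
  private final int[] keys;        // backing array (e.g. 32,768 entries = 128kB)
  private final FixedBitSet used;  // one bit per bucket (e.g. 32,768 bits = 4kB)
  private final int mask;

  SparseClearIntSet(int capacity) { // capacity must be a power of two, larger than the max size
    keys = new int[capacity];
    used = new FixedBitSet(capacity);
    mask = capacity - 1;
  }

  /** Returns true if {@code value} was not already in the set. */
  boolean add(int value) {
    int slot = mix(value) & mask;
    while (used.get(slot)) {        // probe only occupied buckets
      if (keys[slot] == value) {
        return false;               // already visited
      }
      slot = (slot + 1) & mask;     // linear probing
    }
    keys[slot] = value;
    used.set(slot);
    return true;
  }

  /** Clears only the bucket-tracking bits; the int[] is left as garbage. */
  void clear() {
    used.clear(0, used.length());
  }

  private static int mix(int v) {   // cheap bit mixer to spread the node ids
    v *= 0x9E3779B9;
    return v ^ (v >>> 16);
  }
}
```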

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

As a note, #11628 changes the index strategy so that we build the graph as each document is added, instead of waiting until 'flush'. In the PR, graph building still shares a single FixedBitSet to track the 'visited' set, but it's continuously resized since we don't know the full number of docs up-front. So maybe switching to a hash set could help even more after that change is merged.
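
As a rough illustration of that interaction (a hypothetical helper, not the PR's code): the shared bitset has to grow with the graph and is cleared in proportion to the graph's current size, even when a search only visits a small fraction of the nodes, which is where a hash set could win.

```java
import org.apache.lucene.util.FixedBitSet;

// Grows a shared "visited" bitset as documents are added to the graph.
class GrowableVisitedSet {
  private FixedBitSet visited = new FixedBitSet(64);

  /** Returns a cleared bitset large enough for a graph that now has {@code nodeCount} nodes. */
  FixedBitSet getAndClear(int nodeCount) {
    // Reallocates only when the graph has outgrown the current bitset.
    visited = FixedBitSet.ensureCapacity(visited, nodeCount);
    // Clearing touches every word of the bitset: cost scales with graph size,
    // not with the number of nodes the upcoming search will actually visit.
    visited.clear(0, visited.length());
    return visited;
  }
}
```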

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I tried using IntIntHashMap (mapping to 1 for visited nodes) and it does indeed seem to be a small speedup. I haven't had a chance to run luceneutil or look at profiler output, but here are some numbers from KnnGraphTester for an internal dataset. The numbers can be a bit noisy, but they are consistently better for the hash map version.

IntIntHashMap

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.935   0.37     10000    0       16       32         100      1566
0.965   0.49     10000    50      16       32         150      0
0.962   0.41     10000    0       16       64         100      2655
0.982   0.57     10000    50      16       64         150      0
0.941   0.38     10000    0       32       32         100      1473
0.969   0.51     10000    50      32       32         150      0
0.966   0.45     10000    0       32       64         100      2611
0.985   0.59     10000    50      32       64         150      0
0.907   0.52     100000   0       16       32         100      19850
0.940   0.72     100000   50      16       32         150      0
0.941   0.60     100000   0       16       64         100      38614
0.966   0.84     100000   50      16       64         150      0
0.916   0.55     100000   0       32       32         100      19243
0.949   0.75     100000   50      32       32         150      0
0.952   0.66     100000   0       32       64         100      38205
0.973   0.93     100000   50      32       64         150      0
0.859   0.66     1000000  0       16       32         100      273112
0.897   0.92     1000000  50      16       32         150      0
0.917   0.85     1000000  0       16       64         100      523325
0.946   1.06     1000000  50      16       64         150      0
0.874   0.80     1000000  0       32       32         100      274816
0.913   1.05     1000000  50      32       32         150      0
0.929   0.98     1000000  0       32       64         100      564762

baseline

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.935   0.38     10000    0       16       32         100      1614
0.965   0.50     10000    50      16       32         150      0
0.962   0.45     10000    0       16       64         100      2687
0.982   0.57     10000    50      16       64         150      0
0.941   0.40     10000    0       32       32         100      1504
0.969   0.51     10000    50      32       32         150      0
0.966   0.44     10000    0       32       64         100      2652
0.985   0.58     10000    50      32       64         150      0
0.907   0.54     100000   0       16       32         100      21449
0.940   0.74     100000   50      16       32         150      0
0.941   0.64     100000   0       16       64         100      39962
0.966   0.88     100000   50      16       64         150      0
0.916   0.59     100000   0       32       32         100      20554
0.949   0.80     100000   50      32       32         150      0
0.952   0.72     100000   0       32       64         100      40980
0.973   1.04     100000   50      32       64         150      0
0.859   0.75     1000000  0       16       32         100      300514
0.897   0.96     1000000  50      16       32         150      0
0.917   0.84     1000000  0       16       64         100      563259
0.946   1.12     1000000  50      16       64         150      0
0.874   0.86     1000000  0       32       32         100      303186
0.913   1.09     1000000  50      32       32         150      0
0.929   1.04     1000000  0       32       64         100      580725
0.958   1.38     1000000  50      32       64         150      0


asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Those numbers look good! Is my understanding right that these experiments use k=10, and fanout = 0 and 50? Maybe we could also try with a high fanout (like 100 or 500) to double-check the case when we need to visit a larger number of nodes.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

The default topK in KnnGraphTester is 100, so these test runs are maintaining results queues of 100 or 150 (when searching). During indexing this is driven by beamWidth, and 32/64 is lower than is typical, I think. Still I think it's encouraging that we see gains in both searching (when the queue size is 100-150) and indexing, when it is 32-64.

I won't be able to run more tests for a few days, but I agree it would be interesting to see how the gains correlate with the queue sizes. I was motivated to get a quick look! Will run some more exhaustive tests next week.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Here is a test using GloVe 100-dim vectors plus much more aggressive indexing settings; we can see that here the IntIntHashMap is adding cost.

baseline

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.991   0.92     10000    50      64       500        150      12068
0.996   1.11     10000    100     64       500        200      0
0.999   1.45     10000    200     64       500        300      0
1.000   1.94     10000    400     64       500        500      0
0.955   2.53     100000   50      64       500        150      463142
0.973   3.03     100000   100     64       500        200      0
0.988   4.44     100000   200     64       500        300      0
0.997   6.57     100000   400     64       500        500      0
0.895   3.44     1000000  50      64       500        150      9811483
0.920   4.33     1000000  100     64       500        200      0
0.950   6.20     1000000  200     64       500        300      0
0.974   9.53     1000000  400     64       500        500      0

IntIntHashMap

recall  latency  nDoc     fanout  maxConn  beamWidth  visited  index ms
0.991   1.03     10000    50      64       500        150      13274
0.996   1.24     10000    100     64       500        200      0
0.999   1.62     10000    200     64       500        300      0
1.000   2.09     10000    400     64       500        500      0
0.955   2.47     100000   50      64       500        150      485131
0.973   3.31     100000   100     64       500        200      0
0.988   4.66     100000   200     64       500        300      0
0.997   7.26     100000   400     64       500        500      0
0.895   3.58     1000000  50      64       500        150      10173818
0.920   4.49     1000000  100     64       500        200      0
0.950   6.45     1000000  200     64       500        300      0
0.974   9.91     1000000  400     64       500        500      0

asfimport commented 2 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Checking that I understand the numbers: in addition to indexing slowing down, it looks like search latency is a bit worse too. The main difference is that the graph is better connected (higher maxConn) and we explore more nodes during indexing (beamWidth). The slowdown is not huge but significant (~5% slower for both index + search).


This would support the theory that for larger numbers of 'visited' nodes, the IntIntHashMap solution doesn't perform as well. Maybe we could consider a more custom approach, as Adrien suggests?

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Yes, I think your understanding is correct. Another difference is the dimension of the vectors, which was 256 in the first case and 100 in the second. I think this second one will tend to emphasize the cost of the graph traversal, since dot-product costs will be a smaller proportion of the overall time spent. Indeed, I think some specialization would help there.

benwtrent commented 11 months ago

While exploring other things, I noticed this: https://github.com/apache/lucene/pull/12789

There is definitely room for improvement here. We should test: