apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

Try applying bipartite graph reordering to KNN graph node ids #13565

Open msokolov opened 1 month ago

msokolov commented 1 month ago

Description

We recently added BPIndexReorderer to enable sorting indexed docids using the BP approach to cluster documents with similar postings in some chosen set of fields. The same approach can also be used to sort node ids in a graph, eg see https://dl.acm.org/doi/10.1145/2939672.2939862 and https://arxiv.org/pdf/2403.01797 applies similar ideas to shards. I think we could apply the algorithm we have already implemented to HNSW node ids using document/document dot products as the distant metric that drives the partitioning and we might get some compression benefits since nearby nodes will tend to be linked together and this should reduce the overall node id deltas. We might also get memory locality benefits; possibly the memory access patterns could get a bit less random.

jpountz commented 1 month ago

We might also get memory locality benefits

This is what got me to thinking of BP for HNSW search: intuitively, it could help a lot when the dataset size exceeds the size of the page cache?

msokolov commented 1 month ago

This is what got me to thinking of BP for HNSW search: intuitively, it could help a lot when the dataset size exceeds the size of the page cache?

Yes, and as we keep compressing the vectors themselves, the graph takes up more and more space, relatively speaking

tteofili commented 1 month ago

this paper from SIGIR'24 seems to do exactly this as a first step in their Block Max Pruning technique.

mikemccand commented 1 month ago

This is what got me to thinking of BP for HNSW search: intuitively, it could help a lot when the dataset size exceeds the size of the page cache?

I think that gains might be astounding?

Similar vectors would be stored near each other in the .vec / .veq file, so paging in larger blocks / OS readahead could be very effective (though we may have to turn off MADV_RANDOM and see if it helps). It should also mean less broad exploration of the graph: once you find your "neighborhood" of similar-ish vectors you spend some effort there and more quickly get to the top K.

msokolov commented 3 weeks ago

Thinking about the implementation a bit I realized that when we reorder the vector storage for the benefit of HNSW we will still need a way to iterate over vector values in docid order, and we need to map from vector ord to docid when searching. None of the existing vector formats handles this: they are optimized for vectors that are stored in docid order. To make some progress, I'd start with an initial implementation that stores these mappings in a naive way, eg as fully-populated arrays, and we can use that to measure how much improvement we see in graph storage size and search performance. Then we could revisit and use some more efficient data structure for the ord/doc mapping. Since the ordinals would no longer be increasing with docid we can't use DirectMonotonicReader/Writer any more, but it needs to be something more like what SortedNumericDocValues does. I'm not super familiar with what we have - I wonder if there is some reusable code that would help.

jpountz commented 3 weeks ago

This sounds to me like you are considering renumbering node IDs in the HNSW graph without renumbering doc IDs. I'm curious if you considered renumbering doc IDs like BPIndexReorderer?

msokolov commented 3 weeks ago

Yes, I'd like this to be compatible with a different index sort for docids. I think it makes sense when you have hybrid search; keywords + semantic in the same query. But I suppose if you have a purely semantic index you would ideally want the sorting to be congruent. Let me think about it - maybe we can achieve both somehow?

msokolov commented 3 weeks ago

One thing that confused me about the BPIndexReorderer solution: what happens if the IWC has indexSort configured and we apply the BPReorderingMergePolicy? I guess the index sort is discarded -- or becomes a secondary sort? But then various query optimizations that rely on index sorting would become invalidated, although they might not realize it? Would it make sense to define an option to IndexSort that allows configuring BP reordering on a field rather than requiring to configure a special merge policy? Then I suppose if one applied it to a KnnVector field we would use vector-based BP.

msokolov commented 3 weeks ago

OK I'm beginning to see that it would be easier to start with a tool that rewrites the index as you did with the BPIndexReorderer -- integrating this thing as a new index format or as a new kind of IndexSort introduces all sorts of other complications that are not needed if we want to just get an idea of what the impact of such a change might be.

msokolov commented 3 weeks ago

I made this tool; while testing it I ran into some unexpected wrinkles relating to our vector format. I created a new index from an existing one, with a new docid order by:

              writer.addIndexes(SortingCodecReader.wrap((CodecReader) ctx.reader(), docMap, null));

But this recreates the HNSW graph doing unneccessary work, when all we really want to do is to renumber it. And the new graph it creates ignores the parameters that were used to create the original graph, substituting defaults, such as M=16. It makes me wonder if we ought to write M to the index as per-field metadata so it can be reused by tools such as this that may not have access to a schema, or in general when merging the field in the future.

I guess for my preliminary tests I can simply hack the DEFAULT_MAX_CONNECTIONS to fit my test data, but I'd like to hear folks' opinions on how we should address this.

update: I found we set this DEFAULT_MAX_CONN in two different places! Once in Lucene99HnswVectorsWriter and also in HnswGraphBuilder. In my case it's the latter one that seems to govern, but this is pretty unintuitive.

mikemccand commented 2 weeks ago

update: I found we set this DEFAULT_MAX_CONN in two different places! Once in Lucene99HnswVectorsWriter and also in HnswGraphBuilder. In my case it's the latter one that seems to govern, but this is pretty unintuitive.

Could we at least factor this out so they share a single static final int constant?

msokolov commented 2 weeks ago

Yes, that seems incontrovertibly correct. I was planning to include it along with some other changes, but let's fix the easy stuff first. https://github.com/apache/lucene/pull/13674

msokolov commented 2 weeks ago

Another thing I stumbled on that surprised me is that SortingCodecReader now requires loading vectors into RAM in order to do its work. It used to rely on random access to avoid this, but we stopped doing that in https://github.com/apache/lucene/pull/11964. I guess I was fully aware but now I think we need to find a way to provide random access when possible. In that issue it was stated we can't rely on readers supporting random access, but in fact I think they always do. Certainly in the main path random access is supported. Perhaps we can handle with a class cast in SortingCodecReader so it can use RA when available. But I still am not sure why we don't just make it always available.

jpountz commented 2 weeks ago

Sorry that this change is being annoying for BP, the goal of this change was to simplify the API, which I found a bit messy at the time with both a sequential and random-access API. Since the random-access API was only used internally by the file format, it felt cleaner to only expose the sequential access API to users, and also consistent with doc values. If this is being limiting for more interesting things we want to do, we could look into replacing the sequential-access API of vectors with a random-access API in Lucene 10.

msokolov commented 2 weeks ago

Hi thanks for that @jpountz, no worries; this was something we all agreed on. I'm able to continue with the "research" part of this by simply increasing heap size - it's not a blocker.

At the same time I think we might want to reintroduce random-access vector readers as a first-class API for other reasons. Even the current case of merging multiple large segments containing vectors would be affected by this, wouldn't it? Since SortingCodecReader is used by IndexWriter when merging sorted indexes, it means that in that case all vector data of segments being merged is held on heap, potentially requiring quite a lot of heap allocation when instead we could read from "disk" at the cost of some random accesses. I guess disk random accesses are generally to be avoided, but given that the alternative is to "page in" every vector page to the heap, I would think we would prefer to let the OS do the paging as usual for our index data.

msokolov commented 2 weeks ago

In the meantime, just to let you know I do have a dirt path implementation of this (multithreading not yet working, totally recomputes centroids on every iteration, etc), but it isn't yet yielding the hoped-for improvements in hnsw graph (vex file) size. I augmented KnnGraphTester to print out the average delta between node ids, and this is being cut in half, as we would expect from the BP, but it doesn't yield much reduction in index size. It might just be that the indexes are too small for VInt encoding to be impacted much if node/docid deltas were previously averaging around 55000 and are now around 25000 (still takes 3 bytes per delta on average). In this case I saw vex file size go from 21048922 to 20624822; only a few % reduction. I'm continuing to test with larger indexes and different vector data sets.

mikemccand commented 2 weeks ago

readVInt is also a hotspot at search time for *VectorQuery (see this comment) so maybe that few % reduction in storage (and maybe more for larger graphs) translates to faster searching too?

This change should also give a big improvement to cold indices (where the Lucene .vec/q is larger than free OS RAM), but, maybe only if we allow the OS's "normal" readahead to apply so that we see the benefit of the locality (similar vectors will be closer in the file now)? Lucene currently tells the OS not to readahead (MADV_RANDOM I think?) so this may be tricky to test.

jpountz commented 2 weeks ago

readVInt is also a hotspot at search time for *VectorQuery

We should use group-varint, like for tail postings?

msokolov commented 2 weeks ago

We should use group-varint, like for tail postings?

Does this choose a single bit-width for a group of postings? That sounds like it would produce savings here, yes. Also it would be faster to decode, right?

More data; on a 2M-document index over a different dataset I did see a larger size reduction, from 88970639 to 60536909, a 32% reduction, and it's looking like some good latency reduction too, even after index is warm, possibly due to the VInt decoding being cheaper? Again this seem to vary quite a bit with vector dataset and index size, but it is looking more promising now.

jpountz commented 2 weeks ago

Does this choose a single bit-width for a group of postings?

No, each posting can still have a different byte width, but it does the decoding in a way that doesn't have unpredictable conditionals.