apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Approximate nearest vector search [LUCENE-9004] #10047

Closed. asfimport closed this issue 3 years ago

asfimport commented 5 years ago

"Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.

The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).

At a high level, the algorithm works as follows. First, assume you have a graph that partially encodes the nearest-neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small-world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time, where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, i.e. both ways) to the new document, then the graph that emerges will have the desired properties.
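
To make that traversal concrete, here is a minimal, hypothetical sketch of the greedy best-first search that underlies this family of algorithms (plain Java, no Lucene types; the neighbors array, distance function, and entry point are illustrative assumptions, not the proposed implementation):

  import java.util.Comparator;
  import java.util.HashSet;
  import java.util.PriorityQueue;
  import java.util.Set;

  /** Hypothetical sketch of greedy best-first search over a nearest-neighbor graph. */
  class GreedyGraphSearch {

    /** Returns (approximately) the topK node ids closest to the query vector. */
    static int[] search(int[][] neighbors, float[][] vectors, float[] query, int topK, int entryPoint) {
      // candidates to expand, closest to the query first
      PriorityQueue<Integer> candidates =
          new PriorityQueue<>(Comparator.comparingDouble((Integer n) -> distance(vectors[n], query)));
      // current best results, farthest first so the worst can be evicted
      PriorityQueue<Integer> results =
          new PriorityQueue<>(Comparator.comparingDouble((Integer n) -> distance(vectors[n], query)).reversed());
      Set<Integer> visited = new HashSet<>();
      candidates.add(entryPoint);
      results.add(entryPoint);
      visited.add(entryPoint);
      while (!candidates.isEmpty()) {
        int next = candidates.poll();
        // stop when the closest unexpanded candidate is farther than the worst kept result
        if (results.size() >= topK
            && distance(vectors[next], query) > distance(vectors[results.peek()], query)) {
          break;
        }
        for (int friend : neighbors[next]) {
          if (visited.add(friend)) {
            candidates.add(friend);
            results.add(friend);
            if (results.size() > topK) {
              results.poll(); // drop the farthest of the kept results
            }
          }
        }
      }
      return results.stream().mapToInt(Integer::intValue).toArray();
    }

    private static float distance(float[] a, float[] b) {
      float sum = 0;
      for (int i = 0; i < a.length; i++) {
        float d = a[i] - b[i];
        sum += d * d;
      }
      return sum; // squared Euclidean distance is enough for comparisons
    }
  }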

The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around BinaryDocValues (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use SortedNumericDocValues where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to look up through an id-field indirection. This choice limits us to building a graph-per-segment since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However, graph-per-segment is a very natural fit at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
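
As a rough illustration of the vector half of that proposal, here is a hedged sketch of the kind of light wrapper described above: a float[] packed into the byte payload of a BinaryDocValuesField and decoded back given a fixed dimension (the class and method names are illustrative assumptions, not the proposed code):

  import java.nio.ByteBuffer;
  import org.apache.lucene.document.BinaryDocValuesField;
  import org.apache.lucene.util.BytesRef;

  class VectorEncoding {
    /** Pack a float[] vector into the byte payload of a BinaryDocValuesField. */
    static BinaryDocValuesField encode(String fieldName, float[] vector) {
      ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
      buffer.asFloatBuffer().put(vector);
      return new BinaryDocValuesField(fieldName, new BytesRef(buffer.array()));
    }

    /** Decode the bytes stored for one document back into a float[] of the given dimension. */
    static float[] decode(BytesRef bytes, int dimension) {
      float[] vector = new float[dimension];
      ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).asFloatBuffer().get(vector);
      return vector;
    }
  }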

At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM, since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.

I think we want to express this API to users as a single joint KnnGraphField abstraction that joins together the vectors and the graph in a single field type. Mostly it just looks like a vector-valued field, but it has this graph attached to it.

I'll push a branch with my POC and would love to hear comments. It has many nocommits, the basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors; on my laptop I indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with an exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.

[1] https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164

 

Migrated from LUCENE-9004 by Michael Sokolov (@msokolov), 16 votes, resolved Dec 30 2020. Attachments: hnsw_layered_graph.png

asfimport commented 5 years ago

ASF subversion and git services (migrated from JIRA)

Commit 2aba8b89796150760fa8939452b5d9271d7b5b7b in lucene-solr's branch refs/heads/LUCENE-9004 from Michael Sokolov https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2aba8b8

LUCENE-9004: approximate nearest vector search (WIP)

asfimport commented 5 years ago

Trey Grainger (@treygrainger) (migrated from JIRA)

This looks really awesome @msokolov. I wasn't too familiar with HNSW before, but just checked out the ANN benchmarks and it definitely looks like the most effective option of the ones included (best QPS / recall numbers): https://github.com/erikbern/ann-benchmarks

On my end in the Solr issue, I think we want to start with the vector scoring problem first and come back to the ANN problem afterward, which seems like your primary focus at the moment.  

Is your VectorDocValuesField pretty solid at this point? Wondering if we should start with that as the basis for the scoring work, focus on the query / scoring part, and later circle back to include the ANN filtering.

 

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Pretty cool! I don't know HNSW so I can't comment on that part but it made me wonder about a couple things:

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I agree with the general idea, Trey, of getting the vector scoring support in place, and the data structures to support that, as the first step. The VectorDocValues I posted is pretty good, I think, for forward-only iteration over vectors. It doesn't do much apart from recording the dimension and providing methods for retrieving float[] per document/field. The BinaryDocValues format may spend some extra bytes recording the per-document value lengths, or it may optimize for the case where all lengths are equal - at any rate it certainly could, without the need for a separate class.

I think it is important though to have some idea how we are going to use these data structures in algorithms since that will inform the data design. For example, while coding this up I learned about the importance of supporting random access, something our DocValues implementations are not optimized for. To Adrien's point about a specialized Format, I think that would enable more optimal random access. The DocValues API basically requires creating a new DocValues whenever docids go backwards, so it's not ideal for this use case, although for scoring only it's probably fine, since scoring will be done in docid order. I didn't try to bite that all off at once, drawing the line at making no codec-level changes for this POC.

Re: supporting other numeric types in vectors - I think this probably adds needless complication at this point; the API can be evolved later. We should follow the example of Points, which I think only encodes float? The precision / size tradeoff seems right for storage, and at this stage we should favor flexibility over bit-packing optimizations.

I do see that the Query implementation can be very challenging! Honestly, I had not fully appreciated the problem, so thanks for the foresight Adrien. We can limit the problem by definition though, framing the Query as KnnGraphQuery(float[] target, int topK) where topK is not necessarily the same as the number of results requested. This topK is a filtering criterion, defining a radius of inclusion for the query, and is required to be supplied to the constructor of the query. With that definition, we are free to retrieve all results up front, rank them by docid and iterate over them. I'm not sure what else we can do that's sensible? We can expand backwards over the graph, to include an ever-widening circle (bigger topK), but the expansion will not be related to docid order at all.
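
To illustrate that framing (purely hypothetical, not the POC code): once the graph search has produced its approximate top-K docids, they can be sorted and exposed through a standard forward-only iterator, e.g.:

  import java.util.Arrays;
  import org.apache.lucene.search.DocIdSetIterator;

  /** Hypothetical sketch: expose the approximate top-K hits of a graph search in docid order. */
  class SortedHitsIterator extends DocIdSetIterator {
    private final int[] docs; // docids returned by the graph search, sorted ascending
    private int index = -1;

    SortedHitsIterator(int[] graphSearchHits) {
      this.docs = graphSearchHits.clone();
      Arrays.sort(this.docs);
    }

    @Override
    public int docID() {
      if (index < 0) return -1;
      return index < docs.length ? docs[index] : NO_MORE_DOCS;
    }

    @Override
    public int nextDoc() {
      if (++index >= docs.length) {
        index = docs.length; // exhausted
        return NO_MORE_DOCS;
      }
      return docs[index];
    }

    @Override
    public int advance(int target) {
      while (++index < docs.length) {
        if (docs[index] >= target) {
          return docs[index];
        }
      }
      index = docs.length;
      return NO_MORE_DOCS;
    }

    @Override
    public long cost() {
      return docs.length;
    }
  }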

Here's a strategy for getting to something committable:

  1. Get the VectorDocValues tightened up and usable for ranking. We should be able to commit this while the remainder is cooking. Even if we replace it ultimately within the context of nearest-vector search, I think it can have some value on its own for other use cases, and we can continue to use it here for a while.
  2. Finish testing graph segment-merge for sorted indexes and in the face of deletions.
  3. Get the HNSW implementation in better shape. Today it does not incorporate the "hierarchical" part of the algorithm, and I haven't done any testing with real data, just synthetic vectors. I think we do want to do some more experimentation here before locking down an encoding format. One of the oddities of this algorithm is that you run multiple traversals across the graph, starting from "random" entry points. In the hierarchical version, these random points are chosen from a top-level "tier" that have long-distance links, and then there is a tree-like hierarchy of levels in the graph that enables more efficient traversal. I think this tiering needs to be explored in order to inform the storage format. Until we understand that better, we can continue to use the DocValues storage for both vectors and graph links.
  4. Implement a KnnGraphQuery as described above
  5. Return to the storage layer and lock down an efficient format. I think there could be some improvement from using a memory-mapped FloatBuffer to directly read floats from disk and avoid the copying into an external float[], and as I said we want to better support random access and avoid copying the entire contents of a DocValues field into RAM when we merge. It seems a new Format may be needed in order to enable that.

I won't work on this for a little while since I have some heavy day-job responsibilities next week, then I will be at ApacheCon EU, then two weeks holiday, but if someone else wants to pick up the VectorDocValues or make something better, I'll be happy :) In any case I'll return to this eventually and push ahead.

asfimport commented 5 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I also really look forward to getting this feature into Lucene!

FWIW, it seems that there are several versions of the HNSW paper, and I think this is the latest revision (v4). The pseudo-algorithm parts have been refined or evolved from the original version (the one @msokolov introduced here), though I've not yet closely checked the diffs and have no idea whether they have significant impact here.

https://arxiv.org/abs/1603.09320

(I will check out / play with the PoC branch and share my findings, if it would be useful.)

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@mocobeta it would be great if you helped. What I posted here is based on the very first version of the algorithm - it is not hierarchical yet! I intend to work it up to the version described in the "efficient and robust..." paper you linked. One thing this definitely needs is testing with realistic data. For example it would be awesome if we could use some standard word embeddings to create an open-source test set based on wikipedia? Or maybe there is an existing public test set from academia we could use?

asfimport commented 5 years ago

Doug Turnbull (@softwaredoug) (migrated from JIRA)

@msokolov - should we hook into ANN Benchmarks? Seems to be what people are using to compare these implementations https://github.com/erikbern/ann-benchmarks

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@softwaredoug that ANN benchmarks looks very promising!

asfimport commented 5 years ago

David Smiley (@dsmiley) (migrated from JIRA)

The DocValues API basically requires creating a new DocValues whenever docids go backwards, so it's not ideal for this use case

I thought it was a real shame that the random access API to DocValues was removed. I'm skeptical it was actually necessary for sparse docValues, but that was the rationale. Anyway, for vectors, do you think we need an entire new Codec-level format, or can we just have a new type of DocValues that is random-access BinaryDocValues? For example imagine a DocValuesProducer.getBinaryRandomAccess()? And/or imagine BinaryDocValues allowing you to call advance() on any docID and have this be acceptable – needn't be forward-only?
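
To illustrate the shape such an API might take (neither this interface nor getBinaryRandomAccess() exists today; this is just a sketch of the idea):

  import java.io.IOException;
  import org.apache.lucene.util.BytesRef;

  /** Hypothetical sketch of a random-access view over binary doc values. */
  interface RandomAccessBinaryDocValues {
    /** Returns the value for the given docID, or null if the document has no value. */
    BytesRef get(int docID) throws IOException;
  }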

asfimport commented 5 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Hi, I've been trying to understand the Hierarchical NSW paper, its previous work on the NSW model (https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059), and the PoC implementation.

Just to clarify the discussion (and my understanding) here, I'd like to leave my comments about data structures/encodings.

hnsw_layered_graph.png

My feeling is that at least we need a new Format to represent 2), the layered undirected graph. The current PoC branch encodes Layer 0 with SortedNumericDocValues; however, since we eventually intend to introduce multiple layers, it wouldn't be possible to represent those with existing doc values. (Please correct me if this is not true :) )

Or would it be possible to introduce a new, dedicated auxiliary data structure / algorithm for HNSW apart from postings lists, like FST? I mean, for layered undirected graph construction/traversal, we could have an o.a.l.util.hnsw package. It's just an idea and I'm now attempting to delve into it... @msokolov have you considered this so far?

asfimport commented 5 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Indeed we need random access - it's not related to 1), the vector (the value) representation itself, but is required for 2), to traverse/search the "undirected" graph.

ah sorry it would not be correct, we would need random access for both of the vector and the graph... Please ignore this line.

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@dsmiley I didn't reply for a while because I'm not sure what the best approach is to dealing with random access. The problems with the current approach are:

  1. The DocValues API contract forces you to create a new DV iterator when you want to go backwards, wasting objects and any work done to initialize the iterator (both probably small, but you do this pretty often)
  2. The DocValues on-disk encoding is not the most efficient for random access; it is heavily optimized for efficient forward-only iteration.
  3. Updates to DocValues are handled by allocating objects that are applied on merge, but we make a great many updates to the graph while indexing and this isn't tenable, so we need to be able to update in place.

For the implementation I think we'd want to have some kind of efficient lookup addressing structure mapping docid -> values payload, with the values themselves in a separate densely-packed block. The simplest thing would be indirection through an array with an entry (int) for every document, but that could be made more space-efficient for cases where the doc values are sparsely populated. I guess such a concept could be applied to any DocValues, and we could use DocValues API to access such a thing, but since most use cases are handled well with the forward iteration, it would make sense to me to explore this implementation in a separate special-use project? Also I haven't yet seen any particular benefit in using the DocValues API for graphs: might as well make something new?
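
A minimal sketch of that indirection idea, assuming fixed-dimension vectors packed densely into a flat block (all names here are illustrative, not the proposed implementation):

  /**
   * Hypothetical sketch: random access from docid to a fixed-dimension vector via a
   * per-document offset table pointing into a densely packed value block.
   */
  class PackedVectorLookup {
    private final int[] ordByDoc;   // one entry per doc; -1 if the doc has no vector
    private final float[] values;   // all vectors, densely packed, `dimension` floats each
    private final int dimension;

    PackedVectorLookup(int[] ordByDoc, float[] values, int dimension) {
      this.ordByDoc = ordByDoc;
      this.values = values;
      this.dimension = dimension;
    }

    /** Copies the vector for docID into dest; returns false if the doc has no vector. */
    boolean vector(int docID, float[] dest) {
      int ord = ordByDoc[docID];
      if (ord < 0) {
        return false;
      }
      System.arraycopy(values, ord * dimension, dest, 0, dimension);
      return true;
    }
  }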

The proposal to create a random-access DocValues would require no change to the API (we have all the methods we need), but the assumption that docids go forwards is deeply baked in at this point not only in the current implementations, but also in the test framework and I think in some abstractions (like DocValuesIterator) that are shared by all implementations. Maybe that could be relaxed in a sane way? I haven't really tried to see. Or we could make something new. Would that be a Format specific to this graph use case? Or a RandomAccessDocValues that can support a few different cases? I tend to not want to create abstractions until we have at least a few concrete cases for them.

@mocobeta - thanks for the excellent exposition (and picture!). You are right, the branch I posted doesn't encode that hierarchical structure yet. I think it could be done using multiple SortedNumericDocValues fields, one for each level. I'm not saying this is a good design, merely that it's possible.

I think the idea of a custom format to represent the graph is pretty much what you are describing? It gives you the freedom to just do something new. As I understand it, existing formats are things like postings, doc-values, points, or stored fields - a separate on-disk "file" effectively.

asfimport commented 5 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

maybe it could be helpful to have a sort of baseline to test this stuff with; Jimmy Lin and I worked on a plain Lucene-based ANN implementation within Anserini, see https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md . In my experience what makes this stuff hard is keeping both good recall (as precision could be refined in a reranking step) and good (enough) speed. I agree that for HNSW it seems like we would need a dedicated format for the graph.

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> maybe it could be helpful to have a sort of baseline

absolutely, thanks for pointing that out @tteofili. This HNSW algorithm is supposed to be state-of-the-art according to the papers; let's compare and make sure our implementation realizes that. Yes, my first experiment was horrendously slow, but had 100% precision. There are a few tunable parameters; I think tuning may vary with datasets too, so there will be some interesting benchmarking work to do here.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

So after I wrote this:

The DocValues on-disk encoding is not the most efficient for random access; it is heavily optimized for efficient forward-only iteration.

I went and looked more closely at the Lucene80DocValuesFormat, and I think I spoke too soon - bad habit! The formats described there actually seem like a pretty reasonable basis for exposing random access. Sure, it's not free, but if you were to go about implementing a concise data structure for a random access API over non-fully-occupied (i.e. dense or sparse) per-document data, I don't see that it would end up looking a whole lot different from this under the hood. E.g. we have jump tables for seeking to the block enclosing a doc, and then jump tables within the block (for DENSE encoding). I'll open a separate issue for enabling random access in IndexedDISI.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

In the meantime, @mocobeta posted this PR adding HNSW!

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Thanks for mentioning it. I was working on this issue for a couple of weeks and here is my WIP/PoC branch (actually it's not a PR, because the "Query" part is still missing): https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn

I borrowed @msokolov's idea but took a different implementation approach:

It works but there are indexing performance concerns (due to costly graph construction). Anyway I hope I can create a PR with working examples before long...

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Just a status update: my PoC branch is still at a pretty early stage and works only on one segment, but it can now index and query arbitrary vectors via this example code. The newly added KnnGraphQuery is an extension of the Query class, so it should be able to be combined with other queries, with some limitations, because the knn query by nature cannot score the entire dataset. Indexing performance is terrible for now (it takes a few minutes for hundreds of thousands of vectors with 100 dims on commodity PCs), but searching doesn't look too bad (it takes ~30 msec for the same dataset) thanks to the skip list-like graph structure.

On my current branch I wrapped BinaryDocValues to store vector values. However, exposing random access capability for doc values (or its extensions) can be controversial, so I'd like to propose a new codec which combines 1. the HNSW graph and 2. the vectors (float arrays).

The new format for each vector field would have three parts (in other words, three files in a segment). They would look like:


 Meta data and index part:
 +--------------------------------------------------+
 | meta data                                        |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 | doc id | offset to first friend list for the doc |
 +--------+-----------------------------------------+
 |              ......                              |
 +--------+-----------------------------------------+

 Graph data part:
 +-------------------------+---------------------------+---------+-------------------------+
 | friends list at layer N | friends list at layer N-1 |  ...... | friends list at layer 0 | <- friends lists for doc 0
 +-------------------------+---------------------------+---------+-------------------------+
 | friends list at layer N | friends list at layer N-1 |  ...... | friends list at layer 0 | <- friends lists for doc 1
 +-------------------------+---------------------------+---------+-------------------------+
 |                            ......                                                       | <- and so on
 +-----------------------------------------------------------------------------------------+

 Vector data part:
 +----------------------+
 | encoded vector value | <- vector value for doc 0
 +----------------------+
 | encoded vector value | <- vector value for doc 1
 +----------------------+
 |   ......             | <- and so on
 +----------------------+

The graph data (friends lists) is relatively small, so we could keep all of it on the Java heap for fast retrieval (though some off-heap strategy might be required for very large graphs). The vector data (vector values) is large and only a small fraction of it is needed when searching, so it should be kept on disk and accessed in some on-demand style.
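
As a sketch of that on-demand access pattern (hypothetical; it assumes fixed-length vectors written contiguously in the vector data part above, and only uses the public seek/read methods of IndexInput):

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import org.apache.lucene.store.IndexInput;

  /** Hypothetical reader for the vector data part sketched above: seek to and read one vector. */
  class OnDiskVectorReader {
    private final IndexInput vectorData; // positioned at the start of the vector data part
    private final int dimension;
    private final int bytesPerVector;

    OnDiskVectorReader(IndexInput vectorData, int dimension) {
      this.vectorData = vectorData;
      this.dimension = dimension;
      this.bytesPerVector = dimension * Float.BYTES;
    }

    /** Reads the vector stored at the given dense ordinal (0, 1, 2, ... in write order). */
    float[] vector(int ordinal) throws IOException {
      vectorData.seek((long) ordinal * bytesPerVector);
      byte[] bytes = new byte[bytesPerVector];
      vectorData.readBytes(bytes, 0, bytesPerVector);
      float[] vector = new float[dimension];
      ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
      return vector;
    }
  }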

Feedback is welcome.

And I have a question about introducing new formats - is there a way to inject a custom XXXFormat into the indexing chain so that we can add this feature without any changes to lucene-core?

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Hi, it seems that some devs are strongly interested in this issue and I have privately received feedback (and expectations). So I just wanted to share my latest WIP branch: https://github.com/mocobeta/lucene-solr-mirror/commits/jira/LUCENE-9004-aknn-2 And here is a usage code snippet for it: https://gist.github.com/mocobeta/a5b18506ebc933c0afa7ab61d1dd2295

I introduced a brand new codec and indexer for vector search, so this no longer depends on DocValues, though it's still at a pretty early stage (in particular, segment merging is not yet implemented).

I intend to continue working on it and I'll do my best, but to be honest I am not sure whether my approach is the best, or whether I can create a patch good enough to be merged into Lucene core... I'd welcome someone taking it over in a different, more sophisticated/efficient way. My current attempt might be useful as a reference or a starting point.

 

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@mocobeta you are too modest! The LUCENE-9004-aknn-2 branch is in pretty good shape! It is functionally correct now, implementing the hierarchical version of the paper cited in the LUCENE-9004 issue. Also, I believe that with the patch I posted there we now have merging working, and I think search across multiple segments falls out naturally since you implemented a Query that can be collected in the usual way across segments.

I also did some comparisons with the C/C++ version in https://github.com/nmslib/hnswlib, the reference implementation, and got similar overlap results with vanilla hyper-parameter choices, so I am pretty confident you have faithfully reproduced that algorithm. Now it has to be said that performance is not what we would want - it's quite a bit slower than the C/C++ version. I haven't had a chance to dig in to the cause yet, but I suspect we could be helped by ensuring that vector computations are done using vectorized instructions. We might also be able to reduce object instantiation here and there.

I think it's time to post back to a branch in the Apache git repository so we can enlist contributions from the community here to help this go forward. I'll try to get that done this weekend

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

@msokolov thanks. I myself have also tested it with a real dataset generated from recent snapshot files of Japanese Wikipedia. Yes, it seems "functionally correct", although we should do more formal tests for measuring recall (effectiveness).

I think it's time to post back to a branch in the Apache git repository so we can enlist contributions from the community here to help this go forward. I'll try to get that done this weekend

OK, I pushed the branch to the Apache Gitbox so that others who want to get involved in this issue can check it out and give it a try. While I feel it's far from complete :), I agree that the code is ready to take in contributions from the community. https://gitbox.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/jira/lucene-9004-aknn-2 This also includes a patch from Xin-Chun Zhang. Note: currently the new codec for the vectors and kNN graphs is placed in o.a.l.codecs.lucene90; I think we can move it to the proper location when this is ready to be released.

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Hello and thank you for this very exciting work! We have been doing research into nearest neighbor search on high-dimensional vectors and I wanted to share some thoughts here in the hope that they're helpful.

Related to Adrien's comment about search filters, I am wondering how deleted documents would be handled. If I'm understanding correctly, a segment's deletes are applied 'on top of' the query. So if the k nearest neighbors to the query vector all happen to be deleted, then the query won't bring back any documents. From a user's perspective, I could see this behavior being surprising or hard to work with. One approach would be to keep expanding the search while skipping over deleted documents, but I'm not sure about the performance + accuracy it would give (there's a short discussion in the hnswlib repo on this point).

The recent paper Graph based Nearest Neighbor Search: Promises and Failures compares HNSW to other graph-based approaches and claims that the hierarchy of layers only really helps in low dimensions (Figure 4). In these experiments, they see that a 'flat' version of HNSW performs very similarly to the original above around 16 dimensions. The original HNSW paper also cites the hierarchy as most helpful in low dimensions. This seemed interesting in that it may be possible to avoid some complexity if the focus is not on low-dimensional vectors. (It also suggests that graph-based kNN is an active research area and that there are likely to be improvements + new approaches that come out. One such new approach is DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node).

On the subject of testing recall, I'm working on adding sentence embedding and deep image descriptor datasets to the ann-benchmarks repo. Hopefully that will help in providing realistic shared data to test against.

 

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Hi @jtibshirani, thanks for your comments/suggestions. I will check the links you mentioned.

It also suggests that graph-based kNN is an active research area and that there are likely to be improvements + new approaches that come out.

Yes, there are so many proposed methods and their variants in this field. Currently I'm not fully sure what the most feasible approach for Lucene is.

Also, I just noticed an issue that proposes a product quantization based approach - roughly speaking, it may need less disk and memory space than graph based methods like HNSW, but incurs more indexing and query time cost: #10177

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I'll second the thanks, @jtibshirani . There's clearly active work going on, and it may be too soon to declare a single winner in this complex space. I do think there is a need to focus on higher-dimensional cases since in Lucene there is already well-developed support for dim <=8 via KD-tree, but nothing for higher dimensions.

One thing that surprises me a bit about some evaluations I'm seeing is that they report Precision@1 (and sometimes even when operating over the training set?!). I wonder if anyone has looked at a metric that includes the top 10 (say) and penalizes more distant matches? For example, MSE over normalized vectors would let one distinguish between results that have the same "precision" yet where one has vectors that are closer than the other.

Re: deletions, yeah, we have not addressed that. The only thing that makes sense to me for deletions is to prune them while searching. TBH I'm not sure how to plumb livedocs into the query, or if this is somehow untenable? Supposing we do that, it would impose some operational constraints in that if a lot of documents are deleted, performance will drop substantially, but I think that is probably OK. Users will just have to understand the limitation? We'll have to understand the impact as deletions accumulate.

I think the issue about filtering against other queries is more challenging since we don't typically have an up-front bitset to filter against. In a sense the ANN query is the most expensive because every document is a potential match. Perhaps the thing to do is come up with an estimate of a radius R bounding the top K (around the query vector) based on the approximate top K we find, and then allow advancing to a document, even if it was not returned by graph search, so long as its distance is <= R. This would not truly answer the question "top K closest documents satisfying these constraints," though. For that I don't see what we could do other than forcing the computation of a bitset, and then passing that into the graph search (like for deletions).

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

Hi @msokolov, I created a related issue (LUCENE-9136) that attempts to introduce the IVFFlat algorithm to Lucene. IVFFlat is widely used in many fields, from computer vision to speech recognition, for its smaller index and memory usage. The algorithm can also be accelerated using GPU parallel computing, making it faster and more accurate than HNSW.

My personal branch is available on github: https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136. The index format of IVFFlat (one meta file with suffix .ifi) is described in the class Lucene90IvfFlatIndexFormat (https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90IvfFlatIndexFormat.java). In my implementation, the k-means clustering is optimized when the number of vectors is very large (e.g. > 200,000 per segment): a subset selected after shuffling is used for training, thereby decreasing time and memory. The insertion performance of IVFFlat is better because nothing extra happens on insertion, while HNSW needs to maintain the graph. However, IVFFlat consumes more time in flushing because of the clustering.

Even though HNSW uses a cache for graphs while IVFFlat has no cache, my test cases show that the query performance of IVFFlat is better than HNSW, and its recall is pretty high (avg. time < 10ms and recall > 96% over a set of 50,000 random vectors with 100 dimensions). My test class for IVFFlat is under the directory https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/. The performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be fixed and I would like to hear more comments.

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

Is it possible to merge #10177 with this issue?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

> Is it possible to merge #10177 with this issue?

This is already gigantic - what would be the benefit of merging?

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

"This is already gigantic - what would be the benefit of merging?"

– Yes, I agree that it's gigantic. It's only a personal proposal based on the following considerations,

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Xin-Chun Zhang, I haven't had a chance to review the IVFFlat work you did, so I'm not sure how much code-sharing there would likely be, but I tend to think we should allow these to evolve in parallel until we have something ready to commit – and we should aim to commit the minimum viable thing, and then add to it. Maybe we'd commit IVFFlat first, I don't know. Is it making life difficult to keep them separate?

asfimport commented 4 years ago

Yury Malkov (@yurymalkov) (migrated from JIRA)

Hi, I am an author of the HNSW paper. Feel free to ping me if you have any question about it or graph algorithms in general.

Xin-Chun Zhang, I am a little bit surprised that IVF is faster in your tests. I suspect that might be due to the random high-dimensional data.

Have you tried comparing them on real data?

 

Also, thanks for looking into HNSW :)

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Have you tried comparing them on real data?

+1 to test these approaches on actual vectors ... when you test on synthetic data you draw synthetic conclusions!

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

 "Is it making life difficult to keep them separate?"

@msokolov No, we can keep them separate for the present. I have merged your branch (https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2) into my personal github branch (https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136) in order to compare IVFFlat and HNSW, and I reused some of the work that @mocobeta and you did. Code refactoring will be required when we are going to commit.

 

"Have you tried comparing them on real data?"

@yurymalkov, @mikemccand Thanks for your advice. I haven't done it yet, and will do it soon.

 

Update – Feb. 4, 2020

I have added two performance test tools (KnnIvfPerformTester/KnnIvfAndGraphPerformTester) to my personal branch. The sift1M dataset (1,000,000 base vectors with 128 dimensions, http://corpus-texmex.irisa.fr/) is used for the test. Top 1 recall performance of IVFFlat is as follows; a new IndexReader was opened for each query.

centroids=707

nprobe | avg. search time (ms) | recall (%)
8      | 71.314                | 69.15
16     | 121.7565              | 82.3
32     | 155.692               | 92.85
64     | 159.3655              | 98.7
128    | 217.5205              | 99.9

centroids=4000

nprobe | avg. search time (ms) | recall (%)
8      | 56.3745               | 65.35
16     | 59.5435               | 78.85
32     | 71.751                | 89.85
64     | 90.396                | 96.25
128    | 135.3805              | 99.3

Unfortunately, I couldn't obtain the corresponding results for HNSW due to an out-of-memory error on my PC. A special case with 2,000 base vectors demonstrates that IVFFlat is faster and more accurate. HNSW may outperform IVFFlat on larger data sets when more memory is available, as shown in https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors.

asfimport commented 4 years ago

Yury Malkov (@yurymalkov) (migrated from JIRA)

Xin-Chun Zhang

Are you sure the results are accurate?

They are way higher than they should be. E.g. HNSW (in the hnswlib implementation) got recall > 0.9 on SIFT1M in sub-0.1ms time in one thread, even on old CPUs, and it uses less than 0.5 GB of RAM for SIFT1M.

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

Are you sure the results are accurate?

The aforementioned results were generated by the test tools, as shown in the attached file. It seems that the Java versions of IVFFlat and HNSW are quite a bit slower than the C/C++ versions. There may be several causes, e.g. differences between the programming languages, different parameter settings, and so on.

The nmslib (hnsw) implementation did run fast and consumes less memory. I will check why it throws an OOM exception.

(attached screenshot: 屏幕快照 2020-02-04 上午10.38.26.png)

 

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Unfortunately, I couldn't obtain the corresponding results for HNSW due to an out-of-memory error on my PC.

The current HNSW implementation requires a 4GB heap for a 1M dataset of 200-dimension vectors (we need to reduce the memory consumption). The default heap size given to Java processes depends on the platform, but for most commodity PCs it wouldn't be so large, so you will see OOM if you do not set the -Xmx JVM arg.
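
For reference, the raw float storage alone at that scale works out to (the only assumption being 4 bytes per float):

  $1{,}000{,}000 \times 200 \times 4\ \text{bytes} = 8 \times 10^{8}\ \text{bytes} \approx 0.8\ \text{GB}$

so most of the 4GB figure presumably goes to the graph structure and the per-object overhead of the current on-heap representation rather than to the vectors themselves.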

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

The default heap size given to Java processes depends on the platform, but for most commodity PCs it wouldn't be so large, so you will see OOM if you do not set the -Xmx JVM arg.

@mocobeta I did set the JVM option to "-Xmx8192m", but the OOM error still appears. I guess there may be a memory leak in the static member "cache" of HNSWGraphReader. The key of "cache" is composed of the field name and the context identity, where the context identity may vary from query to query. When I execute queries multiple times, the static cache size increases rapidly (the cache size equals the number of queries), resulting in OOM.

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

The context is created on a per-reader basis, not per query. You didn't share your test code, but I suspect you open a new IndexReader every time you issue a query? I think if you reuse one index reader (index searcher) throughout the test, the memory usage is stable between 2 and 4 GB. Anyway, yes, the static cache (for the graph structure) isn't a good implementation; that is one reason why I said the HNSW branch is still at a pretty early stage...

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

You didn't share your test code, but I suspect you open a new IndexReader every time you issue a query?

@mocobeta The test code can be found at https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/KnnIvfAndGraphPerformTester.java. Yes, I opened a new reader for each query, in the hope that IVFFlat and HNSW would be compared under fair conditions, since IVFFlat does not have a cache. I now realize it may lead to OOM; after replacing it with a shared IndexReader the problem is resolved.

 

Update – Top 1 in-set (the query vector is in the candidate data set) recall results on the SIFT1M data set (http://corpus-texmex.irisa.fr/) for IVFFlat and HNSW are as follows,

IVFFlat (no cache, reuse IndexReader)

 

nprobe | avg. search time (ms) | recall (%)
8      | 13.3165               | 64.8
16     | 13.968                | 79.65
32     | 16.951                | 89.3
64     | 21.631                | 95.6
128    | 31.633                | 98.8

 

HNSW (static cache, reuse IndexReader)

avg. search time (ms) | recall (%)
6.3                   | 20.45

It can readily be seen that HNSW performs much better in query time. But I was surprised that the top 1 in-set recall of HNSW is so low. It shouldn't be a problem of the algorithm itself, but more likely a problem of the implementation or the test code. I will check it this weekend.

 

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

As another option for running benchmarks I wanted to give more information on the ann-benchmarks repo. It’s a shared set of benchmarks developed by the kNN search community, and contains a set of pretty realistic datasets, as well as connectors to existing kNN libraries like FAISS. 

I pushed a branch that hooks up the Lucene HNSW prototype to ann-benchmarks: https://github.com/jtibshirani/ann-benchmarks/pull/1. It’s nice to have everything in one place, as we can compare prototypes against reference implementations from FAISS to check that the recalls match. This comment contains results of running both the HNSW prototype and FAISS’s implementation against a small test dataset. It looks like the prototype gives ~5% lower recall for the same parameter values, which suggests there’s room for small fixes/improvements in terms of the algorithm. (I might have misunderstood the default parameter values though, any corrections are welcome!)

Some more background:

Feel free to ping me/ comment on the PR if you spot issues or have trouble getting it to work.

asfimport commented 4 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

@jtibshirani What kinds of trouble did you have with forceMerge(1)? It can certainly take a long time to complete, but I've rarely seen other problems with it. Assuming you're using the default TieredMergePolicy that is...

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

@ErickErickson the benchmark force merges the index to one segment to ensure we're doing an 'apples to apples' comparison between the Lucene data structure and the FAISS implementation. It was taking > 8 hours to complete on the GloVe dataset, and it seems that the time is spent merging the kNN graph. If you have suggestions on the benchmark set-up, maybe we could discuss on the PR so I don't add a lot of noise to the issue?

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@ErickErickson this HNSW graph format is expensive to merge since it must recreate its graph on every merge, for all the documents. @mocobeta had an interesting idea about how to save some of that work when merging by joining together per-segment graphs, but I don't think it preserved the right graph properties. I'm optimistic we can find a way to do something like that though.

asfimport commented 4 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Which is even more reason not to have a more-than-one-pass optimize, correct?

Which brings up an interesting end-to-end indexing speedup. Limited use-cases, but...

Do the original indexing with NoMergePolicy, playing games with when Lucene flushes segments; you'd want the initial flush to be as large as possible. The idea here is to create, say, 5G segments (or whatever) during indexing without merging. Then the optimize step merges all the segments at once.

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

It can readily be seen that HNSW performs much better in query time. But I was surprised that the top 1 in-set recall of HNSW is so low. It shouldn't be a problem of the algorithm itself, but more likely a problem of the implementation or the test code. I will check it this weekend.

Thanks Xin-Chun Zhang for measuring. I noticed I might have made a very basic mistake when comparing neighborhood nodes; maybe some inequality signs should be flipped :/ I will do recall performance tests with GloVe and fix the bugs.

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

As mentioned by @mocobeta, the static cache (for the graph structure) isn't a good implementation. I have some ideas to share,

  1. Do not cache the whole graph structure in memory. We could cache the entry points rather than the whole graph structure. When searching for the nearest neighbors, we only care about the neighbors of the current point, and each point is visited only once. Therefore, we could read its neighbors when the point is actually visited. There's a simple implementation of this for HNSW (marked as searchNeighborsV2) in my personal branch (https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/java/org/apache/lucene/util/hnsw/HNSWGraphReader.java). The implementation could be further optimized; for example, we could store and read neighbors by layer, because we only need the neighbors in the current search layer.
  2. If higher search performance is preferred, we could keep the whole graph structure in graph reader (similar to the implementation of Lucene90IvfFlatIndexReader, https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90IvfFlatIndexReader.java). The on-heap cache would be released when the reader is closed.

It's likely not the best solution to the static cache, just for discussions.

asfimport commented 4 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I like this issue a lot and all the discussions around it, thanks all for working on this!

I'd like to share some of the findings we had while working on similar solutions for Elasticsearch. The known limitations of graph-based approaches are the computational cost of building the graph and the memory needed to store the neighbors per node. Regarding the computational cost, inserting a new node is equivalent to a query in this solution, so we should expect the number of comparisons needed to insert a node in the graph to grow logarithmically with the size of the data. We've made some tests to check the number of comparisons needed at the million scale and found that this number doesn't vary too much across the datasets present in the ann-benchmarks repo. To get good performance at search time, the efConstruction parameter needs to be set high (from 200 to 800 in the best results) while M (the max number of neighbors per node) can remain lower (16 to 64). This led to around 10k comparisons on average for the ann-benchmarks datasets in the 1-10M range.

10K comparisons in the 1-10M range at query time is very compelling. Users can also trade some recall for performance and get acceptable results in the 1-10k range. However, these trade-offs are more difficult to apply at build time, where the quality of the graph is important to maintain. I mainly see this cost as static due to its logarithmic growth, which is verified in the papers around small-world graph approaches. This is the main trade-off users need to make with graph-based approaches: building will be slow.

 

Regarding the memory consumption, I have mixed feelings. The fact that we need to keep the M nearest neighbors per node should not be a problem at search time, since the graph can be static and accessed through a file. The random-read nature of a query in the graph will require disk seeks and reads, but we retrieve M neighbors each time, so we're not talking about tiny random reads, and the filesystem cache will keep the hot nodes in direct memory (the upper layers in the hierarchical graph?). I am saying this because it seems that we're expecting to load the entire graph into RAM at some point. I don't think this is needed at query time, hence my comment.

The tricky part in my opinion is at build time, where the graph is updated dynamically. This requires more efficient access and the ability to change the data. We also need to keep the nearest-neighbor distances for each neighbor, so the total cost is N*M*8, where N is the total number of documents, M the maximum number of neighbors per node, and 8 the cost associated with keeping a doc id and the distance for each neighbor (int+float). The formula is slightly more complicated for a hierarchical graph but doesn't change the scale. This memory requirement seems acceptable for medium-sized graphs in the range of 1-10M, but can become problematic when building large graphs of hundreds of millions of nodes. Considering the logarithmic growth of the number of operations needed to find a local minimum as the dataset grows, building large graphs is encouraged at the expense of more memory. I don't know what would be acceptable, but requiring tens of gigabytes of heap memory to build such a graph doesn't seem compelling to me. Considering that the benefits of using a graph are already visible in the 1-10M range, I also wonder if we could make a compromise and cap the size of the graphs we build. So instead of having one graph per segment, we'd build N of them, depending on how much memory the user is willing to allocate for the build and the total number of docs present in the segment. Obviously, searching these graphs sequentially would be more costly than having a single giant graph. However, this could also have interesting properties when merging segments, since we wouldn't need to rebuild graphs that have reached the maximum size allowed (assuming there are no deleted documents).
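
To put that formula in concrete terms (an illustrative scale only; N = 10,000,000 documents and M = 32 neighbors are assumed values, not measurements):

  $N \times M \times 8 = 10{,}000{,}000 \times 32 \times 8\ \text{bytes} \approx 2.56 \times 10^{9}\ \text{bytes} \approx 2.4\ \text{GiB}$

which would be the build-time memory needed for the neighbor lists alone at that scale.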

This is an idea I wanted to raise so that we don't limit ourselves to a single graph spanning an entire segment of up to 2 billion vectors (the maximum allowed per index in Lucene).

asfimport commented 4 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I agree with @jimczi; the discussions in this issue are super interesting. I like the idea of keeping per-entry neighborhood information layer-wise only. I am wondering if this can be precomputed and maybe stored for faster computation when needed; e.g. sparsifying single-node neighbors should be possible, hence we could even reuse a temporary per-node posting list.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@jimczi thanks for your insightful comments!

I've been working on an implementation again; this is basically a stripped-down (single layer, not hierarchical) version of @mocobeta's most recent patch. I started from that patch, but it has evolved quite a bit, so I think it should get shared as a different thing. Here are some notable differences:

  1. The graph is built on heap only while flushing/merging and then is written to disk and afterwards searched by reading it from a file, with minimal on-heap data structures.
  2. It has a simpler API that hides the HNSW graph reading/writing as an internal detail of the vectors format.
  3. Nodes in the graph are identified by dense ordinals rather than docids. This allows for direct indexing of the vectors during graph search and leads to some efficiency when reading/writing the graph since friend lists are always built in sorted order. Graph ordinals must be mapped to/from docids, but only at the API boundary (see the sketch after this list).
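
A minimal sketch of that boundary mapping (hypothetical names; it assumes vectors, and hence ordinals, are written in increasing docid order at flush time):

  import java.util.Arrays;

  /** Hypothetical ordinal <-> docid mapping for a graph whose nodes are dense ordinals. */
  class OrdToDocMap {
    private final int[] docByOrd; // docByOrd[ordinal] = docid; ascending because ordinals follow docid order

    OrdToDocMap(int[] docByOrd) {
      this.docByOrd = docByOrd;
    }

    /** Graph node -> docid: a plain array lookup. */
    int docId(int ordinal) {
      return docByOrd[ordinal];
    }

    /** docid -> graph node: binary search over the sorted docid array, or -1 if absent. */
    int ordinal(int docId) {
      int ord = Arrays.binarySearch(docByOrd, docId);
      return ord >= 0 ? ord : -1;
    }
  }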

I tried to do as little as possible to get a viable NSW vector KNN search that we can build on/improve; it's not perfect, but builds on the clean APIs / new index format previously developed in this patch, does achieve reasonable performance in my tests, and I think can be a starting point. With that in mind, here are some big TODOs I left for future iterations:

  1. Doesn't yet handle sorted indexes when merging
  2. No handling of deleted docs while searching (they would be merged away, but docs deleted in the current segment are still included in search).
  3. Doesn't impose a maximum fanout on the graph (it's always symmetric right now)
  4. Takes a simple-minded approach to manage the graph while indexing - per-segment graph, which is more costly to search than a single index-wide graph (or some hybrid) as pointed out above. I think we can find a way to have a graph span multiple segments, but leave this for later.
  5. Nothing done re: expanding the search to handle filtering due to deletions and/or other query constraints
  6. Some data structures can be optimized for sure (there are Map<Integer, Long> that would be better unboxed, etc)

A few of those (deleted docs, sorted index handling) really need to be addressed right away, but I think this can at least be tested for performance/suitability of the API without them. Some of the other issues require exciting work that improves the state of the art. EG to expand the search after filtering, we should be able to make the search re-entrant by maintaining the visited-neighbors set. On the subject of more efficient merging, we may be able to merge two existing graphs by laying connections between them. Also - I think there is an opportunity to build the graph concurrently on multiple threads. This is a bit tricky and may sacrifice some recall depending on how it's done, but could be a good trade-off to get better indexing speed. But this is all in the future - to start with we need something basic that works.

So I'd like to make some progress here by getting something basic committed to "master" (aside: are we going to rename? I see github at least is moving to "main"?) that we can all iterate on and improve.

The patch is kind of big (around 5K LOC) so I think it needs to be broken up in order to be committed safely. A lot of the change is boilerplate required to introduce a new index format, so I hope we can commit that first to get it out of the way. My plan is to break it into two main commits:

  1. Add the VectorValues format along the lines of #10362. I plan to pick up the discussion there and propose a simple implementation. This would get us vector storage and retrieval, but no efficient KNN search implementation. This change will include a lot of the plumbing that touches a lot of files, but so long as we are agreed that adding the format is a good idea, should be fairly straightforward :)
  2. Add the NSW/KNN search implementation. I think we can do this without any significant public API changes. It will basically be an implementation of VectorValues.search(float[] target, int k, int fanout), supported by a graph built when flushing the vector field.
  3. Make it better! We can have more smaller issues once something basic is in place.

Small aside, @jimczi re cost of the graph:

> We also need to keep the nearest neighbor distances for each neighbor so the total cost is N*M*8

I think this is not so for the NSW algorithm; you just keep the docids, so N*M*4 is the cost. Still this doesn't change any of the scaling conclusions.

asfimport commented 4 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is an exciting feature!  It is not every day that a new Codec component is born!  Nearest-neighbor vector search seems here to stay, so supporting this efficiently in Lucene makes sense.

+1 to @msokolov's plan above to break this complex and exciting feature into bite-sized steps!   Future improvements (chunking the graph for a large segment, more efficient filtering with other query clauses, maybe adding hierarchy/layers to the graph) can come as subsequent iterations.

Handling the sorted index is hopefully not so difficult, since you already dereference the dense ordinals to docids.

asfimport commented 4 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Thanks @mikemccand - I posted a PR over in #10362 that is just for adding vectors, no KNN search yet.