apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Introduce IVFFlat to Lucene for ANN similarity search [LUCENE-9136] #10177

asfimport opened 4 years ago

asfimport commented 4 years ago

Representation learning (RL) has been an established discipline in machine learning for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. Once the data are embedded into high-dimensional vectors, vector retrieval (VR) methods can be applied to search for relevant items.

With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open-source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++ with no plans to support a Java interface, making them hard to integrate into Java projects and hard to use for those who are not familiar with C/C++ (https://github.com/facebookresearch/faiss/issues/105).

The algorithms for vector retrieval can be roughly classified into four categories:

  1. Tree-based algorithms, such as KD-tree;
  2. Hashing methods, such as LSH (Locality-Sensitive Hashing);
  3. Quantization-based algorithms, such as IVFFlat;
  4. Graph-based algorithms, such as HNSW, SSG, NSG;

where IVFFlat and HNSW are the most popular ones among all the VR algorithms.

IVFFlat is better for high-precision applications such as face recognition, while HNSW performs better in general scenarios such as recommendation and personalized advertising. The recall of IVFFlat can be gradually increased by adjusting the query parameter (nprobe), while it is hard for HNSW to improve its accuracy after the graph is built. In theory, IVFFlat can achieve 100% recall.
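For illustration, here is a minimal sketch (plain Java, not code from any patch in this issue) of the nprobe trade-off described above: rank the cluster centroids by distance to the query, scan only the nprobe closest clusters, and note that probing all clusters degenerates to an exhaustive scan.

```java
import java.util.*;

class IvfFlatSketch {
  float[][] centroids;          // one centroid per cluster
  List<List<float[]>> clusters; // vectors assigned to each centroid

  List<float[]> search(float[] query, int nprobe, int topK) {
    // 1. Rank centroids by distance to the query; probe the nprobe closest.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingDouble(i -> l2(query, centroids[i])));

    // 2. Exhaustively score only the vectors inside the probed clusters.
    PriorityQueue<float[]> best =
        new PriorityQueue<>(Comparator.comparingDouble(v -> -l2(query, v)));
    for (int p = 0; p < Math.min(nprobe, order.length); p++) {
      for (float[] v : clusters.get(order[p])) {
        best.add(v);
        if (best.size() > topK) best.poll(); // evict the current worst
      }
    }
    return new ArrayList<>(best);
  }

  static double l2(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}
```

Raising nprobe enlarges the scanned fraction of the index, which is why recall climbs smoothly toward 100% at the cost of query time.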

Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. That issue has drawn the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene.

As an alternative for solving ANN similarity search problems, IVFFlat is also very popular and has many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for the graph (https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors). Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is already under development, it may be better to provide both implementations (HNSW and IVFFlat) for potential users who face very different scenarios and want more choices.

The latest branch is https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat


Migrated from LUCENE-9136 by Xin-Chun Zhang, 3 votes, updated Dec 09 2020 Attachments: glove-100-angular.png, glove-25-angular.png, image-2020-03-07-01-22-06-132.png, image-2020-03-07-01-25-58-047.png, image-2020-03-07-01-27-12-859.png, sift-128-euclidean.png

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

I worked on this issue for about three to four days, and searching now works fine.

My personal dev branch is available on GitHub: https://github.com/irvingzhang/lucene-solr/tree/jira/LUCENE-9136. The index format of IVFFlat (only one meta file, with suffix .ifi) is shown in the class Lucene90IvfFlatIndexFormat. In my implementation, the clustering process is optimized when the number of vectors is sufficiently large (e.g. > 200,000 per segment): a subset selected after shuffling is used for training, thereby saving time and memory. The insertion performance of IVFFlat is better because insertions require no extra work, while HNSW needs to maintain its graph. However, IVFFlat spends more time on flushing because of the k-means clustering.
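A minimal sketch of the subsampling idea mentioned above (hypothetical names, not the branch's actual classes): when a segment holds more vectors than the threshold, shuffle and train k-means on a sample only; every vector is still assigned to its nearest centroid afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainingSampler {
  // Per-segment cutoff from the comment above (> 200,000 vectors).
  static final int SAMPLE_THRESHOLD = 200_000;

  static List<float[]> trainingSet(List<float[]> segmentVectors, Random random) {
    if (segmentVectors.size() <= SAMPLE_THRESHOLD) {
      return segmentVectors; // small segment: train on everything
    }
    List<float[]> copy = new ArrayList<>(segmentVectors);
    Collections.shuffle(copy, random);        // unbiased random sample
    return copy.subList(0, SAMPLE_THRESHOLD); // k-means trains on this subset
  }
}
```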

My test cases show that the query performance of IVFFlat is better than that of HNSW, even though HNSW uses a cache for graphs while IVFFlat has no cache, and its recall is pretty high (average time < 10 ms and recall > 96% over a set of 50,000 random vectors with 100 dimensions). My test classes for IVFFlat are under the directory https://github.com/irvingzhang/lucene-solr/blob/jira/LUCENE-9136/lucene/core/src/test/org/apache/lucene/util/ivfflat/. A performance comparison between IVFFlat and HNSW is in the class TestKnnGraphAndIvfFlat.

The work is still in its early stage. There must be some bugs that need to be fixed, and I would like to hear more comments. Everyone is welcome to participate in this issue.

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Hello Xin-Chun Zhang, to me this looks like a really interesting direction! We also found in our research that k-means clustering (IVFFlat) could achieve high recall with a relatively low number of distance computations. It performs well compared to KD-trees and LSH, although it tends to require substantially more distance computations than HNSW. A nice property of the approach is that it's based on a classic, well-understood algorithm, k-means, and it has few tuning parameters.

I wonder if this clustering-based approach could fit more closely in the current search framework. In the current prototype, we keep all the cluster information on-heap. We could instead try storing each cluster as its own 'term' with a postings list. The kNN query would then be modelled as an 'OR' over these terms.
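As a rough sketch of what that could look like with stock Lucene query classes (the field and term names here are invented for illustration): each cluster is indexed as a term, and the kNN query becomes a disjunction over the nprobe closest centroids' postings lists.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class CentroidQuerySketch {
  // Build an 'OR' over the probed clusters' postings lists.
  static Query probeClusters(int[] closestCentroidIds) {
    BooleanQuery.Builder or = new BooleanQuery.Builder();
    for (int id : closestCentroidIds) {
      or.add(new TermQuery(new Term("centroid", Integer.toString(id))),
             BooleanClause.Occur.SHOULD);
    }
    return or.build();
  }
}
```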

A major concern about clustering-based approaches is the high indexing cost. K-means is a heavy operation in itself. And even if we only use a subsample of documents during k-means, we must compare each indexed document to all centroids to assign it to the right cluster. With the heuristic of using sqrt(n) centroids, assignment alone costs on the order of n * sqrt(n) distance computations (roughly 10^9 for a segment of one million documents), which could give poor scaling behavior when indexing large segments. Because of this concern, it would be nice to include benchmarks for index time (in addition to QPS). A couple more thoughts on this point:

```
approach           10 probes  20 probes  100 probes  200 probes
random centroids   0.578      0.68       0.902       0.961
k-means, 1 iter    0.729      0.821      0.961       0.987
k-means, 2 iters   0.775      0.854      0.968       0.989
k-means, 20 iters  0.806      0.875      0.972       0.991
```

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Hello Xin-Chun Zhang, thanks for the updates and for trying out the idea. I was thinking we could actually reuse the existing PostingsFormat and DocValuesFormat implementations. This would have a few advantages:

I created this draft PR to sketch out the approach: https://github.com/apache/lucene-solr/pull/1314. I'm looking forward to hearing your thoughts! I included some benchmarks, and the QPS looks okay. The indexing time is ~1 hour for 1 million points; as we discussed earlier, this is a major concern, and it will be important to think about how it can be improved.

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

@jtibshirani Xin-Chun Zhang thanks for your hard work here!

I was thinking we could actually reuse the existing PostingsFormat and DocValuesFormat implementations.

Actually, the first implementation of HNSW (by Michael Sokolov) wrapped DocValuesFormat to avoid code duplication. However, this approach of reusing existing code could lead to another concern from the perspective of maintenance. (From the beginning, Adrien Grand suggested a dedicated format instead of hacking doc values.) This is the main reason why I introduced a new format for kNN search in #10047.

I'm not strongly against the "reusing existing formats" strategy if it's the best way here; I just would like to share my feeling that it could be a bit controversial, and you might need to convince maintainers that this (pretty new) feature does not cause any problems or concerns for future maintenance of Lucene core if you implement it on top of existing formats/readers.

I have not closely looked at your PR yet - sorry if my comments are completely beside the point (you might already have talked with other committers about the implementation in another channel, e.g. private chats?).

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

Hello @mocobeta! My explanation before was way too brief; I'm still getting used to the joint JIRA/GitHub set-up :) I'll give more context on the suggested direction.

The draft adds a new format VectorsFormat, which simply delegates to DocValuesFormat and PostingsFormat under the hood:

Given a query vector, we first iterate through all the centroid terms to find a small number of closest centroids. We then take the disjunction of all those postings enums to obtain a DocIdSetIterator of candidate nearest neighbors. To produce the score for each candidate, we load its vector from BinaryDocValues and compute the distance to the query vector.
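A condensed sketch of that scoring step (decode(...) is a stand-in for whatever byte-to-float decoding the format would use; the top-k collection is elided):

```java
import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

class CandidateScorerSketch {
  // Walk the candidate iterator (the disjunction over probed centroids) in
  // docID order and score each hit against the query vector.
  static void scoreCandidates(DocIdSetIterator candidates,
                              BinaryDocValues vectors,
                              float[] queryVector) throws IOException {
    for (int doc = candidates.nextDoc();
         doc != DocIdSetIterator.NO_MORE_DOCS;
         doc = candidates.nextDoc()) {
      if (vectors.advanceExact(doc)) { // forward iteration only
        float[] v = decode(vectors.binaryValue());
        double score = -squaredDistance(queryVector, v); // smaller distance = better
        // ... push (doc, score) into a top-k heap here
      }
    }
  }

  static float[] decode(BytesRef bytes) { /* format-specific */ return new float[0]; }

  static double squaredDistance(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}
```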

I liked that this approach didn't introduce major new data structures and could reuse the existing formats. To respond to your point, one difference between this approach and HNSW is that it is able to reuse the formats without modifications to their APIs or implementations. In particular, it doesn't require random access for doc values; they are only accessed through forward iteration. So to keep the code as simple as possible, I stuck with BinaryDocValues and didn't create a new way to store the vector values. However, the PR does introduce a new top-level VectorsFormat, as I thought this gave nice flexibility while prototyping.

There are two main hacks in the draft that would need addressing:

Even apart from code-level concerns, I don't think the draft PR would be ready to integrate immediately. There are some areas where I think further work is needed to determine if coarse quantization (IVFFlat) is the right approach:

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

Hi, @jtibshirani, thanks for your excellent work!

I was thinking we could actually reuse the existing PostingsFormat and DocValuesFormat implementations.

Yes, the code could be simpler by reusing these formats. But I agree with @mocobeta that ANN search is a pretty new feature for Lucene, and it's better to use a dedicated format for maintenance reasons. Moreover, if we are going to use a dedicated vector format for HNSW, that format should also apply to IVFFlat, because IVFFlat and HNSW serve the same purpose of ANN search. It would be strange for users if IVFFlat and HNSW behaved completely differently.

In particular, it doesn’t require random access for doc values, they are only accessed through forward iteration.

Actually, we do need random access to the vector values! For a typical search engine, we are going to retrieve the best-matched documents after obtaining the top-k docIDs. Retrieving vectors via these docIDs requires random access to the vector values.

asfimport commented 4 years ago

Xin-Chun Zhang (migrated from JIRA)

  1. My personal git branch: https://github.com/irvingzhang/lucene-solr/tree/jira/lucene-9136-ann-ivfflat.

  2. The vector format is as follows:

image-2020-03-07-01-25-58-047.png

The structure of the IVF index meta is as follows:

image-2020-03-07-01-27-12-859.png

The structure of the IVF data is as follows:

image-2020-03-07-01-22-06-132.png

  3. The ann-benchmarks tool can be found at: https://github.com/irvingzhang/ann-benchmarks.

Benchmark results (single thread, 2.5 GHz * 2 CPUs, 16 GB RAM, nprobe = 8, 16, 32, 64, 128, 256, centroids = 4*sqrt(N), where N is the size of the dataset):

1) Glove-1.2M-25D-Angular: index build + training cost 706 s, QPS 18.8~49.6, recall 76.8%~99.7%

glove-25-angular.png

2) Glove-1.2M-100D-Angular: index build + training cost 2487 s, QPS 12.2~38.3, recall 65.8%~96.3%

glove-100-angular.png

3) Sift-1M-128D-Euclidean: index build + training cost 2397 s, QPS 14.8~38.2, recall 71.1%~99.2%

sift-128-euclidean.png

asfimport commented 4 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

I was thinking we could actually reuse the existing PostingsFormat and DocValuesFormat implementations.

That's one of the main reasons why this approach is interesting for Lucene. The main operation at query time is a basic inverted-list search, so it would be a shame not to reuse the existing formats that were designed for this purpose. In general, I think this approach (k-means clustering at index time) is very compelling, since it's a light layer on top of existing functionality. The computational cost is big (running k-means and assigning vectors to centroids), but we can ensure that it remains acceptable by capping the number of centroids or by using a hybrid approach with a small-world graph, as Julie suggested.

Regarding the link with the graph-based approach, I wonder what the new ANN Codec should expose. If the goal is to provide approximate nearest neighbors capabilities to Lucene I don't think we want to leak any implementation details there.

It's difficult to tell now since both efforts are in the design phase, but I think we should aim at something very simple that only exposes an approximate nearest-neighbor search. Something like:

```java
interface VectorFormat {
  TopDocs ann(int topN, int maxDocsToVisit);
  float[] getVector(int docID);
}
```

should be enough. Most of the formats we have in Lucene have sensible defaults or compute parameters based on the shape of the data, so I don't think we should expose tons of options here. This is another advantage of this approach in my opinion, since we can compute the number of centroids needed for each segment automatically. Research in this area is also moving fast, so we need to remain open to new approaches without requiring a new format every time.
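For instance (a sketch, not part of any patch here), the per-segment centroid count could be derived from the segment size alone, along the lines of the sqrt heuristic discussed earlier and the centroids = 4*sqrt(N) setting used in the benchmarks above:

```java
class CentroidCountSketch {
  // Derive the number of centroids automatically from the segment size.
  static int centroidsForSegment(int numVectors) {
    return Math.max(1, (int) (4 * Math.sqrt(numVectors)));
  }
}
```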

> Actually, we need random access to the vector values! For a typical search engine, we are going to retrieving the best matched documents after obtaining the TopK docIDs. Retrieving vectors via these docIDs requires random access to the vector values.

You can sort the top-k (which should be small) by docID and then perform the lookups sequentially. That's how we retrieve stored fields from top documents in normal search. This is again an advantage over the graph-based approach, because it is compliant with the search model in Lucene, which requires forward iteration.
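A minimal sketch of that lookup pattern: sort the (small) top-k docIDs ascending, and a single forward pass over the doc values suffices, with no random access.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.BytesRef;

class SortedLookupSketch {
  static void loadVectors(int[] topKDocIds, BinaryDocValues vectors) throws IOException {
    Arrays.sort(topKDocIds); // ascending docID order
    for (int doc : topKDocIds) {
      if (vectors.advanceExact(doc)) { // the target always moves forward
        BytesRef vector = vectors.binaryValue(); // this doc's encoded vector
      }
    }
  }
}
```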

To move forward on this issue, I'd like to list the things that in my opinion need clarification:

In general I feel like the branch proposed by Xin-Chun Zhang and the additional changes by @jtibshirani are moving in the right direction. The QPS improvement over a brute-force approach is already compelling, as outlined in https://github.com/apache/lucene-solr/pull/1314, so I don't think it will be difficult to reach a consensus on whether this would be useful to add to Lucene.

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Do we need a new VectorFormat that can be shared with the graph-based approach ?

About this point, I think we don't need to consider both approaches at once. Please don't wait for or worry about the HNSW issue; concentrate on getting this into master. I, or someone with more knowledge and experience in this area, will find a good way to integrate the graph-based approach later.

asfimport commented 4 years ago

Jim Ferenczi (@jimczi) (migrated from JIRA)

Thanks for chiming in, @mocobeta. I still think it would be valuable to discuss the minimal signature needed to share a new codec between both approaches. I also think there is a consensus around the fact that multiple strategies could be needed, depending on the trade-offs users are willing to make. If we start adding codecs and formats for every strategy we think valuable, I am afraid this will block us sooner than we expect. If we agree that having a new codec for vectors and ANN is valuable in Lucene, my proposal is to have a generic codec that can be used to test different strategies (k-means, HNSW, ...). IMO this could also change the goal for these approaches, since we don't want to require users to tune tons of internal options (number of neighbors, number of levels, ...) upfront.

asfimport commented 4 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

@jimczi Thank you for elaborating. I agree with you; it would be great to have some abstraction for vectors (an interface, or an abstract base class with a default implementation?) for experimenting with different ANN search algorithms.

asfimport commented 4 years ago

Julie Tibshirani (@jtibshirani) (migrated from JIRA)

A note that I filed #10362 to start a discussion around a unified format API.

asfimport commented 3 years ago

Alessandro Benedetti (@alessandrobenedetti) (migrated from JIRA)

Now that #10362 has been resolved, what remains in this issue to be merged?

Mahdi-Seeker commented 2 years ago

Hi guys, thanks for your great work on Lucene and especially this ANN search!

Any progress on this issue? We're trying to use vector search, but HNSW seems to take too much memory...