Integrate ANN search

jtibshirani commented 3 years ago

Background

Currently Elasticsearch supports storing vectors through the dense_vector field type and using them when scoring documents. This allows users to perform an exact k-nearest neighbors (kNN) search by scanning all documents. This work builds on that functionality to support fast, approximate nearest neighbor search (ANN). The implementation will use Lucene's new ANN support, which is based on the HNSW algorithm. Since Lucene will ship ANN in its upcoming 9.0 release, this feature will only target Elasticsearch 8.x.
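For readers new to the terminology, exact kNN "by scanning all documents" can be sketched as below. This is only an illustration of the brute-force idea, not Elasticsearch's implementation; the function name `exact_knn`, the sample data, and the choice of Euclidean distance are assumptions made for the example.

```python
import heapq
import math


def exact_knn(query, docs, k):
    """Return the k (doc_id, distance) pairs closest to query by L2 distance.

    Every stored vector is scored, so the cost grows linearly with len(docs).
    """
    scored = ((doc_id, math.dist(query, vec)) for doc_id, vec in docs.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy data: document id -> stored vector.
docs = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0]}
nearest = exact_knn([0.9, 0.1], docs, k=2)
```

An ANN index such as HNSW avoids scoring every vector, trading a small amount of recall for much lower query cost.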

Our plan is to extend the dense_vector field type to support adding vectors to an ANN index. We'll then add a new REST endpoint focused on kNN search. This new endpoint will be marked 'experimental' in the first release, as we expect to make API improvements in response to feedback. At first the endpoint will only perform kNN, but we'll follow up with support for filtering, hybrid retrieval, aggregations, and more. We are really looking forward to everyone's feedback, which will help define the feature and set its direction.

Implementation Plan

Phase 0: Help prepare Lucene's HNSW implementation

[x] Run benchmarks to more deeply understand performance (https://issues.apache.org/jira/browse/LUCENE-9937)
[x] Ensure Lucene API has required features and plugin points
[x] Help resolve vector-related blockers to releasing Lucene 9.0

Phase 1: Basic ANN support

Update dense_vector field type to support ANN indexing
Add new API that supports ANN
Fix issues that pop up in Lucene
- [x] https://issues.apache.org/jira/browse/LUCENE-10147
- [x] https://issues.apache.org/jira/browse/LUCENE-10228
Performance testing and improvements
- [x] https://github.com/elastic/elasticsearch/pull/78724
- [x] https://github.com/elastic/rally-tracks/pull/217
Update documentation

Future Plans: Improvements to functionality and performance

Support ANN with filtering
- https://github.com/elastic/elasticsearch/issues/81788
Support search timeouts and cancellation
Support "hybrid retrieval", where kNN results are combined with matches from another query
Narrow performance gap between Lucene HNSW and nmslib
- https://issues.apache.org/jira/browse/LUCENE-10054
Support other vector element types (bfloat16, integers, etc.)
- https://github.com/elastic/elasticsearch/issues/48322
- https://github.com/elastic/elasticsearch/issues/72067
Figure out best way to support "maximum inner product search" (dot product similarity with unnormalized vectors)
#72068

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

MLnick commented 3 years ago

Very excited to see this!

Support cosine similarity instead of dot product (?)

IMO both should be supported. Similar items via cosine sim is a very common use case, as is dot product (for e.g. recommendations).

jtibshirani commented 3 years ago

@MLnick thanks for the feedback, I updated the plan to make sure we cover both. Dot product may need special handling as it's not a true metric (for example, it doesn't satisfy the triangle inequality). I've also seen dot product used as an optimized cosine similarity, by normalizing all vectors to unit length beforehand -- this is more straightforward to support.
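A small sketch of both points, using made-up vectors and hypothetical helper names: once vectors are normalized to unit length, a plain dot product equals the cosine similarity, while on unnormalized vectors the dot product can rank a longer vector above an exact match.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
# After normalizing once up front, a plain dot product gives the cosine.
unit_dot = dot(normalize(a), normalize(b))

# With unnormalized vectors, a vector need not be its own best match,
# which is one way dot product fails to behave like a distance metric.
q, longer = [1.0, 0.0], [5.0, 0.0]
self_score, other_score = dot(q, q), dot(q, longer)
```

Normalizing at index time moves the division out of the query path, which is why this variant is the easier one to support.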

coreation commented 2 years ago

@mayya-sharipova great to see this being put to work. I was going through the documentation WIP (https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html#exact-knn) and was a bit confused. There's a lot of work being done around ANN, which is experimental, and there's "exact kNN", but correct me if I'm wrong, there's nothing new being done with regards to exact kNN's right? The function score there is already possible in 7.11, for example.

I was wondering whether this effort will help speed up exact kNN searches. Is that something this issue will improve? If I read the issue correctly it doesn't sound like it, but I wanted to make sure I wasn't mistaken.

mayya-sharipova commented 2 years ago

@coreation

there's nothing new being done with regards to exact kNN's right?

You are right, this issue and the work done concern only approximate NN and don't bring improvements to exact kNN search.

I'm wondering what your use case for exact kNN is. Would it be possible for your use case to use ANN tuned for high accuracy/recall (using a large value for num_candidates)?

Also, what kind of speedups are you thinking of for exact kNN search?
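To make the num_candidates suggestion concrete, here is a hedged sketch of a kNN search request body built in Python. The body shape follows my understanding of the experimental `_knn_search` endpoint as it shipped in 8.0; the field name `embedding`, the helper `knn_search_body`, and the specific values are hypothetical, and the `num_candidates >= k` check is an assumed constraint rather than something stated in this thread.

```python
import json


def knn_search_body(query_vector, k, num_candidates):
    """Build a kNN search body for a hypothetical dense_vector field."""
    # Assumed constraint: the candidate pool cannot be smaller than k.
    if num_candidates < k:
        raise ValueError("num_candidates must be >= k")
    return {
        "knn": {
            "field": "embedding",  # hypothetical field name
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        }
    }


# A larger candidate pool should raise recall at the cost of slower searches.
fast = knn_search_body([0.1, 0.2, 0.3], k=10, num_candidates=50)
accurate = knn_search_body([0.1, 0.2, 0.3], k=10, num_candidates=1000)
print(json.dumps(accurate, indent=2))
```

The two bodies differ only in num_candidates, so recall can be tuned per request without reindexing.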

coreation commented 2 years ago

@mayya-sharipova Our use case is that, given a set of vectors, we find the best-fitting (N) other documents that also comply with a set of filters. Currently we use a query in combination with a script score function for this, where the script score function can have 1-30 cosine similarity calculations, since we don't have 1 vector to match against, but a set of vectors.

This takes quite a bit of time, which is understandable given that a single score sometimes requires 30 cosine similarity computations. I think ANN with filters will speed up this process dramatically, if I understand it correctly, because we don't need an exact score per se, just an idea of how well the documents match the given vectors.

So your proposal of using ANN tuned for high accuracy/recall will suffice - and likely return results in a much faster manner.
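The workload described in this comment can be sketched roughly as follows. This is an illustration only: the thread does not say how the individual similarities are combined, so summing them is an assumption, and the names `multi_vector_top_n` and `keep` as well as the sample documents are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def multi_vector_top_n(query_vectors, docs, keep, n):
    """Score filtered docs by summed cosine similarity to every query vector."""
    scored = (
        (doc["id"], sum(cosine(q, doc["vector"]) for q in query_vectors))
        for doc in docs
        if keep(doc)  # apply the filter first, so only matching docs are scored
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


docs = [
    {"id": 1, "lang": "en", "vector": [1.0, 0.0]},
    {"id": 2, "lang": "en", "vector": [0.0, 1.0]},
    {"id": 3, "lang": "de", "vector": [1.0, 0.1]},
]
queries = [[1.0, 0.0], [0.9, 0.1]]
top = multi_vector_top_n(queries, docs, keep=lambda d: d["lang"] == "en", n=1)
```

Each filtered document costs len(query_vectors) similarity computations, which is why this exact approach slows down as the filtered subset or the query set grows.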

tholor commented 2 years ago

+1 for the combination of ANN and filters!

From what I understand from the current draft, the combination of ANN and filtering won't be supported yet and will only be explored in the distant future? Our use case also relies heavily on running kNN on a filtered subset of documents. As these subsets are growing into the millions, we've reached the limits of kNN and hoped to switch to ANN with the 8.0 release. However, if ANN doesn't support filtering, we will run into accuracy problems when running this on the whole index.

jtibshirani commented 2 years ago

Thanks for the feedback! I was too ambitious in listing all these extensions (like filtering) under "Phase 2". I changed the heading name to "Future Plans". We'll tackle them in their own dedicated GitHub issues.

coreation commented 2 years ago

Thanks for the update @jtibshirani, will that issue (filtering) be linked as well in the main post when available?

msahamed commented 2 years ago

Thank you for the great work on ANN support. I could not agree more with @tholor's view on ANN and filtering. Filtering with ANN is among the powerful options that other databases lack. In my role as a data scientist, I feel the need for it every day. Hence, it would be more beneficial if ANN + filter were a higher priority.

jtibshirani commented 2 years ago

I opened https://github.com/elastic/elasticsearch/issues/81788 to track work on supporting ANN with filtering (also linked under "Future Plans" in the description). From your comments, it sounds like filtering would be really useful and a high priority for you.

I'm going to close out this issue, since we've merged the work required for basic ANN support. This is just a beginning -- we expect to iterate on and improve the feature through other GitHub issues.

jtibshirani commented 2 years ago

I opened a new meta issue to track our follow-up work: https://github.com/elastic/elasticsearch/issues/84324.

elastic / elasticsearch

Integrate ANN search #78473