Closed jtibshirani closed 2 years ago
Pinging @elastic/es-search (Team:Search)
Very excited to see this!
Support cosine similarity instead of dot product (?)
IMO both should be supported. Similar items via cosine sim is a very common use case, as is dot product (for e.g. recommendations).
@MLnick thanks for the feedback, I updated the plan to make sure we cover both. Dot product may need special handling as it's not a true metric (for example doesn't satisfy the triangle inequality). I've also seen dot product used as an optimized cosine similarity, by normalizing all vectors to unit length beforehand -- this is more straightforward to support.
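To illustrate the two points in the comment above, here is a minimal sketch in plain Python (not Elasticsearch or Lucene code; the helper names are made up for this example). It checks that the dot product of unit-normalized vectors equals the cosine similarity of the originals, and shows one way the raw dot product behaves unlike a distance: a vector can score higher against another vector than against itself.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length; a zero vector has no direction to keep."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]

# After normalizing once up front, a cheap dot product gives the cosine score.
cos_ab = cosine_similarity(a, b)
dot_unit = dot(normalize(a), normalize(b))
assert math.isclose(cos_ab, dot_unit)

# Without normalization, a vector is not necessarily its own best match:
x = [1.0, 0.0]
y = [2.0, 0.0]
assert dot(x, y) > dot(x, x)
```

Normalizing at index time moves the cost of the two square roots out of the query path, which is why the normalized variant is the simpler one to support.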
@mayya-sharipova great to see this being put to work, I was going through the documentation WIP (https://elasticsearch_80857.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/knn-search.html#exact-knn) and was a bit confused. There's a lot of work being done around ANN's, which is experimental, and there's "exact kNN", but correct me if I'm wrong, there's nothing new being done with regards to exact kNN's right? The function score there is already possible in 7.11 for example.
I was wondering whether the effort being done will help speed up exact kNN searches. Is that something that will be improved by this issue? If I read the issue correctly it doesn't sound like it, but I wanted to make sure I wasn't mistaken.
@coreation
there's nothing new being done with regards to exact kNN's right?
You are right, this issue and the work done concern only approximate NN, and don't bring improvements to exact kNN search.
I'm wondering what your use case for exact kNN is. Would it be possible for your use case to use ANN tuned for high accuracy/recall (using a large value for num_candidates)?
Also, what kind of speedups are you thinking of for exact kNN search?
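As a rough sketch of what "ANN tuned for high recall" could look like, the snippet below builds a request body for the experimental kNN search endpoint discussed in this thread. The body shape (a `knn` object with `field`, `query_vector`, `k` and `num_candidates`) is an assumption based on the experimental API and may change; the field name `doc_vector`, the vector values and the helper function are hypothetical.

```python
import json

def build_knn_search_body(field, query_vector, k, num_candidates):
    """Build a kNN search body (assumed shape of the experimental API).

    A larger num_candidates makes each shard consider more candidates,
    trading speed for recall.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,                      # hypothetical dense_vector field
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }

# Ask for 10 neighbors but examine many more candidates per shard.
body = build_knn_search_body("doc_vector", [0.12, -0.45, 0.91], k=10, num_candidates=1000)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the search request; how high num_candidates needs to be for acceptable recall depends on the data and has to be measured.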
@mayya-sharipova Our use case is: given a set of vectors, find the N best-fitting other documents that also comply with a set of filters. Currently we use a query in combination with a script score function for this, where the script score function can contain 1-30 cosine similarity calculations, since we don't have one vector to match against but a set of vectors.
This takes quite a bit of time, which is understandable given that a single score sometimes requires 30 cosine similarity computations. I think ANN with filters will speed up this process dramatically, if I understand it correctly, because we don't need an exact score per se, just an idea of how well documents match the given vectors.
So your proposal of using ANN tuned for high accuracy/recall will suffice - and likely return results in a much faster manner.
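For readers unfamiliar with the setup described above, here is a hedged sketch of how such an exact query might be assembled: a filtered `script_score` query whose script calls `cosineSimilarity` once per query vector. How the commenter actually combines the individual similarities is not stated; averaging them and adding 1.0 (to keep scores non-negative) is an assumption, as are the field names, filter and helper function.

```python
import json

def build_multi_vector_script_score(filters, query_vectors, vector_field):
    """Build a filtered script_score query that averages the cosine
    similarity between each document vector and every query vector."""
    if not query_vectors:
        raise ValueError("at least one query vector is required")
    # One script parameter per query vector: v0, v1, ...
    params = {f"v{i}": list(vec) for i, vec in enumerate(query_vectors)}
    terms = [f"cosineSimilarity(params.{name}, '{vector_field}')" for name in params]
    # Average the similarities, then shift by 1.0 so the score cannot go negative.
    source = f"({' + '.join(terms)}) / {float(len(terms))} + 1.0"
    return {
        "query": {
            "script_score": {
                "query": {"bool": {"filter": list(filters)}},
                "script": {"source": source, "params": params},
            }
        }
    }

query = build_multi_vector_script_score(
    filters=[{"term": {"status": "published"}}],   # hypothetical filter
    query_vectors=[[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]],
    vector_field="doc_vector",                      # hypothetical field name
)
print(json.dumps(query, indent=2))
```

Every document that passes the filter is scored with one cosine computation per query vector, so the cost grows with both the size of the filtered set and the number of vectors, which matches the slowdown described above.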
+1 for the combination of ANN and filters!
From what I understand from the current draft, the combination of ANN and filtering won't be supported yet and will only be explored in the distant future? Our use case also relies heavily on running kNN on a filtered subset of documents. As these subsets are growing into the millions, we've reached the limits of kNN and hoped to switch to ANN with the 8.0 release. However, if ANN doesn't support filtering, we will run into accuracy problems when running it on the whole index.
Thanks for the feedback! I was too ambitious in listing all these extensions (like filtering) under "Phase 2". I changed the heading name to "Future Plans". We'll tackle them in their own dedicated GitHub issues.
Thanks for the update @jtibshirani, will that issue (filtering) also be linked in the main post when it's available?
Thank you for the great work on ANN support. I could not agree more with @tholor's view on ANN and filtering. Filtering with ANN is one of the powerful options that other databases lack. As a data scientist, I feel the need for it every day. Hence, it would be more beneficial if ANN + filter were given a higher priority.
I opened https://github.com/elastic/elasticsearch/issues/81788 to track work on supporting ANN with filtering (also linked under "Future Plans" in the description). From your comments, it sounds like filtering would be really useful and a high priority for you.
I'm going to close out this issue, since we've merged the work required for basic ANN support. This is just a beginning -- we expect to iterate on and improve the feature through other GitHub issues.
I opened a new meta issue to track our follow-up work: https://github.com/elastic/elasticsearch/issues/84324.
Background
Currently Elasticsearch supports storing vectors through the dense_vector field type and using them when scoring documents. This allows users to perform an exact k-nearest neighbors (kNN) search by scanning all documents. This work builds on that functionality to support fast, approximate nearest neighbor search (ANN). The implementation will use Lucene's new ANN support, which is based on the HNSW algorithm. Since Lucene will ship ANN in its upcoming 9.0 release, this feature will only target Elasticsearch 8.x.
Our plan is to extend the dense_vector field type to support adding vectors to an ANN index. We'll then add a new REST endpoint focused on kNN search. This new endpoint will be marked 'experimental' in the first release, as we expect to make API improvements in response to feedback. At first the endpoint will only perform kNN, but we'll follow up with support for filtering, hybrid retrieval, aggregations, and more. We are really looking forward to everyone's feedback, which will help define the feature and set its direction.
Implementation Plan
Phase 0: Help prepare Lucene's HNSW implementation
Phase 1: Basic ANN support
dense_vector field type to support ANN indexing
Future Plans: Improvements to functionality and performance
81788
72068