elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.06k stars 24.83k forks source link

Maximum radius for kNN search API #84929

Closed rw-access closed 1 year ago

rw-access commented 2 years ago

Description

The API for approximate kNN search with dense_vector is 🔥. One of my current use cases involves a maximum distance check, as in the fixed-radius variant of nearest neighbors.

It would be a nice addition to the API in my opinion to have the ability to add a radius filter to results. It may even provide opportunities to bail early when performing the HNSW search, but that could be tricky since sometimes the first approximate neighbor is actually farther away from the second approximate neighbor.

In addition to that, there are some other advantages by adding the maximum distance to the API and implementing it server side: when results from shards are aggregated to find the combined ANN across the index.

(I apologize if this comes across as overly prescriptive or if I made incorrect assumptions. I'm fairly certain that this can decently achieved client-side with some simple filtering and by increasing num_candidates)

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

jtibshirani commented 2 years ago

@rw-access thanks for the feedback! I have also wondered if it'd be helpful to expose a "radius limit" capability. I agree it's more natural to implement it within Elasticsearch rather than client side.

Could you give a little more context on your use case? This can help us as we think through the API (for example is it a radius limit on the kNN search, or should there be a separate "radius query" that matches every document within the radius)?

rw-access commented 2 years ago

Thanks for the reply @jtibshirani!

My use case is twofold: DBSCAN and near-duplicate detection.

For DBSCAN, I know my radius already (experimentally derived) and want to build clusters given that. This means that there are a lot of calls I make where I want to collect all the neighbors within a specific radius (with my current iterative approach I actually only need to collect the 1-2 closest within the radius).

For my usage it's perfectly okay for this to be approximate and that's an acceptable tradeoff. Also, when clustering with DBSCAN using approximate kNN, if point A doesn't find neighbor B within the approximate search + radius filter, there's a decent chance that B finds A.

For near-duplicate detection, I'm essentially doing a quick lookup with the parameters: k=1, num_candidates=10 and then do a client side check for the maximum radius.

Whether that's the same API or not is a good question. I'm not sure how HNSW handles exact kNN within a radius and if that as performant. For me at least, I don't need to be as exact. I would defer to your expertise.

jtibshirani commented 2 years ago

I pondered the API and played around with a prototype. @rw-access it'd be great to hear your thoughts on this.

My current thinking is that we could implement this as a new query type, available in the _search API. Here's an example of how it could be used:

POST index/_search
{
 "query": {
   "vector_radius": {
     "field": "image-vector",
     "query_vector": [0.1, ....],
     "radius": 0.24,
     "num_candidates": 100
   }
 }
}  

The query would be available in the _search endpoint because it can easily combine with other query types through a bool, be used with aggregations, etc.

How the implementation would work: we'd use Lucene’s ANN search methods to retrieve num_candidates nearest neighbors per segment. Then we'd would filter out those candidates that are not within the search radius. This means that the user needs to pick a sufficiently large num_candidates to make sure the results are not too incomplete. The implementation is basically the same as kNN search, except we apply a radius cutoff instead of taking the top k.

It may be possible to design a more sophisticated algorithm, where the user doesn’t need to specify num_candidates, and we just continue to the graph search until we stop finding vectors within the radius. Also in theory it seems possible to use the radius to speed up the initial ANN search. I played around with a bunch of these potential improvements, but didn't find any promising directions! I also didn't like we would use a totally new algorithm without support of an academic paper :)

nariman9119 commented 1 year ago

Are there any updates on this issue? We would find a lot of use in this feature

anordertoreclaim commented 1 year ago

@jtibshirani hi!

We want to combine symbolic and semantic searches together and we would greatly benefit from this feature, as it would allow us to filter out irrelevant documents found through semantics (ANN). I understand that you're not working for ElasticSearch anymore, but perhaps you could forward it to right people.

benwtrent commented 1 year ago

@anordertoreclaim @rw-access I am iterating on this now.

Are we cool with doing an actual radius filter (meaning L2 distance post-filtering), even if the vectors are stored with cosine similarity (thinking future binary hamming distance in the future.... or dot-product with non-unit vectors...)? Some of our similarity metrics can easily map to a radius, but others, maybe not.

Or are your scenarios @anordertoreclaim @rw-access only concerned with vectors stored with l2_norm similarity already?

benwtrent commented 1 year ago

And ping @nariman9119 ^^ as well :)

anordertoreclaim commented 1 year ago

In our case, our vectors are L2-normalized and we would rather use the dot product (i.e. cosine similarity) threshold, but we'll be fine with L2 distance, too.

benwtrent commented 1 year ago

In our case, our vectors are L2-normalized and we would rather use the dot product (i.e. cosine similarity) threshold, but we'll be fine with L2 distance, too.

thanks @anordertoreclaim , I have been thinking more on this, and I think it will be possible to simply utilize whatever distance metric was used to build the index (e.g. configured in the vector field).

I will get working on this this week.

benwtrent commented 1 year ago

I called it vector_similarity as it seemed to me the only "radius" would be if we did l2_norm.

But, effectively, it takes a similarity threshold that vectors must be "this similar" to be considered.

In the l2_norm case, similarity can be considered the radius of the sphere of dimension dims.

At each shard, it will gather num_candidates, only keeping docs where the vectors are within the similarity threshold. The document _score is the same as KNN and it supports boost.

https://github.com/elastic/elasticsearch/pull/93574

@anordertoreclaim @rw-access @nariman9119

javanna commented 1 year ago

Is this addressed by #94828 ?

benwtrent commented 1 year ago

@anordertoreclaim @rw-access @nariman9119 ^ Have you tried the new similarity field in knn?

benwtrent commented 1 year ago

Closing this issue. If the current similarity_threshold logic is inadequate, we can reopen or add new functionality.