Closed rw-access closed 1 year ago
Pinging @elastic/es-search (Team:Search)
@rw-access thanks for the feedback! I have also wondered if it'd be helpful to expose a "radius limit" capability. I agree it's more natural to implement it within Elasticsearch rather than client side.
Could you give a little more context on your use case? This can help us as we think through the API (for example is it a radius limit on the kNN search, or should there be a separate "radius query" that matches every document within the radius)?
Thanks for the reply @jtibshirani!
My use case is twofold: DBSCAN and near-duplicate detection.
For DBSCAN, I know my radius already (experimentally derived) and want to build clusters given that. This means that there are a lot of calls I make where I want to collect all the neighbors within a specific radius (with my current iterative approach I actually only need to collect the 1-2 closest within the radius).
For my usage it's perfectly okay for this to be approximate and that's an acceptable tradeoff. Also, when clustering with DBSCAN using approximate kNN, if point A doesn't find neighbor B within the approximate search + radius filter, there's a decent chance that B finds A.
For near-duplicate detection, I'm essentially doing a quick lookup with the parameters: k=1, num_candidates=10 and then do a client side check for the maximum radius.
Whether that's the same API or not is a good question. I'm not sure how HNSW handles exact kNN within a radius and if that as performant. For me at least, I don't need to be as exact. I would defer to your expertise.
I pondered the API and played around with a prototype. @rw-access it'd be great to hear your thoughts on this.
My current thinking is that we could implement this as a new query type, available in the _search
API. Here's an example of how it could be used:
POST index/_search
{
"query": {
"vector_radius": {
"field": "image-vector",
"query_vector": [0.1, ....],
"radius": 0.24,
"num_candidates": 100
}
}
}
The query would be available in the _search
endpoint because it can easily combine with other query types through a bool
, be used with aggregations, etc.
How the implementation would work: we'd use Lucene’s ANN search methods to retrieve num_candidates
nearest neighbors per segment. Then we'd would filter out those candidates that are not within the search radius
. This means that the user needs to pick a sufficiently large num_candidates
to make sure the results are not too incomplete. The implementation is basically the same as kNN search, except we apply a radius cutoff instead of taking the top k.
It may be possible to design a more sophisticated algorithm, where the user doesn’t need to specify num_candidates
, and we just continue to the graph search until we stop finding vectors within the radius. Also in theory it seems possible to use the radius to speed up the initial ANN search. I played around with a bunch of these potential improvements, but didn't find any promising directions! I also didn't like we would use a totally new algorithm without support of an academic paper :)
Are there any updates on this issue? We would find a lot of use in this feature
@jtibshirani hi!
We want to combine symbolic and semantic searches together and we would greatly benefit from this feature, as it would allow us to filter out irrelevant documents found through semantics (ANN). I understand that you're not working for ElasticSearch anymore, but perhaps you could forward it to right people.
@anordertoreclaim @rw-access I am iterating on this now.
Are we cool with doing an actual radius
filter (meaning L2 distance post-filtering), even if the vectors are stored with cosine
similarity (thinking future binary hamming distance in the future.... or dot-product with non-unit vectors...)? Some of our similarity metrics can easily map to a radius
, but others, maybe not.
Or are your scenarios @anordertoreclaim @rw-access only concerned with vectors stored with l2_norm
similarity already?
And ping @nariman9119 ^^ as well :)
In our case, our vectors are L2-normalized and we would rather use the dot product (i.e. cosine similarity) threshold, but we'll be fine with L2 distance, too.
In our case, our vectors are L2-normalized and we would rather use the dot product (i.e. cosine similarity) threshold, but we'll be fine with L2 distance, too.
thanks @anordertoreclaim , I have been thinking more on this, and I think it will be possible to simply utilize whatever distance metric was used to build the index (e.g. configured in the vector field).
I will get working on this this week.
I called it vector_similarity
as it seemed to me the only "radius" would be if we did l2_norm
.
But, effectively, it takes a similarity threshold that vectors must be "this similar" to be considered.
In the l2_norm
case, similarity
can be considered the radius of the sphere of dimension dims
.
At each shard, it will gather num_candidates
, only keeping docs where the vectors are within the similarity threshold. The document _score
is the same as KNN and it supports boost
.
https://github.com/elastic/elasticsearch/pull/93574
@anordertoreclaim @rw-access @nariman9119
Is this addressed by #94828 ?
@anordertoreclaim @rw-access @nariman9119 ^ Have you tried the new similarity
field in knn
?
Closing this issue. If the current similarity_threshold logic is inadequate, we can reopen or add new functionality.
Description
The API for approximate kNN search with
dense_vector
is 🔥. One of my current use cases involves a maximum distance check, as in the fixed-radius variant of nearest neighbors.It would be a nice addition to the API in my opinion to have the ability to add a radius filter to results. It may even provide opportunities to bail early when performing the HNSW search, but that could be tricky since sometimes the first approximate neighbor is actually farther away from the second approximate neighbor.
In addition to that, there are some other advantages by adding the maximum distance to the API and implementing it server side: when results from shards are aggregated to find the combined ANN across the index.
(I apologize if this comes across as overly prescriptive or if I made incorrect assumptions. I'm fairly certain that this can decently achieved client-side with some simple filtering and by increasing
num_candidates
)