elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Automatic cut-off of kNN results #99416

Open joshdevins opened 1 year ago

joshdevins commented 1 year ago

Description

Users new to vector search are sometimes surprised by one major difference relative to lexical/BM25 search: every vector "matches" the query vector. Users are often familiar only with lexical search, where the only documents returned are those containing keywords from the query (including synonyms, different surface forms from stemming, etc.). To partially address this, Elasticsearch implements a similarity_threshold argument in kNN search that lets a user statically set a threshold beyond which no results are returned. This solves the problem when a "good" threshold value can be set; in practice, however, that value can be difficult to find and can vary from query to query.
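For illustration, a rough sketch of the static approach, shown as a plain Python dict for a `_search` request body (field name and vector are made up, and the exact kNN option name may vary by version):

```python
# A minimal sketch of the existing static cut-off. The request body below
# would be sent to the _search endpoint of an index with a dense_vector field.
knn_request = {
    "knn": {
        "field": "title_embedding",            # hypothetical dense_vector field
        "query_vector": [0.12, -0.53, 0.91],   # hypothetical query embedding
        "k": 10,
        "num_candidates": 100,
        # Static threshold: hits whose similarity falls below this value are
        # dropped. Finding one value that works for every query is the hard part.
        "similarity": 0.75,
    }
}
```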

Before kNN was widely adopted for information retrieval, it was used for applications such as audio fingerprinting. In the literature, we find the same problem of difficulty in finding a universal threshold value. To address this, automatic methods were proposed.

As an Elasticsearch user, for a variety of use cases including information retrieval, I'd like the option for Elasticsearch to automatically select an appropriate kNN cut-off on a per-query basis. The automatic cut-off should aim to improve the precision of kNN search while sacrificing as little recall as possible. Benchmarks demonstrating the method should be provided, including common vector search benchmarks such as MS MARCO and BEIR.

References

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

kderusso commented 11 months ago

@liranabn This is a rather significant effort; I propose removing it from the Search Relevance team project.

joshdevins commented 11 months ago

@kderusso @liranabn I'd propose a technical spike first to determine the viability of a couple of approaches. I think it's something we can do in relatively short order. That would at least give some better scoping to the work for whoever picks it up next.

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

benwtrent commented 1 month ago

@jimczi @tteofili comparing with https://github.com/elastic/elasticsearch/issues/55603

I wonder if there is a way to unify the experience of "cut off the tail of irrelevant results" for any distribution of scores?

Something like min_score: auto that would work for any query (kNN, sparse vector, BM25). It seems to me that if the algorithm simply gathers the distribution of scores and detects where the "tail" begins, it can handle any of them just fine.
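As a rough illustration of what I mean (just a sketch, not a design), a simple heuristic could scan the descending score distribution for the largest relative drop and treat everything after it as the tail:

```python
def auto_cutoff(scores: list[float]) -> float:
    """Pick a cut-off from a descending score distribution.

    Illustrative heuristic only: the cut-off is placed at the largest
    relative drop between consecutive scores, i.e. where the "tail" of
    irrelevant results appears to begin. The returned value can be used
    like a min_score: hits scoring below it are dropped.
    """
    if len(scores) < 2:
        return float("-inf")  # nothing to cut
    best_gap, cutoff = 0.0, float("-inf")
    for higher, lower in zip(scores, scores[1:]):
        gap = (higher - lower) / max(higher, 1e-9)  # relative drop
        if gap > best_gap:
            best_gap, cutoff = gap, higher
    return cutoff

# Example: BM25-like scores with a clear tail after the third hit.
scores = [12.4, 11.9, 11.1, 4.2, 3.9, 3.7, 3.5]
print(auto_cutoff(scores))  # 11.1 -> the tail starting at 4.2 is cut
```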

benwtrent commented 1 month ago

Also, min_score could be configured to allow capturing additional extreme values, like min_score: auto^2 :).

This seems like a neat idea IMO. We should be able to support this in min_score, and then we get it "for free" everywhere (though nothing is really free).
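To make the syntax concrete, a hypothetical request body; min_score: "auto" does not exist today, it's just the proposal above:

```python
# Hypothetical only: "auto" is the proposed value discussed in this thread,
# not an existing Elasticsearch option. The same setting would apply whether
# the query is kNN, sparse vector, or BM25.
search_request = {
    "query": {"match": {"body": "how do I tune vector search recall"}},
    "min_score": "auto",  # proposed: derive the cut-off per query from the score distribution
}
```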

/cc @joshdevins ^

joshdevins commented 1 month ago

Using something like kneedle should allow this to work for all scores, independent of the source/scoring function. I'm curious about other methods we could evaluate as well, though, such as nearest neighbour ratios. In the best case, we evaluate a few methods against MS MARCO, BEIR, etc. and understand the characteristics of each before we make decisions about how to proceed. Since the biggest need is for matchers that match all documents (i.e., kNN), I would propose focusing the priority there.
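For example, one way to prototype the kneedle approach (a sketch only, using the third-party kneed package rather than anything we'd necessarily ship):

```python
# Sketch of knee detection on a descending score curve using the third-party
# `kneed` package (pip install kneed). Scores below the knee are treated as
# the irrelevant tail and dropped.
from kneed import KneeLocator

# Hypothetical per-query similarity scores, sorted descending: a few close
# matches followed by a long, flat tail (a convex, decreasing curve).
scores = [0.95, 0.80, 0.70, 0.64, 0.60, 0.58, 0.57, 0.56]
ranks = list(range(1, len(scores) + 1))

knee = KneeLocator(ranks, scores, curve="convex", direction="decreasing")

cutoff_rank = knee.knee      # rank at which the tail begins
cutoff_score = knee.knee_y   # score at that rank; usable as a min_score
kept = [s for s in scores if s >= cutoff_score]
print(cutoff_rank, cutoff_score, kept)
```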