Open joshdevins opened 1 year ago
Pinging @elastic/es-search (Team:Search)
@liranabn This is a rather significant effort; I propose removing this from the Search Relevance team project.
@kderusso @liranabn I'd propose a technical spike first to determine the viability of a couple of approaches. I think it's something we can do in relatively short order. That would at least give some better scoping to the work for whoever picks it up next.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
@jimczi @tteofili comparing with https://github.com/elastic/elasticsearch/issues/55603
I wonder if there is a way to unify the experience of "cutting off the tail of irrelevant things" for any distribution of scores? Something like `min_score: auto` that would work for any query (kNN, sparse vector, BM25). It seems to me that if the algorithm simply gathers the distribution of scores and sees where the "tail" begins, it can handle any of them just fine.

Also, `min_score` could be configured to allow capturing additional extreme values, like `min_score: auto^2` :).
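A minimal sketch of what such a `min_score: auto` heuristic could look like, using a largest-gap rule over the sorted scores. The function name, the `sensitivity` knob, and the gap rule are all illustrative assumptions here, not anything Elasticsearch implements:

```python
import statistics

def auto_min_score(scores, sensitivity=1.0):
    """Hypothetical `min_score: auto` sketch.

    Picks the threshold at the first unusually large gap between consecutive
    sorted scores, i.e. where the "tail" of the distribution appears to begin.
    The `sensitivity` knob hints at how a variant like `auto^2` could be tuned
    to admit more or less of the tail.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return float("-inf")  # nothing to cut
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    mean_gap = statistics.mean(gaps)
    for i, gap in enumerate(gaps):
        # Cut at the first gap that is much larger than the average gap.
        if gap > sensitivity * 2 * mean_gap:
            return ranked[i]  # keep hits scoring at least this much
    return ranked[-1]  # no obvious tail: keep everything

print(auto_min_score([12.1, 11.8, 11.5, 4.2, 4.0, 3.9]))  # -> 11.5
```

Because it only looks at the shape of the score list, the same rule applies unchanged to BM25, kNN, or sparse-vector scores, which is the appeal of putting it behind a single `min_score: auto` option.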
This seems like a neat idea IMO. We should be able to support this in `min_score` and then we get it "for free" everywhere (but nothing is really for free).
/cc @joshdevins ^
Using something like kneedle should allow this to work for all scores, independent of source/scoring function. I'm curious about other methods that we could evaluate as well though, such as nearest neighbour ratios. Best case scenario, we evaluate a few methods against MSMARCO, BEIR, etc. and understand the characteristics of each before we make decisions about how to proceed. Since the biggest need is for matchers that match all documents (i.e. kNN), I would propose to focus the priority there.
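For illustration, a simplified knee-point heuristic in the spirit of kneedle: find the point on the normalized, sorted score curve that is farthest from the chord joining its first and last points. The real kneedle algorithm adds smoothing and a sensitivity parameter; everything below, including the function name, is a hypothetical sketch rather than a proposed implementation:

```python
import numpy as np

def knee_index(scores):
    """Simplified knee detection on a ranked score list.

    Sort scores best-first, normalize rank and score to [0, 1], and pick the
    point farthest from the straight line joining the first and last points.
    """
    y = np.sort(np.asarray(scores, dtype=float))[::-1]  # scores, best first
    n = len(y)
    if n < 3 or y[0] == y[-1]:
        return n  # nothing to cut
    x = np.linspace(0.0, 1.0, n)           # normalized rank
    yn = (y - y[-1]) / (y[0] - y[-1])      # normalized score, 1.0 .. 0.0
    # Perpendicular distance from each point to the chord (0, 1) -> (1, 0).
    dist = np.abs(1.0 - x - yn) / np.sqrt(2.0)
    return int(np.argmax(dist))            # hits before this index form the "head"

scores = [0.95, 0.93, 0.90, 0.52, 0.50, 0.48, 0.47, 0.46, 0.45]
k = knee_index(scores)
print(sorted(scores, reverse=True)[:k])    # -> [0.95, 0.93, 0.90]
```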
Description
Users new to vector search are sometimes surprised by one major difference relative to lexical/BM25 search: all vectors "match" the query vector. Users are frequently familiar only with lexical search, where documents are only returned if they contain keywords from the query (including synonyms, different surface forms from stemming, etc.). To partially address this, Elasticsearch implements a `similarity_threshold` argument in kNN search that allows a user to statically select a threshold, so that no results beyond the provided similarity threshold are returned. This solves the problem when a "good" threshold value can be set; however, in practice this threshold value can be difficult to find and can vary from query to query.

Before kNN was widely adopted for information retrieval, it was used for applications such as audio fingerprinting. In the literature, we find this same problem of difficulty in finding a universal threshold value. To address the problem, automatic methods were proposed.
As an Elasticsearch user, for a variety of use cases including information retrieval, I'd like the option for Elasticsearch to automatically select an appropriate kNN cut-off on a per-query basis. The automatic cut-off should aim to improve the precision of kNN search while sacrificing as little recall as possible. Benchmarking should be provided to validate the method, including on common vector search benchmarks like MS MARCO and BEIR.
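To make the intended precision/recall trade-off concrete, here is a toy evaluation sketch with hypothetical document IDs and relevance judgments (not drawn from MS MARCO or BEIR):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a judged relevant set."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judged query: d1-d3 are relevant, kNN returned six hits.
relevant = {"d1", "d2", "d3"}
knn_hits = ["d1", "d2", "d3", "d9", "d8", "d7"]  # full k=6 result list
cut_hits = knn_hits[:3]                          # after an automatic cut-off

print(precision_recall(knn_hits, relevant))  # (0.5, 1.0)
print(precision_recall(cut_hits, relevant))  # (1.0, 1.0): precision up, recall kept
```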
References