elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

How to prune non-relevant top documents automatically #55603

Open jimczi opened 4 years ago

jimczi commented 4 years ago

In 7.0 we added an optimization that allows pure disjunction (OR) queries to run without visiting all matches of the most frequent terms. Prior to this version, users had to either remove the most frequent terms (stop word removal) or switch to the common terms query to get acceptable performance. For this reason we decided to deprecate the common terms query: users shouldn't rely on a cutoff_frequency to ensure fast disjunctions. The cutoff_frequency is also slightly dangerous to use, because it should change as documents are added or deleted, and because the frequency of the same term can differ even between replicas (since deleted docs are part of the count). A small change in your index can make some queries much slower because a high-frequency term no longer reaches the current cutoff_frequency.

However, the common terms query is also sometimes used to improve the precision of search results. For instance, the query "the OR beatles" would return top documents containing only "the" if no document contains the term "beatles". Using the common terms query can ensure (assuming that the cutoff_frequency treats "the" as a frequent term) that no results are returned in this case. This looks like a valid use case for this query, so we're wondering if we should un-deprecate it, since we don't have a direct replacement for this feature. One thing that was raised during the initial discussion is that we should look at improving the detection of high-frequency terms without requiring users to provide a precise cutoff_frequency. We also think it's worth discussing all options, which is why I am opening this issue and marking it as a blocker for 8.0.
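To make the "the OR beatles" case concrete, here is a minimal sketch in plain Python. It does not call Elasticsearch; it only imitates the idea of splitting terms by document frequency. The toy corpus, the function names, and the 0.5 cutoff are assumptions made for illustration, and the handling of the all-frequent case is a simplification chosen for this sketch.

```python
from collections import Counter

# Hypothetical toy corpus (doc id -> text); none of it mentions "beatles".
DOCS = {
    1: "the long and winding road",
    2: "the wall",
    3: "over the hills and far away",
    4: "the rolling stones",
}


def doc_frequencies(docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))
    return df


def plain_or(docs, terms):
    """Pure disjunction: a document matches if it contains any query term."""
    wanted = set(terms)
    return sorted(d for d, text in docs.items() if wanted & set(text.split()))


def common_terms_like(docs, terms, cutoff_frequency):
    """Imitate the cutoff idea: frequent terms alone cannot produce a match.

    A term is treated as frequent when the fraction of documents that
    contain it is above cutoff_frequency (a relative cutoff between 0 and 1).
    """
    df = doc_frequencies(docs)
    total = len(docs)
    low = {t for t in terms if df[t] / total <= cutoff_frequency}
    if not low:
        # Simplification for this sketch: when every term is frequent,
        # require all of them.
        wanted = set(terms)
        return sorted(d for d, text in docs.items() if wanted <= set(text.split()))
    # At least one low-frequency term must be present for a document to match.
    return sorted(d for d, text in docs.items() if low & set(text.split()))


print(plain_or(DOCS, ["the", "beatles"]))
print(common_terms_like(DOCS, ["the", "beatles"], cutoff_frequency=0.5))
```

With this corpus the plain disjunction matches every document through "the", while the cutoff-based variant matches none, because "beatles" is the only low-frequency term and no document contains it.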

I am curious to hear thoughts from users of the common terms query, and particularly how you update the cutoff_frequency as your indices change?

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Search)

mayya-sharipova commented 4 years ago

we should look at improving the detection of high frequent terms without the need for users to provide a precise cutoff_frequency.

Another idea is to have a dynamic min_score that is supplied as a percentage of the max_score. After we have collected the top hits and know their max_score and min_score, we can filter out hits whose scores are too low.
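A rough client-side sketch of that idea, assuming the top hits have already been collected and that each hit carries an _id and a _score. The function name and the min_score_ratio parameter are hypothetical and do not correspond to an existing Elasticsearch option.

```python
def prune_by_relative_score(hits, min_score_ratio):
    """Keep hits whose score is at least min_score_ratio * max_score.

    hits: list of dicts with a "_score" key (already collected top hits).
    min_score_ratio: hypothetical parameter between 0 and 1.
    """
    if not hits:
        return []
    max_score = max(h["_score"] for h in hits)
    threshold = max_score * min_score_ratio
    return [h for h in hits if h["_score"] >= threshold]


# Made-up scores for illustration only.
hits = [
    {"_id": "a", "_score": 12.0},
    {"_id": "b", "_score": 9.5},
    {"_id": "c", "_score": 1.2},
]

kept = prune_by_relative_score(hits, min_score_ratio=0.5)
print([h["_id"] for h in kept])
```

Here the threshold is 6.0 (half of the top score), so the low-scoring hit "c" is dropped. A relative threshold like this adapts to each query, whereas an absolute min_score has to be tuned per query.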

markharwood commented 4 years ago

However, the common terms query is also sometimes used to improve the precision of search results.

Dropping terms is one way to improve precision but there are others too:

When looking at ways to trim long tails of garbage, I was considering how to tell when switching from a stricter search strategy to a weaker one breaks the meaning of the query. I think this may be detectable if you are lucky enough to have well-categorised data (many ecommerce vendors spend a lot of time on this). There can be a step-change in the diversity of categories as you switch from a strong strategy to a weaker one: the count of categories can act as a measure of the number of different meanings a query clause has. Consider this analysis of high-scoring versus low-scoring results in an ecommerce query:

If we can organise results into buckets based on query clause strictness (I used large boosts to separate two clauses in the above example), then we can use the count of categories in each bucket as a measure of the focus of each clause. Poorly focused clauses might be ones with hundreds of categories, and those are the ones we might choose to drop.
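The bucketing idea can be sketched roughly as follows, assuming each hit carries a score and a category and that a large boost on the strict clause separates the two score ranges. The data, the score boundary, and the max_ratio heuristic are all made up for illustration and are not taken from the analysis above.

```python
def categories_per_bucket(hits, score_boundary):
    """Count distinct categories among strict-clause and weak-clause hits.

    A hit is assigned to the "strict" bucket when its score is at or above
    score_boundary (the boost on the strict clause is assumed to push its
    matches above that value).
    """
    strict, weak = set(), set()
    for hit in hits:
        bucket = strict if hit["score"] >= score_boundary else weak
        bucket.add(hit["category"])
    return {"strict": len(strict), "weak": len(weak)}


def weak_clause_looks_unfocused(counts, max_ratio=3.0):
    """Hypothetical heuristic: flag the weak clause when it spans far more
    categories than the strict one."""
    return counts["weak"] > max_ratio * max(counts["strict"], 1)


# Made-up hits: strict matches score above 100 because of the boost.
hits = [
    {"score": 130.0, "category": "guitars"},
    {"score": 121.0, "category": "guitars"},
    {"score": 118.0, "category": "guitar accessories"},
    {"score": 4.1, "category": "toys"},
    {"score": 3.9, "category": "books"},
    {"score": 3.5, "category": "kitchen"},
    {"score": 3.2, "category": "garden"},
    {"score": 2.8, "category": "clothing"},
    {"score": 2.5, "category": "posters"},
    {"score": 2.1, "category": "phone cases"},
]

counts = categories_per_bucket(hits, score_boundary=100.0)
print(counts, weak_clause_looks_unfocused(counts))
```

In this made-up data the strict bucket covers 2 categories and the weak bucket covers 7, so the weak clause would be flagged as a candidate for dropping.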

mabdelhedi commented 3 years ago

Hello, I am also running into this when removing the cutoff_frequency from my multi-match query. For example, when searching a "street" field for "jump street", I get some documents that match only the word "street" (a frequent word) ...

How can I really "cut off" the frequent term so that it is ignored in the query (so that I get only documents containing at least "jump") without having to look up all frequent words and define them as stopwords?

jimczi commented 3 years ago

I am removing the blocker label for now. We have not yet decided whether we should restore the functionality or provide a replacement.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)