elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

cross_fields multi match with dfs_query_then_fetch type uses field level idf #10346

Open micpalmia opened 9 years ago

micpalmia commented 9 years ago

I tested this on Elasticsearch 1.5.0 (and also on 1.4.2 and 1.3.0).

The documentation for the multi_match query of type cross_fields states that:

"The problem of differing term frequencies is solved by blending the term frequencies for all fields in order to even out the differences."

This holds true when executing a query of this type with search type query_then_fetch: in this case, the same (approximated, shard-level) idf is used for all fields. With search type dfs_query_then_fetch, on the other hand, each field's own idf is used.

I would expect the scatter (DFS) phase to provide the multi_match query with the correctly merged idf, rather than completely overwriting the (approximated but still more correct) shard-level idf with global, field-specific idfs.
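To make the difference concrete, here is a minimal sketch of the two behaviours. It assumes the classic Lucene TF/IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and it assumes that blending amounts to using the largest document frequency across the queried fields; the real blending logic is more involved, so treat this as an illustration only. The field names and counts are hypothetical.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene TF/IDF inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def per_field_idf(doc_freqs: dict, num_docs: int) -> dict:
    """Each field scored with its own document frequency (the reported behaviour)."""
    return {field: classic_idf(df, num_docs) for field, df in doc_freqs.items()}


def blended_idf(doc_freqs: dict, num_docs: int) -> dict:
    """All fields share one document frequency.

    Assumption: blending is simplified here to the maximum document
    frequency across the fields.
    """
    shared_df = max(doc_freqs.values())
    return {field: classic_idf(shared_df, num_docs) for field in doc_freqs}


# Hypothetical statistics: the term is rare in one field, common in the other.
doc_freqs = {"first_name": 1, "last_name": 8}
num_docs = 10

print("per-field:", per_field_idf(doc_freqs, num_docs))
print("blended:  ", blended_idf(doc_freqs, num_docs))
```

With per-field statistics, a match in the field where the term is rare gets a much higher idf; with blended statistics, both fields receive the same idf, which is what the cross_fields documentation describes.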

A simple reproduction is provided in the following gist: https://gist.github.com/micpalmia/c812200617307d78d495

A series of documents is inserted into a single shard, and when a cross_fields query is executed with dfs_query_then_fetch, unmerged idfs are used for the two fields.
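For readers who do not open the gist, the comparison can be sketched roughly as follows. This is not the gist's content: the index name, field names, and query text are hypothetical, and the snippet only builds the two requests (path and JSON body) without sending them. Setting "explain" to true in the body asks Elasticsearch to return the score breakdown, where the idf used for each field is visible.

```python
import json
from urllib.parse import urlencode


def build_search(index: str, search_type: str, text: str, fields: list):
    """Build the request path and body for a cross_fields multi_match search."""
    path = f"/{index}/_search?" + urlencode({"search_type": search_type})
    body = {
        "explain": True,  # include the per-field score explanation in each hit
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": fields,
            }
        },
    }
    return path, body


# Hypothetical index and fields; run the same query under both search types
# and compare the idf values reported in the explanations.
for search_type in ("query_then_fetch", "dfs_query_then_fetch"):
    path, body = build_search("test", search_type, "smith", ["first_name", "last_name"])
    print("GET", path)
    print(json.dumps(body, indent=2))
```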

clintongormley commented 9 years ago

Nice demonstration. I confirm that this is a bug.

vharitonsky commented 8 years ago

+

mkrakovian commented 8 years ago

@clintongormley Hi, could you please give an update on the status of this issue? We've run into it as well, running the query on multiple shards with dfs_query_then_fetch. Also, while trying to understand the cause, I came across this post on Elastic Discuss: question

gjbh-idematica commented 7 years ago

I also ran into this bug. It would be great if it could be solved.

JosephTucci commented 7 years ago

I have also run into this issue.

cbuescher commented 6 years ago

cc @elastic/es-search-aggs

piyush-scio commented 5 years ago

Facing the same issue -- any updates on this?

ywelsch commented 2 years ago

I can confirm that the implementation of this feature (BlendedTermQuery) indeed does not take distributed stats into account.

Note that the cross_fields type blends field statistics in a way that does not always produce well-formed scores (for example scores can become negative). As an alternative, you can consider the combined_fields query, which is also term-centric but combines field statistics in a more robust way.

While combined_fields previously did not work with dfs_query_then_fetch either, this has been fixed (to be released in an upcoming ES version).
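As a rough sketch of the suggested alternative, the request below swaps the cross_fields multi_match for a combined_fields query. The index name, field names, and query text are hypothetical, and the snippet only builds the request rather than sending it. Note that combined_fields expects text fields that share the same search analyzer.

```python
import json
from urllib.parse import urlencode


def build_combined_fields_search(index: str, text: str, fields: list):
    """Build the request path and body for a combined_fields search using DFS."""
    path = f"/{index}/_search?" + urlencode({"search_type": "dfs_query_then_fetch"})
    body = {
        "query": {
            "combined_fields": {
                "query": text,
                "fields": fields,
            }
        }
    }
    return path, body


# Hypothetical index and fields, for illustration only.
path, body = build_combined_fields_search("test", "smith", ["first_name", "last_name"])
print("GET", path)
print(json.dumps(body, indent=2))
```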

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)