elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

using synonym_graph force elastic to double score the document #28982

Open ahmadazimi opened 6 years ago

ahmadazimi commented 6 years ago

Elasticsearch version: 6.2.2, Build: 10b1edd/2018-02-16T19:01:30.685723Z

Plugins installed: []

JVM version: 1.8.0_144

OS version: Ubuntu, Linux 4.4.0-104-generic

When I use synonym_graph in a search-time analyzer, synonyms that consist of more than one word, for example coffee shop, are treated as two separate words, which doubles the score!

I defined coffee shop as a synonym of cafe. When I search for cafe, all documents that have coffee shop in their titles get higher scores (about 2 times higher) than otherwise identical documents that have cafe in their titles.
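For readers trying to reproduce the setup, the configuration described above might look roughly like the following sketch. The filter name `my_synonyms` and the analyzer name `synonym_search` are hypothetical, as is the choice of the standard tokenizer and lowercase filter; only the `search` field name, the `synonym_graph` filter type, and the disabled norms come from this thread.

```python
import json

# Hypothetical names: "my_synonyms" and "synonym_search" are illustrative only.
index_settings = {
    "analysis": {
        "filter": {
            "my_synonyms": {
                "type": "synonym_graph",
                # "coffee shop" and "cafe" are declared equivalent.
                "synonyms": ["coffee shop, cafe"],
            }
        },
        "analyzer": {
            "synonym_search": {
                "tokenizer": "standard",
                "filter": ["lowercase", "my_synonyms"],
            }
        },
    }
}

# Field mapping fragment: synonyms are expanded at search time only.
# The reporter mentions later in the thread that norms are disabled on this field.
search_field_mapping = {
    "search": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_search",
        "norms": False,
    }
}

print(json.dumps({"settings": index_settings}, indent=2))
```

The two dictionaries are meant to be sent as the settings and field-mapping portions of an index creation request; the exact request shape depends on the Elasticsearch version in use.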

I used the Explain API and found these scores returned by Elasticsearch:

For a document with coffee shop in its title, sum of:

59.249336 weight(search:coffee in 9429) [PerFieldSimilarity]
63.80951 weight(search:shop in 9429) [PerFieldSimilarity]

And for another document with cafe in its title:

34.8931 weight(search:cafe in 4409) [PerFieldSimilarity]
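To make the gap concrete, the arithmetic on the explain output can be checked with a short sketch. The three weights are copied from the report; the variable names are illustrative.

```python
# Term weights copied from the explain output in the report.
coffee = 59.249336
shop = 63.80951
cafe = 34.8931

# The "coffee shop" document is scored as the sum of its two term weights.
coffee_shop_total = coffee + shop
ratio = coffee_shop_total / cafe

print(f"coffee shop total: {coffee_shop_total:.6f}")  # 123.058846
print(f"ratio to cafe:     {ratio:.2f}")              # 3.53
```

On these particular numbers the summed score is roughly 3.5 times the cafe score, and each individual term weight is already higher than the cafe weight, so the exact factor will vary with the term statistics of the index.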

Is this a bug in synonym_graph, or did I make a mistake?

PS: all other keywords are the same for these two documents.

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

ahmadazimi commented 6 years ago

Any update on this issue?

ahmadazimi commented 6 years ago

Yesterday I found another serious issue that seems related to the one above (synonym_graph). Imagine you have two documents, Emmy Cafe and Emmy Coffee Wholesale Shop, and you define coffee shop as a synonym of cafe via synonym_graph. Now when you search for cafe, the second document, which has both coffee and shop in its title, gets a score about two times greater than the first document and always appears first in the result set. PS: norms are set to false in the mapping for the search field.

colings86 commented 6 years ago

@romseygeek could you take a look at this?

jimczi commented 6 years ago

I think it's reasonable to use a max disjunction for multi-term synonyms. Currently the scores of the matching synonyms are simply added, but we should select the max score. As @ahmadazimi reported in his last comment, this is not enough, since the scoring also depends on the number of terms in each variant. We have something in place for single-term synonyms with the SynonymQuery, but it would be difficult to generalize the idea to multi-term synonyms. Changing the query to use a max disjunction is trivial, so we should start with that; this will already improve things. In the meantime we can think of a more general solution that would produce a single score per synonym rule, but that's not low-hanging fruit.
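The difference between adding variant scores and taking their maximum can be illustrated with a toy model. This is only a sketch of the idea, not Lucene's actual scoring code: it assumes each synonym variant scores as the sum of its term weights, reuses the weights from the original report as illustrative numbers, and the document that matches both variants is hypothetical.

```python
from typing import Dict, List


def variant_score(term_weights: List[float]) -> float:
    """Score of one synonym variant: the sum of its matching term weights."""
    return sum(term_weights)


def combine_sum(variants: Dict[str, List[float]]) -> float:
    """Behaviour described as current: scores of all matching variants are added."""
    return sum(variant_score(w) for w in variants.values())


def combine_max(variants: Dict[str, List[float]]) -> float:
    """Proposed max disjunction: only the best-scoring variant counts."""
    return max((variant_score(w) for w in variants.values()), default=0.0)


# Illustrative weights taken from the explain output in the original report.
only_cafe = {"cafe": [34.8931]}
only_coffee_shop = {"coffee shop": [59.249336, 63.80951]}
# Hypothetical document that matches both variants.
both = {**only_cafe, **only_coffee_shop}

for name, doc in [("cafe", only_cafe), ("coffee shop", only_coffee_shop), ("both", both)]:
    print(f"{name:12s} sum={combine_sum(doc):.6f} max={combine_max(doc):.6f}")
```

In this model the max disjunction stops a document that matches both variants from being counted twice, but the coffee shop document still outscores the cafe document because its variant sums two term weights. That is the remaining problem described above, where the score depends on the number of terms in each variant.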

ahmadazimi commented 6 years ago

So is there any easy way to handle it in the current version (6.2.2)?

jimczi commented 6 years ago

No, there is no workaround in the current version; we'll need a patch. The first step is to select the best synonym score per document, which as I said should be trivial to do. The second is to work on a solution that produces similar scores for documents that match cafe and documents that match coffee shop. The latter is not something we can do easily, though, so I wouldn't expect a solution anytime soon.

pierremalletneo9 commented 5 years ago

Hello, will the first step of the solution explained by jimczi be implemented in Elasticsearch in the current 7.x version? Currently, scoring with multi-word synonyms is a bit hard to work with. Thanks!

BenjD90 commented 2 years ago

Hello @jimczi,

Is there any room in your roadmap for this improvement?

Thanks

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)