Open ahmadazimi opened 6 years ago
Pinging @elastic/es-search-aggs
Any update on this issue?
Yesterday I found another unacceptable issue that looks related to the one above (synonym_graph).
Imagine you have two documents, Emmy Cafe and Emmy Coffee Wholesale Shop, and you define coffee shop as a synonym of cafe via synonym_graph.
Now when you search for cafe, the second document, which has coffee and shop in its title, gets a score about two times greater than the first document and always comes first in the result set.
PS. norms is set to false in the mapping for the search field.
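For readers trying to reproduce this, here is a minimal sketch of the kind of index configuration the comment describes, written as a Python dict that could be sent as the index-creation body. The field name search comes from the explain output in the report; the filter name, analyzer names, and mapping type name are hypothetical placeholders, and the exact synonym rule the reporter used is an assumption.

```python
import json

# Hypothetical index body. Only "synonym_graph", the field name "search",
# and norms=false come from the thread; every other name is a placeholder.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym_graph",
                    # Assumed rule: treat "cafe" and "coffee shop" as equivalent.
                    "synonyms": ["cafe, coffee shop"],
                }
            },
            "analyzer": {
                # Index-time analyzer without synonyms.
                "plain_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
                # Search-time analyzer that expands synonyms.
                "search_with_synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                },
            },
        }
    },
    "mappings": {
        "doc": {  # placeholder mapping type name (6.x mappings are typed)
            "properties": {
                "search": {
                    "type": "text",
                    "analyzer": "plain_text",
                    "search_analyzer": "search_with_synonyms",
                    "norms": False,
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With a setup like this, a search for cafe is expanded at query time into cafe and coffee shop, which is where the multi-term variant enters the scoring.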
@romseygeek could you take a look at this?
I think it's reasonable to use a max disjunction for multi-term synonyms. Currently the scores of the matching synonyms are simply added, but we should select the max score. As @ahmadazimi reported in his last comment, this is not enough, since the scoring also depends on the number of terms in each variant. We have something in place for single-term synonyms with the SynonymQuery, but it would be difficult to generalize the idea to multi-term synonyms. Changing the query to use a max disjunction is trivial, so we should start with that; this will already improve things. In the meantime we can think of a more general solution that would produce a single score per synonym rule, but that's not low-hanging fruit.
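To make the difference between the two combination strategies concrete, here is a small sketch that is not Lucene code. The function names are hypothetical and the per-variant scores are made-up illustrative numbers, with the coffee shop variant scored as the sum of its two terms.

```python
def combine_sum(variant_scores):
    """Current behaviour as described above: add the scores of all matching variants."""
    return sum(variant_scores)


def combine_max(variant_scores, tie_breaker=0.0):
    """Max disjunction: keep the best variant, optionally add a fraction of the rest."""
    if not variant_scores:
        return 0.0
    best = max(variant_scores)
    return best + tie_breaker * (sum(variant_scores) - best)


# Illustrative scores only: "cafe" scores 3.0, "coffee shop" scores 2.5 + 2.5.
cafe = 3.0
coffee_shop = 2.5 + 2.5

doc_both = [cafe, coffee_shop]   # document matching both variants
doc_cafe_only = [cafe]           # document matching only "cafe"
doc_phrase_only = [coffee_shop]  # document matching only "coffee shop"

for name, scores in [("both", doc_both),
                     ("cafe only", doc_cafe_only),
                     ("coffee shop only", doc_phrase_only)]:
    print(name, combine_sum(scores), combine_max(scores))
```

In this sketch the max disjunction stops a document that matches both variants from being counted twice, but a document matching only coffee shop still outscores one matching only cafe, which is the remaining problem mentioned above.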
So is there any easy way to handle it in the current version (6.2.2)?
No, there is no workaround in the current version. We'll need a patch, first to select the best synonym score per document, which as I said should be trivial to do, and then work on a solution to produce similar scores for documents that match cafe and documents that match coffee shop. The latter is not something that we can do easily, though, so I wouldn't expect a solution anytime soon.
Hello, will the first step of the solution explained by jimczi be implemented in Elasticsearch in the current 7.x version? Currently, scoring with multi-word synonyms is a bit hard to work with. Thanks!
Hello @jimczi,
Is there any space in your roadmap for this improvement?
Thanks
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Elasticsearch version: 6.2.2, Build: 10b1edd/2018-02-16T19:01:30.685723Z
Plugins installed: []
JVM version: 1.8.0_144
OS version: Ubuntu, Linux 4.4.0-104-generic
When I use synonym_graph in a search-time analyzer, some words that have more than one segment, for example coffee shop, are treated as two words, which doubles the score! I defined coffee shop as a synonym of cafe; then, when I search for cafe, all documents that have coffee shop in their titles get greater scores (about 2 times greater) than otherwise identical documents that have cafe in their titles. I used the Explain API and found these scores returned by Elasticsearch:
For a document with coffee shop in its title, sum of:
59.249336 weight(search:coffee in 9429) [PerFieldSimilarity]
63.80951 weight(search:shop in 9429) [PerFieldSimilarity]
And for another document with cafe in its title:
34.8931 weight(search:cafe in 4409) [PerFieldSimilarity]
Is this a bug in synonym_graph, or did I make a mistake?
PS: all other keywords for these two documents are the same.
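The arithmetic behind the explain output quoted above can be checked with a short sketch. Only the three weights quoted in the report are used, so the result reflects these clauses alone rather than the full document scores; the variable names are illustrative.

```python
# Term weights copied from the Explain API output quoted in the report.
coffee_weight = 59.249336
shop_weight = 63.80951
cafe_weight = 34.8931

# The "coffee shop" variant is scored as the sum of its two term weights.
coffee_shop_score = coffee_weight + shop_weight

# How much larger the two-term contribution is than the single-term one.
ratio = coffee_shop_score / cafe_weight

print(f"coffee + shop: {coffee_shop_score:.6f}")
print(f"cafe:          {cafe_weight:.6f}")
print(f"ratio:         {ratio:.2f}")
```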