Open aslamy opened 6 years ago
cc @elastic/es-search-aggs
@romseygeek Could you take a look at this?
This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph filter, or to remove the stop words manually from the synonyms defined for the filter.
I will be closing this issue, as the bug is at the Lucene level (an issue has been opened there and is currently in progress), and there is nothing we can do at the Elastic level.
Hey @jimczi - just wanted to follow up on this. I'm getting a similar issue. The exact bug above (where only 2 out of 3 matches are found) no longer occurs (I'm using ES 7.6.0) - good news. And if you switch the order of the stopword and synonym_graph filters, you still get the illegal_argument_exception, as expected (the Lucene bug has not been fixed). HOWEVER, with the filters in the new order, the workaround described above does not work:
"This is a known issue in Lucene and we're currently discussing different options for the fix: https://issues.apache.org/jira/browse/LUCENE-8137 The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter."
If, in the example above, you put the synonym_graph filter AFTER the stopwords filter AND manually remove stopwords from the synonyms (i.e. now synonyms=["world war, wow"]), then a query with "world of war" CANNOT match text with "world of war". Did I misunderstand the workaround? (That's very likely because I imagine lots of people use synonym_graph with stopwords.)
Thanks in advance!
(PS: the reason I need to put synonym_graph AFTER stopwords is that the stopwords are case sensitive whereas the synonyms are not case sensitive)
If helpful, here are the requests I'm running:
PUT /test-xxx
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "english_stopwords_tokenfilter"
          ],
          "tokenizer": "standard"
        },
        "english_search_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "english_stopwords_tokenfilter",
            "synonym_graph_tokenfilter"
          ],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "english_stopwords_tokenfilter": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "synonym_graph_tokenfilter": {
          "type": "synonym_graph",
          "synonyms": [
            "world war, wow"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english_analyzer",
        "search_analyzer": "english_search_analyzer"
      }
    }
  }
}
POST _bulk
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"wow" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war. wow" }
GET /test-xxx/_search
{
  "query": {
    "match": {
      "title": "world of war"
    }
  },
  "highlight": {
    "fields": {
      "title": {
        "fragment_size": 0,
        "type": "unified"
      }
    }
  }
}
DELETE /test-xxx
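To see what the search analyzer actually emits for this query, the _analyze API can be run against the index before the DELETE above. This is only a sketch for inspection, reusing the index and analyzer names from the requests above; it does not fix anything. The response lists each token with its position, which shows where the removed stop word leaves a gap and how the synonym rule was applied:

```console
GET /test-xxx/_analyze
{
  "analyzer": "english_search_analyzer",
  "text": "world of war"
}
```

Running the same request with "analyzer": "english_analyzer" gives the index-time tokens for comparison.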
I am reopening this issue since it's a long-standing bug and it's not resolved in Lucene.
The only workaround that works at the moment is to not use stop words, at index and query time.
You can define rules with and without stop words, for instance:
"world of war, world war, wow"
should match all variations.
Removing terms in a filter before or after the synonym graph should be avoided until the bug is resolved.
We want to solve this situation but it is not likely to happen before a major release considering the changes that are required on the analysis chain.
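A minimal sketch of that workaround, assuming the same analyzer, filter, and field names as the requests earlier in the thread. The index name test-xxx-workaround is made up for illustration. The stop filter is dropped from both analyzers, and the synonym rule lists the variants with and without the stop word:

```console
PUT /test-xxx-workaround
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        },
        "english_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_graph_tokenfilter"
          ]
        }
      },
      "filter": {
        "synonym_graph_tokenfilter": {
          "type": "synonym_graph",
          "synonyms": [
            "world of war, world war, wow"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english_analyzer",
        "search_analyzer": "english_search_analyzer"
      }
    }
  }
}
```

The trade-off is that stop words are now indexed and searched like any other term, so they are no longer filtered out of either documents or queries.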
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Elasticsearch 6.2.0
Description: When using stop and graph synonym filters together, the document that should match doesn't match and highlight doesn't work as it should.
Steps to reproduce:
Mapping
Indexing 3 documents
Search
Search Result:
Problems:
Bug 1. Document { "title":"world of war" } does not match, but it should.
Bug 2. The highlighter does not highlight "world of war".
I have also tried to put synonym_graph_tokenfilter after english_stopwords_tokenfilter filter but I get: