elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Bug: When using graph synonym and stop token filter together #28838

Open aslamy opened 6 years ago

aslamy commented 6 years ago

Elasticsearch 6.2.0

Description: When the stop and graph synonym filters are used together, a document that should match does not match, and highlighting does not work as expected.

Steps to reproduce:

Mapping

{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "synonym_graph_tokenfilter",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world of war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
      "doc":{  
         "properties":{  
            "title":{  
               "type":"text",
               "analyzer":"english_analyzer",
               "search_analyzer":"english_search_analyzer"
            }
         }
      }
   }
}

Indexing 3 documents

{  "title":"world of war"}
{  "title":"wow"}
{  "title":"world of war. wow"}

Search

{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

Search Result:

{  
   "took":1,
   "timed_out":false,
   "_shards":{  
      "total":5,
      "successful":5,
      "skipped":0,
      "failed":0
   },
   "hits":{  
      "total":2,
      "max_score":0.2876821,
      "hits":[  
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"2",
            "_score":0.2876821,
            "_source":{  
               "title":"world of war. wow"
            },
            "highlight":{  
               "title":[  
                  "world of war. <em>wow</em>"
               ]
            }
         },
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"1",
            "_score":0.2876821,
            "_source":{  
               "title":"wow"
            },
            "highlight":{  
               "title":[  
                  "<em>wow</em>"
               ]
            }
         }
      ]
   }
}

Problems:

Bug 1. The document { "title":"world of war"} does not match, although it should.
Bug 2. The highlighter does not highlight "world of war".

I have also tried putting synonym_graph_tokenfilter after english_stopwords_tokenfilter, but then I get:

{  
   "error":{  
      "root_cause":[  
         {  
            "type":"illegal_argument_exception",
            "reason":"failed to build synonyms"
         }
      ],
      "type":"illegal_argument_exception",
      "reason":"failed to build synonyms",
      "caused_by":{  
         "type":"parse_exception",
         "reason":"Invalid synonym rule at line 1",
         "caused_by":{  
            "type":"illegal_argument_exception",
            "reason":"term: world of war analyzed to a token (war) with position increment != 1 (got: 2)"
         }
      }
   },
   "status":400
}
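
The error message suggests that the synonym rule itself is run through the filters that precede synonym_graph, so the stop filter removes "of" and leaves a position gap before "war". The following is a toy sketch of that behaviour, not Lucene code: `analyze`, `check_synonym_rule` and the small `STOPWORDS` set are hypothetical names invented for illustration, and the real `_english_` list is longer.

```python
# Toy model of a stop filter; NOT Lucene code. It only mimics how removing a
# token leaves a gap that is recorded in the next token's position increment.
STOPWORDS = {"of", "the", "a", "an", "and"}  # assumption: tiny subset of _english_


def analyze(text, stopwords=STOPWORDS):
    """Return (token, position_increment) pairs after lowercasing and stop removal."""
    tokens = []
    increment = 1
    for word in text.lower().split():
        if word in stopwords:
            increment += 1  # the removed token still occupies a position
            continue
        tokens.append((word, increment))
        increment = 1
    return tokens


def check_synonym_rule(rule, stopwords=STOPWORDS):
    """Raise ValueError if any term of the rule analyzes with a position gap."""
    for term in (t.strip() for t in rule.split(",")):
        for token, inc in analyze(term, stopwords):
            if inc != 1:
                raise ValueError(
                    f"term: {term} analyzed to a token ({token}) "
                    f"with position increment != 1 (got: {inc})"
                )


print(analyze("world of war"))  # [('world', 1), ('war', 2)]
```

In this model, `check_synonym_rule("world of war, wow")` raises, while `check_synonym_rule("world war, wow")` passes, which mirrors why the rule is rejected when the stop filter comes first.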

javanna commented 6 years ago

cc @elastic/es-search-aggs

colings86 commented 6 years ago

@romseygeek Could you take a look at this?

jimczi commented 6 years ago

This is a known issue in Lucene and we're currently discussing different options for the fix: https://issues.apache.org/jira/browse/LUCENE-8137 The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

mayya-sharipova commented 6 years ago

I will be closing this issue, as the issue is on the Lucene level (it has been opened there and is currently in progress), and there is nothing we can do on the Elastic level.

kut commented 4 years ago

Hey @jimczi - just wanted to follow up on this. I'm getting a similar issue. The exact bug above (where only 2 out of 3 matches are found) no longer occurs (I'm using ES 7.6.0) - good news. And if you switch the order of the stopword and synonym_graph filters, you still get the illegal_argument_exception as expected (the Lucene bug has not been fixed). HOWEVER, with the filters in the new order, the workaround described above does not work:

This is a known issue in Lucene and we're currently discussing different options for the fix: https://issues.apache.org/jira/browse/LUCENE-8137 The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

If, in the example above, you put the synonym graph filter AFTER the stopwords filter AND manually remove stopwords from the synonyms (i.e. now synonyms=["world war, wow"]), then a query with "world of war" CANNOT match text with "world of war". Did I misunderstand the workaround? (That's very likely because I imagine lots of people use synonym_graph with stopwords.)

Thanks in advance!

(PS: the reason I need to put synonym_graph AFTER stopwords is that the stopwords are case sensitive whereas the synonyms are not case sensitive)

If helpful, here are the requests I'm running:

PUT /test-xxx
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter",
                  "synonym_graph_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
     "properties":{  
        "title":{  
           "type":"text",
           "analyzer":"english_analyzer",
           "search_analyzer":"english_search_analyzer"
        }
     }
   }
}

POST _bulk
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"wow" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war. wow" }

GET /test-xxx/_search
{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

DELETE /test-xxx

jimczi commented 4 years ago

I am reopening this issue since it's a long-standing bug and it's not resolved in Lucene. The only workaround that works at the moment is to not use stop words, at index and query time. You can define rules with and without stop words, for instance: "world of war, world war, wow" should match all variations. Removing terms in a filter before or after the synonym graph should be avoided until the bug is resolved. We want to solve this situation but it is not likely to happen before a major release considering the changes that are required on the analysis chain.
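
For longer synonym lists, the "with and without stop words" rules could be generated rather than written by hand. The following is a minimal sketch, assuming rules are comma-separated strings as in the examples above; `expand_rule` and the small `STOPWORDS` set are hypothetical and not part of Elasticsearch.

```python
STOPWORDS = {"of", "the", "a", "an", "and"}  # assumption: illustrative subset only


def expand_rule(rule, stopwords=STOPWORDS):
    """Add a stop-word-free variant of each term in a comma-separated synonym rule."""
    expanded = []
    for term in (t.strip() for t in rule.split(",")):
        stripped = " ".join(w for w in term.split() if w.lower() not in stopwords)
        # Keep the original term first, then its stripped variant if it differs.
        for variant in (term, stripped):
            if variant and variant not in expanded:
                expanded.append(variant)
    return ", ".join(expanded)


print(expand_rule("world of war, wow"))  # world of war, world war, wow
```

The generated rule would then go into the "synonyms" list of a synonym_graph filter in an analyzer that has no stop filter at index or query time.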

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)