elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

'Failed to build synonyms' when using delimiter_graph + synonym_graph - 6.2.3 #29426

Open byronvoorbach opened 6 years ago

byronvoorbach commented 6 years ago

Hi there,

In the process of upgrading one of my clients from ES 5.5.1 to ES 6.2.3, I ran into an issue when trying to create an index in ES 6. I worked out a small snippet to illustrate the problem:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "test1=>test"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter_search",
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  }
}

This generates the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: test1 analyzed to a token (test) with position increment != 1 (got: 0)"
      }
    }
  },
  "status": 400
}

The actual error comes from my custom synonym file, but I managed to reproduce it with a single term. If I remove delimiter_search from the analyzer, the index is created without problems. The above works fine in ES 5.5.1.

danielmitterdorfer commented 6 years ago

Thanks for your report! The root cause seems to be that since 6.0, the synonym_graph filter tokenizes synonyms with the tokenizers / token filters that appear before it in the chain (see docs). Because word_delimiter_graph is set to catenate_all = true, the above error occurs.
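
As an illustration (a minimal sketch; the exact token output is assumed from the word_delimiter_graph defaults), the stacked tokens can be inspected with the standalone _analyze API, no index required:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "word_delimiter_graph", "catenate_all": true }
  ],
  "text": "test1"
}

This should return test1, test and 1, with the catenated test1 and test sharing the same position, which is exactly the kind of stacked stream the synonym rule parser refuses.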

Honestly, I am not sure whether the behaviour here is intended or not, so I am deferring to the search / aggs team for a definitive answer.

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

colings86 commented 6 years ago

@romseygeek could you take a look at this one?

romseygeek commented 6 years ago

@danielmitterdorfer is correct. We now use the preceding tokenizer chain to analyze terms in the synonym map, and word_delimiter_graph is producing multiple tokens at the same position, which the map builder doesn't know how to handle.

In the case above, removing the test1=>test mapping should still work, because the delimiter filter is in effect already doing exactly that: test1 produces test, 1 and test1. For other entries you may need to reduce the left-hand side of the mapping down to just the part of the term that the delimiter filter outputs.

byronvoorbach commented 6 years ago

Thank you for your reply @romseygeek

Synonyms are created by people in the organization and loaded into a new index every 3 hours. Since we're not in full control of this (huge) file, and the people who enter the synonyms have no knowledge of Elasticsearch internals, it's hard to filter out these rules before creating an index; the whole process is automated. Is there any other way to fix this, other than having to delete the synonyms?

romseygeek commented 6 years ago

Is there any other way to fix this, other than having to delete the synonyms?

I think it would be possible to extend the SynonymMap parsing so that it could handle graph tokenstreams, but it wouldn't be simple. The other immediate workaround would be to see if you really need to have the word delimiter filter in there.

Kalle12345 commented 6 years ago

I have exactly the same problem when using stopwords + synonym_graph.
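
A sketch of the kind of chain being described (filter names and the rule are made up for illustration): if a stop filter precedes the synonym_graph filter, any rule whose text contains a stopword is parsed with a position gap and triggers the same "position increment != 1" failure on these versions:

PUT stopword_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "the" ]
        },
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "the cat => feline" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "my_stop", "my_synonyms" ]
        }
      }
    }
  }
}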

sohaibiftikhar commented 6 years ago

Is this issue related to #30968?

romseygeek commented 6 years ago

Yes, I think #30968 will fix this

romseygeek commented 6 years ago

Or at least provide a workaround for cases where it's difficult to control and/or sanitise the synonyms list.

byronvoorbach commented 6 years ago

@romseygeek Not sure if I should add this to this issue or not, but I think just adding a flag to ignore exceptions doesn't fully cover the problem introduced by this check.

Take the following setup (a stripped-down version of an actual production mapping in ES 6.3.2):

PUT test
{
   "settings": {
    "analysis": {
      "filter": {
        "delimiter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "split_on_numerics": "true",
          "preserve_original": "true"
        },
        "word_breaks": {
          "type": "synonym",
          "synonyms": [
            "snowboard,snow board=>snow_board"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_index": {
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "lowercase",
            "delimiter",
            "word_breaks"
          ]
        },
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "lowercase",
            "delimiter",
            "word_breaks"
          ]
        }
      }
    }
  }
}

Trying to create this index fails with the following error: "term: snow_board analyzed to a token (snow) with position increment != 1 (got: 0)". Does it make sense for the new parsing to also apply tokenization to synonyms on the right-hand side of the arrow? Note that this setup doesn't even use the graph versions of the delimiter & synonym filters.

jimczi commented 6 years ago

Trying to create this index fails with the following error: "term: snow_board analyzed to a token (snow) with position increment != 1 (got: 0)". Does it make sense for the new parsing to also apply tokenization to synonyms on the right-hand side of the arrow?

It is required, otherwise you would index terms (snow_board) that you cannot search. I think your problem here is different: you want to apply a word_delimiter and a synonym filter in the same chain, but they don't work well together. The synonym and synonym_graph filters are not able to properly consume a stream that contains multiple terms at the same position (which is what word_delimiter produces when preserve_original is set to true). You'll need to make sure that your synonym rules already contain delimited input/output. Regarding the lenient option, it works fine in this case: it ignores the snow_board rule when set to true and fails with an exception when false.
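
For reference, a minimal sketch of the lenient option applied to the word_breaks filter from the comment above (everything else unchanged); with "lenient": true the snow_board rule is silently dropped instead of failing index creation:

"word_breaks": {
  "type": "synonym",
  "lenient": true,
  "synonyms": [
    "snowboard,snow board=>snow_board"
  ]
}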

honzakral commented 6 years ago

This issue also occurs when you have a filter like hunspell, which can, for some words, produce multiple variants of the same token. In our case, using the nl_NL locale for hunspell and the synonym rule fiets, stalen ros, this completely breaks, even though stalen and ros are valid tokens that hunspell doesn't remove; it only adds additional tokens (stal and staal) to the stream.

This check therefore doesn't prevent invalid, unsearchable tokens; instead it prevents a perfectly valid synonym from being used.

Also, please note that with index-time synonyms it is quite common for people to use a different search_analyzer without the synonyms, which can produce different tokens, so the assumptions about what you cannot search can be wrong.

reproduction:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "test": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stem",
            "synonyms"
          ]
        }
      },
      "filter": {
        "stem": {
          "type": "hunspell",
          "locale": "nl_NL"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "fiets, stalen ros"
          ]
        }
      }
    }
  }
}

byronvoorbach commented 6 years ago

@HonzaKral I think I chose a poor example, as my actual problem occurs with the Dutch language. @jimczi This is a much better explanation of the same issue I've been having. This change in synonym analysis limits the ways synonyms can be used.

ehaubert commented 6 years ago

The problem is not limited to tokenization; it occurs when there is any preceding graph filter. For example, if the synonyms have been broken into multiple files:

layer_1.txt: dog => dog, canine

layer_2.txt: dogfood, dog food

Then you will encounter the same error without the "lenient" option. With lenient, the decompounding rule is not added but silently ignored. This example is a little artificial, but that's my 2 cents.

PUT test/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_synonyms": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding",
            "synonym_layer_1",
            "flatten_graph",
            "synonym_layer_2"
          ],
          "tokenizer": "whitespace"
        }
      },
      "filter": {
        "synonym_layer_1": {
          "type": "synonym_graph",
          "synonyms_path": "layer_1.txt"
        },
        "synonym_layer_2": {
          "type": "synonym_graph",
          "synonyms_path": "layer_2.txt"
        }
      }
    }
  }
}

romseygeek commented 5 years ago

Using the multiplexer filter might help here, I think. If we want to apply both word_delimiter and synonyms but avoid them interacting with each other, we can put them into separate branches; rewriting the settings from the opening post yields this:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true",
          "adjust_offsets" : "false"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "test1=>test"
          ]
        },
        "split" : {
          "type" : "multiplexer",
          "filters" : [
            "delimiter_search,lowercase", 
            "lowercase,synonyms"]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "split"
          ]
        }
      }
    }
  }
}

Note that we need to set adjust_offsets to false in the delimiter_search filter, as otherwise we end up with backwards offsets. This happily tokenizes test1 into the tokens test1, test, 1 and test, the last one being the synonym.
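
As a quick check (a sketch, assuming the index was created with the settings above), the output can be verified with:

GET test/_analyze
{
  "analyzer": "match_analyzer_search",
  "text": "test1"
}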

dkln commented 5 years ago

Did somebody find a solution for this? It still seems to be a problem in ES 7.1.

romseygeek commented 5 years ago

@dkln did you try the solution using the multiplexer detailed above?

dkln commented 5 years ago

Yes, but that didn't seem to work. I ended up using a char_filter.

PutziSan commented 5 years ago

In my case I could resolve the errors by setting "lenient": true for the synonym filter and "adjust_offsets": false for the delimiter filter; I did not need the multiplexer.

Before, with only "lenient": true, I got the error:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=62,endOffset=69,lastStartOffset=63 for field ...

or without lenient:

term: analyzed (...) to a token (...) with position increment != 1 (got: 2)
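
For reference, a minimal sketch of those two settings applied to the filters from the opening post (filter and analyzer names are taken from that snippet); note that rules whose left-hand side still analyzes to stacked tokens are silently dropped when lenient is true:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": true,
          "adjust_offsets": false
        },
        "synonyms": {
          "type": "synonym_graph",
          "lenient": true,
          "synonyms": [
            "test1=>test"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter_search",
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  }
}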

YuanyeZ commented 5 years ago

I am having the same problem after upgrading from ES 5 to ES 7: [word_delimiter] cannot be used to parse synonyms. I think in ES 5 we only got warnings, but it stops working in ES 7. Does anyone have a solution? The multiplexer solution @romseygeek mentioned works, but it generates different tokens.

kesarevs commented 5 years ago

Got the same problem after upgrading from 6.5 to 7. In 6.5 it worked as expected, and the documentation does not cover this case.

Tried the multiplexer solution and got a lot of: IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=*,endOffset=*,lastStartOffset=* for field ...

nemphys commented 3 years ago

One more thing I have noticed in my case (I don't know if I should open a different issue for it) is that this new "rule" is applied regardless of whether the analyzer with the invalid filter chain is actually used in any indexed fields.

I have 2 different analyzers in my index settings, one with an invalid filter order (the synonym_graph filter after a multi-token-producing filter) and one with a valid order (the same filters, but reversed, so that the synonym_graph filter comes before the multi-token-producing filter). Even if I don't use the invalid analyzer in any indexed fields (and only use the valid one), I still get the "analyzed to a token (xxxx) with position increment != 1 (got: 0)" error when trying to create the index.
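
A stripped-down sketch of the kind of setup described (analyzer and filter names are hypothetical, and word_delimiter_graph stands in for the custom multi-token filter); only valid_order is referenced in the mapping, yet the request still fails because invalid_order is validated at index-creation time:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "multi_token": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        },
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "test1=>test" ]
        }
      },
      "analyzer": {
        "invalid_order": {
          "tokenizer": "whitespace",
          "filter": [ "multi_token", "my_synonyms" ]
        },
        "valid_order": {
          "tokenizer": "whitespace",
          "filter": [ "my_synonyms", "multi_token" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "valid_order" }
    }
  }
}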

Only if I reverse the filter order in all analyzers (even the unused one), or completely remove the unused analyzer, does the index creation succeed.

This definitely looks like a bug.

nemphys commented 3 years ago

Is there any other way to fix this, other than having to delete the synonyms?

I think it would be possible to extend the SynonymMap parsing so that it could handle graph tokenstreams, but it wouldn't be simple. The other immediate workaround would be to see if you really need to have the word delimiter filter in there.

@romseygeek after a lot of digging, I believe this is quite an important feature and implementing it would solve many synonym-related problems.

I have been banging my head trying to understand why the synonym graph filter produces wrong position increments when applied after a custom filter that produces multiple tokens at the same position (stacked positions), before finally reading the "Because entries in the synonym map cannot have stacked positions, ..." note at the bottom of the synonym token filter documentation.

The proposed solution (using a multiplexer with 2 branches) does not work in my case, since I want the synonym to be applied to one of the new tokens my custom filter inserts into the token stream (not the original token that would go to the synonyms multiplexer branch). Furthermore, it would be legitimate to want to use e.g. 2 consecutive synonym_graph filters (with a different set of rules each, for business-logic reasons).

The bottom line is that the synonym_graph filter should be able to consume graphs and stacked positions, otherwise its use is very limited.

EDIT: Do you think that it would make sense to open a separate issue for this (if there isn't already one), so that it can be properly addressed?

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)