Open byronvoorbach opened 6 years ago

Hi there,

In the process of upgrading one of my clients from ES 5.5.1 to ES 6.2.3, I ran into an issue when trying to create an index in ES 6. I worked out a small snippet to highlight my issue:

Generates the following error:

The actual error is in my custom synonym file, but I managed to reproduce it with a single term. If I remove `delimiter_search` from the analyzer, there are no problems creating the index. The above works in ES 5.5.1.
Thanks for your report! The root cause seems to be that since 6.0, the `synonym_graph` filter tokenizes synonyms with the tokenizers / token filters that appear before it in the chain (see docs). As the `word_delimiter_graph` filter is set to `catenate_all = true`, the above error happens.
Honestly, I am not sure whether the behaviour here is intended or not, hence deferring to the search / aggs team for a definitive answer.
Pinging @elastic/es-search-aggs
@romseygeek could you take a look at this one?
@danielmitterdorfer is correct. We now use the preceding tokenizer chain to analyze terms in the synonym map, and `word_delimiter_graph` is producing multiple tokens at the same position, which the map builder doesn't know how to handle.

In the case above, removing the `term1=>term` mapping should still work, because the delimiter filter is in effect already doing exactly that: `term1` produces `term`, `1` and `term1`. For other entries you may need to reduce the left-hand side of the mapping down to just the part of the term that the delimiter filter outputs.
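To see what the delimiter actually emits (and therefore what a rule's left-hand side needs to be reduced to), the `_analyze` API can be run with an inline filter definition. A minimal sketch, assuming only the `catenate_all` setting mentioned above:

```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_all": true
    }
  ],
  "text": "term1"
}
```

For `term1` this returns `term`, `1` and the catenated `term1`, so a rule whose left-hand side is just `term` matches what actually ends up in the index.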
Thank you for your reply @romseygeek
Synonyms are created by people in the organization and loaded into a new index every 3 hours. Since we're not in full control of this (huge) file, and the people who enter the synonyms don't have any knowledge of Elasticsearch internals, it's hard to filter out these synonyms before creating an index. This is currently an automated process. Is there any other way to fix this, other than having to delete the synonyms?
> Is there any other way to fix this, other than having to delete the synonyms?
I think it would be possible to extend the `SynonymMap` parsing so that it could handle graph token streams, but it wouldn't be simple. The other immediate workaround would be to see if you really need to have the word delimiter filter in there.
I have exactly the same problem when using stopwords + synonym graph.
Is this issue related to #30968?
Yes, I think #30968 will fix this, or at least provide a workaround for cases where it's difficult to control and/or sanitise the synonyms list.
@romseygeek Not sure if I should add this to this issue or not, but I think just adding a flag to ignore exceptions doesn't fully cover the problem introduced by this check.
Take the following setup (stripped down version of actual production mapping in ES 6.3.2):
```json
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "split_on_numerics": "true",
          "preserve_original": "true"
        },
        "word_breaks": {
          "type": "synonym",
          "synonyms": [
            "snowboard,snow board=>snow_board"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_index": {
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "lowercase",
            "delimiter",
            "word_breaks"
          ]
        },
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "lowercase",
            "delimiter",
            "word_breaks"
          ]
        }
      }
    }
  }
}
```
Trying to create this index fails with the following error: `term: snow_board analyzed to a token (snow) with position increment != 1 (got: 0)`. Does it make sense for the new parsing to also apply tokenization to synonyms on the right-hand side of the arrow?

Note that this setup is without the graph versions of the delimiter and synonym filters.
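A quick way to see why the right-hand side trips the parser is to run it through the pre-synonym part of the chain via `_analyze`; a sketch using an inline definition that mirrors the `delimiter` filter above:

```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "split_on_numerics": "true",
      "preserve_original": "true"
    }
  ],
  "text": "snow_board"
}
```

`snow_board` comes out as several tokens stacked at the same starting position (the preserved original plus the split and catenated parts), rather than the single token the synonym parser expects.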
> Trying to create this index fails with the following error: `term: snow_board analyzed to a token (snow) with position increment != 1 (got: 0)`. Does it make sense for the new parsing to also apply tokenization to synonyms on the right-hand side of the arrow?
It is required, otherwise you'll index terms (`snow_board`) that you cannot search. I think that your problem here is different: you want to apply a `word_delimiter` and a `synonym` filter in the same chain, but they don't work well together. The `synonym` and `synonym_graph` filters are not able to properly consume a stream that contains multiple terms at the same position (that's what the `word_delimiter` produces when `preserve_original` is set to true). You'll need to make sure that your synonym rules already contain delimited input/output.
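As a sketch of what "already delimited" rules might look like for the snippet above (the choice of `snowboard` as the target term is an assumption, not from the original comment):

```json
"word_breaks": {
  "type": "synonym",
  "synonyms": [
    "snowboard,snow board=>snowboard"
  ]
}
```

Every side of this rule analyzes to single tokens per position, and the `snow_board` spelling no longer needs to appear in the rule at all, since the delimiter itself already splits and catenates it to `snowboard`.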
Regarding the `lenient` option, it works fine in this case: it ignores the `snow_board` rule when it is set to true and fails with an exception if false.
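For reference, a minimal sketch of where the `lenient` flag goes in the snippet above:

```json
"word_breaks": {
  "type": "synonym",
  "lenient": true,
  "synonyms": [
    "snowboard,snow board=>snow_board"
  ]
}
```

With `lenient: true` index creation succeeds, but the offending rule is silently dropped rather than applied.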
This issue also occurs when you have a filter like `hunspell` which can, for some words, produce multiple variants of the same token. In our case, using the `nl_NL` locale for `hunspell` and the alias rule `fiets, stalen ros`, this completely breaks even though `stalen ros` are valid tokens that `hunspell` doesn't remove. Instead it adds additional tokens (`stal` and `staal`) into the stream.
This check then doesn't prevent invalid unsearchable tokens but instead prevents a perfectly valid synonym from being used.
Also please note that with index-time synonyms it is quite common for people to use a different `search_analyzer` without the synonyms, which can produce different tokens, so our assumptions about what you cannot search can be wrong.
Reproduction:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stem",
            "synonyms"
          ]
        }
      },
      "filter": {
        "stem": {
          "type": "hunspell",
          "locale": "nl_NL"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "fiets, stalen ros"
          ]
        }
      }
    }
  }
}
```
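If acceptable for the use case, the `multiplexer` approach suggested further down in this thread might be adapted here, keeping `hunspell` and the synonyms in separate branches so the stacked stemmer variants never reach the synonym parser. A sketch under that assumption:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "split"
          ]
        }
      },
      "filter": {
        "stem": {
          "type": "hunspell",
          "locale": "nl_NL"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "fiets, stalen ros"
          ]
        },
        "split": {
          "type": "multiplexer",
          "filters": [
            "stem",
            "synonyms"
          ]
        }
      }
    }
  }
}
```

Note that this changes the produced tokens, as later comments in this thread point out.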
@HonzaKral I think I chose a poor example, as my actual problem occurs with the Dutch language. @jimczi This is a much better explanation of the same issue I've been having. This change in synonym analysis limits the way synonyms can be used.
The problem is not limited to tokenization; it occurs when there is any preceding graph filter. For example, if the synonyms have been broken into multiple files:

layer_1.txt: `dog => dog, canine`
layer_2.txt: `dogfood, dog food`

then you will encounter the same error without the `lenient` option. With `lenient`, the decompounding rule is not added but silently ignored. This example is a little artificial, but that's my 2 cents.
```json
PUT test/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_synonyms": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding",
            "synonym_layer_1",
            "flatten_graph",
            "synonym_layer_2"
          ],
          "tokenizer": "whitespace"
        }
      },
      "filter": {
        "synonym_layer_1": {
          "type": "synonym_graph",
          "synonyms_path": "layer_1.txt"
        },
        "synonym_layer_2": {
          "type": "synonym_graph",
          "synonyms_path": "layer_2.txt"
        }
      }
    }
  }
}
```
Using the `multiplexer` filter might help here, I think. If we want to apply both `word_delimiter` and `synonyms`, but avoid them interacting with each other, we can put them into separate branches; rewriting the settings in the opening post yields this:
```json
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true",
          "adjust_offsets": "false"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "test1=>test"
          ]
        },
        "split": {
          "type": "multiplexer",
          "filters": [
            "delimiter_search,lowercase",
            "lowercase,synonyms"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "split"
          ]
        }
      }
    }
  }
}
```
Note that we need to set `adjust_offsets` to `false` in the `delimiter_search` filter, as otherwise we end up with backwards offsets. This happily tokenizes `test1` into the tokens `test1`, `test`, `1` and `test`, the last one being the synonym.
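A usage sketch to verify the token output, assuming the `PUT test` request above succeeded:

```json
GET test/_analyze
{
  "analyzer": "match_analyzer_search",
  "text": "test1"
}
```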
Did somebody find a solution for this? This still seems to be a problem with ES 7.1.
@dkln did you try the solution using the `multiplexer` detailed above?
Yes, but that didn't seem to work. I ended up using a `char_filter` instead.
In my case I could resolve the errors by setting `"lenient": true` for the synonym filter and `"adjust_offsets": false` for the delimiter filter; I did not need the `multiplexer`.

Before, I got this error (with only `"lenient": true`):

`startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=62,endOffset=69,lastStartOffset=63 for field ...`

or, without `lenient`:

`term: analyzed (...) to a token (...) with position increment != 1 (got: 2)`
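Put together, the two settings look roughly like this (the filter names and synonyms path are placeholders, not from the original comment):

```json
"filter": {
  "my_delimiter": {
    "type": "word_delimiter_graph",
    "catenate_all": true,
    "adjust_offsets": false
  },
  "my_synonyms": {
    "type": "synonym_graph",
    "lenient": true,
    "synonyms_path": "synonyms.txt"
  }
}
```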
I am having the same problems when upgrading from ES 5 to ES 7: `[word_delimiter] cannot be used to parse synonyms`. I think in ES 5 we got warnings, but it stops working in ES 7. Does anyone have a solution?
The multiplexer solution @romseygeek mentioned will work but generates different tokens.
Got the same problem after upgrading from 6.5 to 7. In 6.5 it works as expected. The documentation does not cover this case.

I tried the multiplexer solution and got a lot of: `IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=*,endOffset=*,lastStartOffset=* for field ...`
One more thing I have noticed in my case (I don't know if I should open a different issue for it) is that this new "rule" is applied regardless of whether the analyzer with the invalid filter chain is actually used by any indexed field.

I have two different analyzers in my index settings: one with an invalid filter order (a `synonym_graph` filter after a multi-token-producing filter) and one with a valid order (the same filters reversed, so that the `synonym_graph` filter comes before the multi-token-producing filter). Even if I don't use the invalid analyzer in any indexed field (and only use the valid one), I still get the "analyzed to a token (xxxx) with position increment != 1 (got: 0)" error when trying to create the index.

Only if I reverse the filter order in all analyzers (even the unused one), or completely remove the unused analyzer, does index creation succeed. This definitely looks like a bug.
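A minimal sketch of the situation described, with hypothetical names: only `good_analyzer` is referenced in the mappings, yet index creation still fails because of `bad_analyzer`:

```json
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        },
        "syns": {
          "type": "synonym_graph",
          "synonyms": [
            "test1 => test"
          ]
        }
      },
      "analyzer": {
        "bad_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "delimiter", "syns" ]
        },
        "good_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "syns", "delimiter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "good_analyzer"
      }
    }
  }
}
```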
> Is there any other way to fix this, other than having to delete the synonyms?
>
> I think it would be possible to extend the SynonymMap parsing so that it could handle graph tokenstreams, but it wouldn't be simple. The other immediate workaround would be to see if you really need to have the word delimiter filter in there.
@romseygeek after a lot of digging, I believe that this is quite an important feature, and implementing it would solve many synonym-related problems.

I had been banging my head trying to understand why the synonym graph filter produces wrong position increments when applied after a custom filter that produces multiple tokens at the same position (stacked positions), before reading the "Because entries in the synonym map cannot have stacked positions, ..." note at the bottom of the synonym token filter documentation.

The proposed solution (using a multiplexer with two branches) does not work in my case, since I want the synonym to be applied to one of the new tokens my custom filter inserts into the token stream (not the original token that would go to the synonyms multiplexer branch). Furthermore, it would be legitimate to want to use, e.g., two consecutive `synonym_graph` filters (each with a different set of rules, for business-logic reasons).

The bottom line is that the `synonym_graph` filter should be able to consume graphs and stacked positions; otherwise its use is very limited.

EDIT: Do you think it would make sense to open a separate issue for this (if there isn't one already), so that it can be properly addressed?
Pinging @elastic/es-search-relevance (Team:Search Relevance)