mboynes opened 1 year ago
If I'm understanding the discussion on the linked PR correctly, it looks like a full solution hasn't been merged upstream yet.
@rebeccahum correct. And it's not yet clear whether the likely response (adding `default_search` analyzers separate from the default indexing analyzers) will actually address the issue, or just work around it for synonyms, the only "out-of-the-box" feature a user can enable. Separating the search and index analyzers still requires changing this setting for indexing operations. At the risk of being a broken record: the current configuration is invalid, and it creates a problem regardless of whether or not something else (such as search-as-you-type, synonyms, highlighting, etc.) surfaces it.
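For context, the separation being discussed relies on Elasticsearch letting you define a `default_search` analyzer alongside the `default` one, so the search-time chain can diverge from the index-time chain. A minimal sketch of what that might look like in index settings (all filter and analyzer contents here are illustrative, not the plugin's actual configuration):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": { "type": "word_delimiter_graph" },
        "my_synonyms": { "type": "synonym_graph", "synonyms": ["wp, wordpress"] }
      },
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_word_delimiter"]
        },
        "default_search": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}
```

Even with this split, whatever the `default` (indexing) analyzer does still applies at index time, which is why the split alone can't fix an invalid indexing configuration.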
VIP: this was submitted as a PR upstream, though it was closed. I would encourage VIP to adopt the change in your fork regardless.
Description
This removes the `preserve_original` option for the `word_delimiter_graph` token filter. As noted in the ES docs, as it was set, this was producing multi-position tokens, which could lead to unexpected results and, potentially, indexing errors.

Where I observed this specifically was using a `search_as_you_type` field. This field type uses shingle token filters to create 2- and 3-gram sub-fields. Combined with `word_delimiter_graph.preserve_original = true`, if the field text is a word like "WordPress" and the analyzed token count is <= the gram size, the tokens can end up with a negative token position and indexing fails.

For what it's worth, I tried using `flatten_graph` as the docs suggest as a workaround, and that didn't work 🤷.

Checklist
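The multi-position tokens are easy to see with the `_analyze` API. This request is my own minimal reconstruction of the problematic chain, not the plugin's actual analyzer definition:

```json
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    { "type": "word_delimiter_graph", "preserve_original": true }
  ],
  "text": "WordPress"
}
```

With `preserve_original: true`, the response should include the preserved original token `wordpress` at position 0 spanning multiple positions (via `positionLength`), alongside `word` (position 0) and `press` (position 1). As I understand it, it's this token graph that the shingle filters backing the `search_as_you_type` sub-fields mishandle.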
Please make sure the items below have been covered before requesting a review:
Steps to Test
To replicate the error in isolation:
Observe that when removing `"preserve_original": true`, the document indexes as expected and the position is correctly calculated as 0, not -1.
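A minimal reproduction along these lines might look like the following, in Kibana Dev Tools syntax (the index, analyzer, and filter names are mine and purely illustrative):

```console
PUT /wdg-repro
{
  "settings": {
    "analysis": {
      "filter": {
        "wd_graph": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "wd_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "wd_graph"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type",
        "analyzer": "wd_analyzer"
      }
    }
  }
}

PUT /wdg-repro/_doc/1
{ "title": "WordPress" }
```

Per the description above, the second request should fail while `preserve_original` is `true`; removing that line from the filter definition and recreating the index should let the document index normally.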