Automattic / ElasticPress

A fast and flexible search and query engine for WordPress.
https://elasticpress.io
GNU General Public License v2.0

Fixes bugs from word delimiter graph filter #184

Open mboynes opened 1 year ago

mboynes commented 1 year ago

VIP: This was PRed upstream, though it was closed. I would encourage VIP to adopt this change in your fork regardless.


Description

This removes the preserve_original option for the word_delimiter_graph token filter. As noted in the ES docs:

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

As configured, this filter was producing multi-position tokens, which could lead to unexpected results and, potentially, indexing errors. I observed this specifically with a search_as_you_type field, which uses shingle token filters to create 2- and 3-gram sub-fields. Combined with word_delimiter_graph's preserve_original = true, if the field text is a single word like "WordPress" and the analyzed token count is <= the gram size, the shingled tokens can end up with a negative position and indexing fails.

For what it's worth, I tried using flatten_graph as the docs suggest as a workaround, and that didn't work 🤷.
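
For reference, the workaround I attempted followed the docs' suggestion of appending flatten_graph after the delimiter filter, roughly like this sketch (analyzer and filter names are illustrative, mirroring the demo below):

"analysis": {
  "filter": {
    "word_delimiter": {
      "type": "word_delimiter_graph",
      "preserve_original": true
    }
  },
  "analyzer": {
    "default": {
      "tokenizer": "standard",
      "filter": [
        "word_delimiter",
        "flatten_graph"
      ]
    }
  }
}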


Steps to Test

To replicate the error in isolation:

PUT sayt-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": {
        "type": "search_as_you_type"
      }
    }
  }
}

PUT sayt-demo/_doc/1?refresh
{
  "test": "WordPress"
}

RESPONSE:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
  },
  "status" : 400
}

GET /sayt-demo/_analyze
{
  "field": "test._2gram",
  "text": "WordPress"
}

RESPONSE:
{
  "tokens" : [
    {
      "token" : "Word Press",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "shingle",
      "position" : -1
    }
  ]
}

Observe that after removing "preserve_original": true, the document indexes as expected and the position is correctly calculated as 0, not -1.
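
For clarity, the fixed settings are identical to the demo above except that preserve_original is dropped (equivalently, left at its default of false):

PUT sayt-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": {
        "type": "search_as_you_type"
      }
    }
  }
}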

rebeccahum commented 1 year ago

If I'm understanding the discussion on the linked PR, looks like a full solution hasn't been merged upstream yet.

mboynes commented 1 year ago

@rebeccahum correct. And it's unclear as yet whether the likely response (adding default_search analyzers separate from the default indexing analyzers) will actually address the issue, or just work around it for synonyms, the only "out-of-the-box" feature a user can enable. Even with search and index analyzers separated, this setting still has to change for indexing operations. At the risk of sounding like a broken record: the current configuration is invalid, and it creates a problem regardless of whether something else (search-as-you-type, synonyms, highlighting, etc.) surfaces it.
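
For anyone following along, the "separate analyzers" approach under discussion would look roughly like this sketch: Elasticsearch applies analysis.analyzer.default at index time and analysis.analyzer.default_search at query time. Names beyond "default" and "default_search" are illustrative, and note that the index-time analyzer would still need preserve_original removed, per this PR:

"analysis": {
  "filter": {
    "word_delimiter_index": {
      "type": "word_delimiter_graph"
    },
    "word_delimiter_search": {
      "type": "word_delimiter_graph",
      "preserve_original": true
    }
  },
  "analyzer": {
    "default": {
      "tokenizer": "standard",
      "filter": [ "word_delimiter_index" ]
    },
    "default_search": {
      "tokenizer": "standard",
      "filter": [ "word_delimiter_search" ]
    }
  }
}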