elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fast Vector Highlighter appears to fail under certain circumstances (possible regression) #107352

Closed Mrodent closed 4 months ago

Mrodent commented 5 months ago

Elasticsearch Version

8.13.1

Installed Plugins

No response

Java Version

bundled

OS Version

W10

Problem Description

In 7.10.2, the code under Steps to Reproduce works with the Fast Vector Highlighter to produce text highlighted beautifully in multiple colours, a different colour for each word in the search string. In 8.13.1, the same code (after the necessary slight modification of the query grammar) works only partially.

Steps to Reproduce

Given this mapping, using 7.10.2, this works:

       mappings = \
        {
          "mappings": {
            "properties": {
              "esdoc_text": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "fields": {
                  "stemmed": {
                    "type": "text",
                    "analyzer": "english",
                    "term_vector": "with_positions_offsets",
                  }
                }
              }
            }
          }
        }

and this query DSL dict:

       data = \
        {
          'query': {
            'simple_query_string': {
              'query': self.search_string,
              'fields': [
                self.text_field
              ]
            }
          },
          'highlight': {
            'fields': {
              self.text_field: {
               'type': 'fvh',
               'pre_tags': [
                    '<span style="background-color: yellow">',
                    '<span style="background-color: skyblue">', 
                    '<span style="background-color: lightgreen">', 
                    '<span style="background-color: plum">', 
                    '<span style="background-color: lightcoral">', 
                    '<span style="background-color: silver">',
                ],
               'post_tags': [
                    '</span>', '</span>', '</span>', 
                    '</span>', '</span>', '</span>', 
                ]
              }
            },
            'number_of_fragments': 0
          }
        }

... it delivers beautiful multi-coloured highlighting (highlighting in different colours for each term comprising the query string, even when using stemming)

In 8.13.1 it appears necessary to add the field "matched_fields". This then works.

'highlight': {
    'fields': {
        'text_content.stemmed': {
            'matched_fields': ['text_content.stemmed'],
            'type': 'fvh',
            'pre_tags' : ['<span style="background-color: yellow">', 
                '<span style="background-color: skyblue">', ], 
            'post_tags' : ['</span>', '</span>', ],
        }
    },
}

However... according to my experiments, it is not possible to stipulate more than four colours! If I include 5 colours (with 5 closing span tags, obviously) like so...

'highlight': {
    'fields': {
        'text_content.stemmed': {
            'matched_fields': ['text_content.stemmed'],
            'type': 'fvh',
            'pre_tags' : ['<span style="background-color: yellow">', 
                '<span style="background-color: skyblue">', 
                '<span style="background-color: plum">', 
                '<span style="background-color: lightgreen">', 
                '<span style="background-color: blue">', ], 
            'post_tags' : ['</span>', '</span>',  '</span>',  '</span>',  '</span>', ],
        }
    },
}

... I then get this error:

{
  "error": {
    "root_cause": [
      {
        "type": "x_content_parse_exception",
        "reason": "[1:496] [highlight_field] failed to parse field [post_tags]"
      }
    ],
    "type": "x_content_parse_exception",
    "reason": "[1:496] [highlight] failed to parse field [fields]",
    "caused_by": {
      "type": "x_content_parse_exception",
      "reason": "[1:496] [fields] failed to parse field [text_content.stemmed]",
      "caused_by": {
        "type": "x_content_parse_exception",
        "reason": "[1:496] [highlight_field] failed to parse field [post_tags]",
        "caused_by": {
          "type": "json_e_o_f_exception",
          "reason": "Unexpected end-of-input in VALUE_STRING\n at [Source: (byte[])\"{\"query\": {\"simple_query_string\": {\"query\": \"linux one two\", \"fields\": [\"text_content.stemmed\"]}}, \"highlight\": {\"fields\": {\"text_content.stemmed\": {\"matched_fields\": [\"text_content.stemmed\"], \"type\": \"fvh\", \"pre_tags\": [\"<span style=\\\"background-color: yellow\\\">\", \"<span style=\\\"background-color: skyblue\\\">\", \"<span style=\\\"background-color: lightgreen\\\">\", \"<span style=\\\"background-color: plum\\\">\", \"<span style=\\\"background-color: blue\\\">\"], \"post_tags\": [\"</span>\", \"</span>\", \"</span>\", \"</sp\"[truncated 3 bytes]; line: 1, column: 504]"
        }
      }
    }
  },
  "status": 400
}

... is this intended? Documented? Can it be changed? As shown, I use 6 colours with the 7.10.2 version.

I haven't yet looked at the source code but I wonder whether this is due to something a bit basic, like a limitation on the total length of the strings involved in "pre_tags" and "post_tags"? I see the error says "truncated 3 bytes" ...

NB I also see some odd highlighting behaviour: intermittently, when I have more than 4 words in the query text (but 4 highlighting colours stipulated), only 3 colours are used. So the 4th word will be highlighted with the 1st colour, the 5th word with another, e.g. the 2nd.

However a 4-word query text is always (seemingly) nicely coloured with all 4 colours.

The ideal result would be, with 4 colours given, that a 5th word in the query string should be coloured again with the 1st colour, a 6th word with the 2nd colour, etc., i.e. that it should iterate repeatedly through the colours simply, predictably, deterministically.
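The round-robin assignment described above can be sketched as follows. This only illustrates the expected behaviour; it is not how the Fast Vector Highlighter is implemented, and `assign_tags` is a hypothetical helper written for this example.

```python
def assign_tags(terms, pre_tags):
    """Pair each query term with a tag, wrapping around when terms outnumber tags."""
    return [(term, pre_tags[i % len(pre_tags)]) for i, term in enumerate(terms)]


# Four colours, six query terms: the 5th and 6th terms reuse the 1st and 2nd colours.
colours = ["yellow", "skyblue", "lightgreen", "plum"]
pairs = assign_tags(["one", "two", "three", "four", "five", "six"], colours)

for term, colour in pairs:
    print(term, "->", colour)
```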

Logs (if relevant)

No response

elasticsearchmachine commented 5 months ago

Pinging @elastic/es-search (Team:Search)

Mrodent commented 4 months ago

Bumping this... Could anyone on the ES team take a look? It does seem hobbled, if not broken. And it also appears to represent a regression, since with 7.10.2 you can have, I think, as many colours as you want.

Also please note that this regression was first raised (by me) in relation to 7.16.3, in March 2022 (issue 84690). The problem was assigned to romseygeek (ES team member) and duly reported as fixed.

But something still seems to need mending.

benwtrent commented 4 months ago

@Mrodent I just tried replicating your failure in a Kibana console in 8.13 and could not.

# Create the index
PUT test_highlight
{
  "mappings": {
    "properties": {
      "esdoc_text": {
        "type": "text",
        "term_vector": "with_positions_offsets",
        "fields": {
          "stemmed": {
            "type": "text",
            "analyzer": "english",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}

# index some docs
POST test_highlight/_doc
{
  "esdoc_text": "The fox jumped over the lazy dog"
}

# search over all fields for fox
POST _search
{
  "query": {
    "simple_query_string": {
      "query": "fox",
      "fields": []
    }
  },
  "highlight": {
    "fields": {
      "esdoc_text.stemmed": {
        "type": "fvh",
        "pre_tags": [
          """<span style="background-color: yellow">""",
          """<span style="background-color: skyblue">""",
          """<span style="background-color: lightgreen">""",
          """<span style="background-color: plum">""",
          """<span style="background-color: lightcoral">""",
          """<span style="background-color: silver">"""
        ],
        "post_tags": [
          "</span>",
          "</span>",
          "</span>",
          "</span>",
          "</span>",
          "</span>"
        ]
      }
    },
    "number_of_fragments": 0
  }
}

Is this failure repeatable for you when executing directly in the Kibana console? If so, could you provide some minimal steps for reproducing?

If you cannot provide the steps, could you return the stack trace for the failure? The way to do this is to provide the error_trace URL parameter, which will return the full stack trace of the failure.
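As a minimal sketch of what that looks like from Python, the parameter is simply appended to the search URL. The host `http://localhost:9200`, the index name `test_highlight`, and the helper `search_url` are assumptions made for illustration; nothing is sent over the network here.

```python
from urllib.parse import urlencode


def search_url(base_url, index, error_trace=True):
    """Build a _search URL, optionally asking Elasticsearch for full stack traces."""
    url = f"{base_url}/{index}/_search"
    if error_trace:
        url += "?" + urlencode({"error_trace": "true"})
    return url


# Assumed local node and index name, for illustration only.
url = search_url("http://localhost:9200", "test_highlight")
print(url)
```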

Mrodent commented 4 months ago

Thanks for the response. It usually occurs when the search string has multiple terms: I'd try it with 3, 4, 5 etc. However, I don't have Kibana installed, but I do have Insomnia ... and testing there (sorry, should have thought of this before!) seems to confirm what you are saying: rock solid as far as I can tell.

So my working assumption is now that my code is somehow messing things up as I send the request with the query. I am in fact using a utility method to do that: prime suspect for the moment.

benwtrent commented 4 months ago

@Mrodent getting quotations and escapes correct is a pain for sure :). It would be interesting to see the full request printed out exactly as it is sent to Elasticsearch. That would help narrow down the cause.
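One way to do that from Python, sketched here with only the standard library, is to build the request without sending it and print the exact bytes. The URL and index name are assumptions, and the utility method mentioned earlier in the thread is not shown, so this stands in for it rather than reproducing it.

```python
import json
import urllib.request

body = {
    "query": {
        "simple_query_string": {
            "query": "linux one two",
            "fields": ["text_content.stemmed"],
        }
    },
    "highlight": {
        "fields": {
            "text_content.stemmed": {
                "type": "fvh",
                "pre_tags": ['<span style="background-color: yellow">'],
                "post_tags": ["</span>"],
            }
        }
    },
}

# Serialise once, explicitly as UTF-8 bytes, so what is printed is what would be sent.
payload = json.dumps(body, ensure_ascii=False).encode("utf-8")

# Assumed local node and index; the request is constructed but never opened.
req = urllib.request.Request(
    "http://localhost:9200/test_highlight/_search",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(req.get_method(), req.full_url)
print(req.header_items())
print(req.data.decode("utf-8"))
print(len(req.data), "bytes")
```

Comparing this output with the body that Insomnia sends should show whether the two clients are really transmitting the same thing.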

Mrodent commented 4 months ago

@benwtrent Solved the problem ... You'll see in my original post that I was using single quotes when constructing the JSON dict. This was the problem! In fact this is a mixed English Analyzer and Greek Analyzer situation: both stemmer fields are attached to the (normalised) field text, so I can do a mixture of Latin script words and Greek script words in my query.

But in fact any amount of Greek in the query string was still causing the problem, in Python, even by calling requests.get(...) directly ... but not in Insomnia.

Changing to double quotes in the data dict that constructs the query solved it in Python. I find this very puzzling!

Anyway thanks for your help and suggestion to do a direct call to the server.
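A quick check that may help with the puzzle, offered as a sketch rather than a diagnosis: Python treats single-quoted and double-quoted string literals identically, so both spellings of the dict serialise to the same JSON. What does change once Greek is in the query is that the UTF-8 byte length of the body exceeds its character count. Whether that difference played any part in the failure above is an assumption that this thread does not confirm; the Greek word in the example is made up for illustration.

```python
import json

# The same query written with single and with double quotes.
single = {'query': {'simple_query_string': {'query': 'linux λόγος',
                                            'fields': ['text_content.stemmed']}}}
double = {"query": {"simple_query_string": {"query": "linux λόγος",
                                            "fields": ["text_content.stemmed"]}}}

same_dict = single == double
text = json.dumps(single, ensure_ascii=False)
same_json = text == json.dumps(double, ensure_ascii=False)

# Character count versus encoded byte count of the serialised body.
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))

print("dicts equal:", same_dict)
print("JSON equal:", same_json)
print("characters:", n_chars, "bytes:", n_bytes)
```

If the two lengths differ, any step that measures or slices the body by characters instead of bytes would be worth a look.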