elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

FVH highlighter fails to properly highlight multi phrase query #105077

Open mayya-sharipova opened 9 months ago

mayya-sharipova commented 9 months ago

The FVH highlighter fails to properly highlight a multi phrase query with many terms.

Elasticsearch Version

V8.12

Steps to reproduce

PUT index1
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "term_vector": "with_positions_offsets",
        "analyzer": "english"
      }
    }
  }
}

// When indexing just a single document: 

POST index1/_bulk?refresh=true
{ "index" : {"_id": 1} }
{"content": "Random Mental Health Screen Random"}

// And running this search:
GET index1/_search
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "match_phrase_prefix": {
            "content": {
              "query": "mental health screen",
              "slop": 2
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "matched_fields": ["content"],
        "type": "fvh"
      }
    }
  }
}

We get the expected output for highlight:

"highlight": {
          "content": [
            "Random <em>Mental Health Screen</em> Random Ranom material. With ScreenA in project screen random random."
          ]
        }

But when many docs are indexed:

POST index1/_bulk?refresh=true
{ "index" : {"_id": 1} }
{"content": "Random Mental Health Screen Random Ranom material. With ScreenA in project screen random random."}
{ "index" : {"_id": 2} }
{ "content": "screenin"}
{ "index" : {"_id": 3} }
{ "content": "screen:non"}
{ "index" : {"_id": 4} }
{ "content": "screenn"}
{ "index" : {"_id": 5} }
{"content": ":screenina"}
{ "index" : {"_id": 6} }
{ "content": "screeningst:screen"}
{ "index" : {"_id": 7} }
{"content": "screeningh"}
{ "index" : {"_id": 8} }
{"content": "screenincjwritten"}
{ "index" : {"_id": 9} }
{"content": ":screen___abg"}
{ "index" : {"_id": 10} }
{"content": "screenshot"}
{ "index" : {"_id": 11} }
{ "content": "screeninb"}
{ "index" : {"_id": 12} }
{"content": "screenihg"}
{ "index" : {"_id": 13} }
{"content": "screen"}
{ "index" : {"_id": 14} }
{"content": "screeninq"}
{ "index" : {"_id": 15} }
{ "content": "screen:lab"}
{ "index" : {"_id": 16} }
{ "content": "screeninci"}
{ "index" : {"_id": 17} }
{"content": "screener"}

The output is incorrect:

"highlight": {
          "content": [
            "Random <em>Mental</em> <em>Health</em> <em>Screen</em> Random Ranom material. With <em>ScreenA</em> in project <em>screen</em> random random."
          ]
        }

elasticsearchmachine commented 9 months ago

Pinging @elastic/es-search (Team:Search)

mayya-sharipova commented 9 months ago

This happens because of the way we rewrite multi phrase queries in CustomFieldQuery. When there are more than 16 terms for the wildcard part ("screen"), we rewrite the phrase query as individual term queries. This is an old change that was made to protect a node from running out of memory on huge phrase queries.
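To make the fallback concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not the actual CustomFieldQuery code: the function name, the tuple representation of queries, and counting terms across all phrase positions are assumptions made for illustration. Only the threshold of 16 comes from the description.

```python
from itertools import product

# Threshold named in the comment above; everything else here is illustrative.
MAX_TERMS_BEFORE_FALLBACK = 16


def rewrite_for_highlighting(positions):
    """Model the rewrite; `positions` holds the candidate terms per phrase position."""
    # Assumption: terms are counted across all positions of the phrase.
    total_terms = sum(len(terms) for terms in positions)
    if total_terms > MAX_TERMS_BEFORE_FALLBACK:
        # Fallback: phrase structure is dropped, so each term is highlighted alone.
        return [(term,) for terms in positions for term in terms]
    # Otherwise keep one phrase per combination of candidate terms.
    return list(product(*positions))


small = rewrite_for_highlighting([["mental"], ["health"], ["screen", "screener"]])
large = rewrite_for_highlighting(
    [["mental"], ["health"], [f"screen{i}" for i in range(17)]]
)
print(small)       # two 3-term phrases
print(len(large))  # 19 single-term queries
```

With few expansions the phrases survive, which corresponds to a single `<em>` around the whole match. Past the threshold only single terms remain, which corresponds to each word being wrapped in its own `<em>`, including stray matches such as "ScreenA".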


Workaround 1: use the unified highlighter instead, which produces the expected output. Limitation: the unified highlighter doesn't work with the matched_fields option.

"highlight": {
    "fields": {
      "content": {
        "matched_fields": ["content"],
        "type": "unified"
      }
    }
  }

Workaround 2: keep the fvh highlighter but use a smaller max_expansions value. Limitation: terms for expansion are chosen alphabetically, so the query only matches the first max_expansions terms in that order.

GET index1/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "mental health screen",
        "slop": 2,
        "max_expansions": 13
      }
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "matched_fields": [
          "content"
        ],
        "type": "fvh"
      }
    }
  }
}
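The limitation of this workaround can be illustrated with a short Python sketch. It is a rough model under stated assumptions: `expand_prefix` is a hypothetical helper, a plain string sort stands in for the index's term order, and the term list is made up rather than taken from the analyzed documents above.

```python
def expand_prefix(indexed_terms, prefix, max_expansions):
    """Return the first `max_expansions` terms starting with `prefix`, in sorted order."""
    matches = sorted(t for t in set(indexed_terms) if t.startswith(prefix))
    return matches[:max_expansions]


# Hypothetical analyzed terms; the real values depend on the analyzer.
terms = ["screenshot", "screen", "screener", "screenin", "screenn"]
kept = expand_prefix(terms, "screen", 3)
print(kept)  # ['screen', 'screener', 'screenin']
```

In this model a phrase ending in "screenshot" would no longer match, because that term sorts after the three that were kept.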
elasticsearchmachine commented 4 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)