FVH highlighter duplicates the highlight text

jacool commented 6 years ago

Elasticsearch version (bin/elasticsearch --version): 6.2.3

Plugins installed: []

JVM version (java -version): openjdk version "1.8.0_161"

OS version (uname -a if on a Unix-like system): Linux 5137c3a21142 4.9.87-linuxkit-aufs

Description of the problem including expected versus actual behavior: When using the "sentence" mode with FVH highlighter some highlight texts are returned fully or partially duplicated. See the reproduction example below. The second returned highlight contains the first one fully, thus users would expect only one highlight being returned (the second one) with the first occurrence of the word "go" emphasized as well as the second occurrence. (As is usually the case with this highlighter when the searched word appears nearby several times). Another acceptable solution would be that the first sentence would not appear in the second highlight at all.

Expected result: "I don't have access to his calendar but let me <em>go</em> and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me <em>go</em> on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates. "

In this specific case there is a partial duplication of highlight texts, we have observed full duplication as well in other cases.

Steps to reproduce:

PUT /test
{
  "mappings": {
    "t": {
      "properties": {
        "message": {
          "type": "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

POST /test/t/1
{
    "message": "I don't have access to his calendar but let me go and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me go on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates."
}

GET /test/_search
{
  "version": true,
  "query": {
    "match_phrase": {
      "message": {
        "query": "go"
      }
    }
  },
  "highlight": {
    "fields": {
      "message": {
        "boundary_max_scan":100,
        "fragment_size":100,
        "type":"fvh",
        "number_of_fragments":20,
        "boundary_scanner":"sentence"
      }
    }
  }
}

Results:

"highlight": {
          "message": [
            "I don't have access to his calendar but let me <em>go</em> and have a chat to him because I'm I'm really came to get just to get this up and running. ",
            "I don't have access to his calendar but let me go and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me <em>go</em> on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates. "
          ]
        }

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

mayya-sharipova commented 6 years ago

@jacool Thanks for reporting this issue. I can reproduce this as well.

@jimczi The bug is on Lucene level, how fragments are being built in BaseFragmentsBuilder.java. First, we get temporary fragInfos with the offsets [0,100] and [105,205] with requested fragment_size. Then here, we go through each fragInfo, and building textual fragments from it, using Java's sentence breakIterator. Java's breakIterator finds the end_offset of the 1st textual fragment as 141. But we don't update the second fragInfo with this new information, that it now should start with 142, and not 105 anymore.

@jimczi do you think, we should be fixing it, or with will wait when @romseygeek rewrites highlighters with his new match info?

jimczi commented 6 years ago

@jimczi do you think, we should be fixing it, or with will wait when @romseygeek rewrites highlighters with his new match info?

The unified highlighter should produce the expected snippets so if the fix is not trivial I'd advise to switch to the default highlighter in 6. @jacool any reason to use the fvh highlighter rather than the unified ? The unified highlighter automatically detects if a field is indexed with term_vectors and can also detect sentences so you should be able to get what you want.

jacool commented 6 years ago

@jimczi The unified highlighter is of no value to us due to Issue #29561

jimczi commented 6 years ago

ok thanks @jacool, #29561 needs a fix in Lucene, I'll dig.

mayya-sharipova commented 6 months ago

Still an issue in Elasticsearch 8.12 with fvh highlighter, no such problem for unified highlighter

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)

elastic / elasticsearch

FVH highlighter duplicates the highlight text #29654