Open jacool opened 6 years ago
Pinging @elastic/es-search-aggs
@jacool Thanks for reporting this issue. I can reproduce this as well.
@jimczi The bug is at the Lucene level, in how fragments are built in BaseFragmentsBuilder.java. First, we get temporary fragInfos with the offsets [0,100] and [105,205], based on the requested fragment_size. Then we go through each fragInfo and build a textual fragment from it, using Java's sentence BreakIterator. The BreakIterator finds the end_offset of the first textual fragment to be 141, but we don't update the second fragInfo with this new information: it should now start at 142, not at 105.
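To make the mechanism concrete, here is a minimal, self-contained sketch of the effect described above. It is not Lucene's BaseFragmentsBuilder code: the class name, the sample text, the window offsets, and the "adjusted" variant are all hypothetical and only illustrate how expanding fixed-size windows to sentence boundaries independently can produce overlapping fragments, and how carrying the previous fragment's end offset forward would avoid that.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; this is NOT Lucene's BaseFragmentsBuilder.
final class FragmentOverlapSketch {

    // Hypothetical sample text made of three sentences.
    static final String TEXT =
        "Let me go and talk to him. So let me go on when he comes in. I will let you know.";

    // Hypothetical fixed-size windows, standing in for the temporary fragInfos.
    static final int[][] WINDOWS = { {0, 20}, {25, 45} };

    /** Widens [start, end) to the enclosing sentence boundaries. */
    static int[] expandToSentences(String text, int start, int end) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);
        // Clamp the offsets so the BreakIterator never sees an out-of-range value.
        int s = Math.max(0, Math.min(start, text.length()));
        int e = Math.max(s, Math.min(end, text.length()));
        int newStart = bi.isBoundary(s) ? s : bi.preceding(s);
        int newEnd = bi.isBoundary(e) ? e : bi.following(e);
        if (newStart == BreakIterator.DONE) newStart = 0;
        if (newEnd == BreakIterator.DONE) newEnd = text.length();
        return new int[] { newStart, newEnd };
    }

    /** Expands every window independently, so neighbouring fragments may overlap. */
    static List<int[]> buildNaive(String text, int[][] windows) {
        List<int[]> out = new ArrayList<>();
        for (int[] w : windows) {
            out.add(expandToSentences(text, w[0], w[1]));
        }
        return out;
    }

    /** Carries the previous fragment's end forward so fragments never overlap. */
    static List<int[]> buildAdjusted(String text, int[][] windows) {
        List<int[]> out = new ArrayList<>();
        int prevEnd = 0;
        for (int[] w : windows) {
            if (w[1] <= prevEnd) {
                continue; // this window is already covered by the previous fragment
            }
            int[] f = expandToSentences(text, Math.max(w[0], prevEnd), w[1]);
            f[0] = Math.max(f[0], prevEnd); // never start before the previous end
            if (f[1] > f[0]) {
                out.add(f);
                prevEnd = f[1];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] f : buildNaive(TEXT, WINDOWS)) {
            System.out.println("naive:    [" + f[0] + "," + f[1] + ") " + TEXT.substring(f[0], f[1]));
        }
        for (int[] f : buildAdjusted(TEXT, WINDOWS)) {
            System.out.println("adjusted: [" + f[0] + "," + f[1] + ") " + TEXT.substring(f[0], f[1]));
        }
    }
}
```

Running `main` prints the fragments for both strategies. In the naive variant the second fragment starts before the first one ends, which mirrors the duplicated highlight text reported in this issue. Whether a real fix in Lucene would take the same shape is an open question.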
@jimczi do you think we should fix it, or should we wait until @romseygeek rewrites the highlighters with his new match info?
The unified highlighter should produce the expected snippets, so if the fix is not trivial I'd advise switching to the default highlighter in 6. @jacool any reason to use the fvh highlighter rather than the unified one? The unified highlighter automatically detects whether a field is indexed with term_vectors and can also detect sentences, so you should be able to get what you want.
@jimczi The unified highlighter is of no value to us due to Issue #29561
ok thanks @jacool, #29561 needs a fix in Lucene, I'll dig.
Still an issue in Elasticsearch 8.12 with the fvh highlighter; no such problem with the unified highlighter.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Elasticsearch version (bin/elasticsearch --version): 6.2.3
Plugins installed: []
JVM version (java -version): openjdk version "1.8.0_161"
OS version (uname -a if on a Unix-like system): Linux 5137c3a21142 4.9.87-linuxkit-aufs
Description of the problem including expected versus actual behavior: When using the "sentence" mode with the FVH highlighter, some highlight texts are returned fully or partially duplicated. See the reproduction example below. The second returned highlight contains the first one fully, so users would expect only one highlight to be returned (the second one), with both the first and the second occurrence of the word "go" emphasized, as is usually the case with this highlighter when the searched word appears several times close together. Another acceptable solution would be for the first sentence not to appear in the second highlight at all.
Expected result:
"I don't have access to his calendar but let me <em>go</em> and have a chat to him because I'm I'm really came to get just to get this up and running. So if let me <em>go</em> on the when he comes in I'll have to put it down and try and look at a time to do this and then I'll let you know as soon as possible candidates. "
In this specific case there is a partial duplication of highlight texts; we have observed full duplication in other cases as well.
Steps to reproduce:
Results: