paulintrognon closed this issue 7 years ago
To add to this:
I think there is an issue with how spans are calculated for highlighting in the plain highlighter when using the span fragmenter with multiple phrases. In this case it seems to lose track of spans and effectively collects one huge fragment at the end which, having multiple matches and therefore a higher score, becomes the top "match" to highlight.
@paulintrognon, until the root cause is fully established, you could try setting "fragmenter": "simple"
explicitly, which does not try to be smart about keeping matches of the various parts of your search together.
E.g. try this:
POST /test/chapter/_search
{
"fields": "title",
"query": {
"match_phrase": {
"content": {
"query": "credit card",
"slop": 0
}
}
},
"highlight": {
"fields": {
"content": {
"fragment_size": 50,
"number_of_fragments": 5,
"type": "plain",
"fragmenter" : "simple"
}
}
}
}
I think this provides relatively useful matches respecting the fragment size.
@markharwood you were looking at the plain highlighter recently. Could you investigate this one too please?
Will do
Looking deeper, the SimpleSpanFragmenter prioritizes accuracy of match reporting over delivering the requested fragment sizes. It is not possible both to summarize and to reflect all query logic accurately, and this is an example case where the trade-off being made is questionable: in pursuit of accurately reporting phrase matches, this fragmenter temporarily overrides the fragment size limit to try to tie together reported sightings of phrase components that would otherwise straddle the fragments introduced by the summarization logic.
I can see from the code that there is a deliberate policy of ignoring fragment sizes while connecting phrase elements for this query/doc, so the override of your chosen fragment size is "working as designed", but arguably not doing a fantastic job.
For the record: the Wikimedia Foundation's "experimental-highlighter" plugin, which they use on Wikipedia, looks to do a decent job of summarising this text and is known to be significantly faster than the plain highlighter. See https://github.com/wikimedia/search-highlighter
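If I recall correctly, that plugin registers its own highlighter type (named "experimental" at the time), so switching over is close to a drop-in change to the highlight block. Something like the following, though check the plugin's README for the exact type name and supported options in your version:

```json
POST /test/chapter/_search
{
  "query": {
    "match_phrase": {
      "content": "credit card"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 50,
        "number_of_fragments": 5,
        "type": "experimental"
      }
    }
  }
}
```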
OK, this is very interesting.
I have used the simple fragmenter to get around this issue, and it worked for me. Thank you all!
This appears to be fixed in 5.0 or before
Hi there,
I am trying to get highlights from a field that contains a lot of English text. I built an index that uses the standard analyzer with the English stopwords. The highlights I get when I perform a search can be very short or quite big, regardless of the fragment_size option. Here is my test setup:
Settings
POST /test
Mapping
POST /test/chapter/_mapping
Test document (with a big content)
POST /test/chapter
ES Query
POST /test/chapter/_search
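(The full request bodies were collapsed above; a minimal sketch of the setup, assuming a standard analyzer with English stopwords on the content field and pre-5.0 string mappings, would look roughly like this. Analyzer and field names other than content are illustrative.)

```json
POST /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST /test/chapter/_mapping
{
  "chapter": {
    "properties": {
      "content": {
        "type": "string",
        "analyzer": "english_standard"
      }
    }
  }
}

POST /test/chapter/_search
{
  "query": {
    "match_phrase": {
      "content": "credit card"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 10,
        "type": "plain"
      }
    }
  }
}
```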
If you try this out, you'll see that the highlight returned is as big as the _source.content; it just doesn't care about the "fragment_size": 10 setting. Thank you for investigating this issue!