Open chongzhe opened 10 years ago
I would be interested in this too, if it's technically possible.
My use case would be: the string I'm indexing/analyzing/highlighting has additional annotation data that I already use to mark up other elements, and when I receive the highlighting information I have to combine both.
An example:
Contains a http://link.to.somewhere/ and @someuser someone
Contains a <a href="http://link.to.somewhere/">http://link.to.somewhere/</a> and <a href="https://some.where/profile/231321131">@someuser</a> someone
some*
I have to include this appropriately now. Having the raw offsets, based on my indexed data, available would help. It is possible to do this on the client by using special highlight tags to parse out that information, but it would be easier if it were already sent from ES.
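For reference, the client-side workaround mentioned above could look roughly like this sketch. It assumes the field is highlighted with number_of_fragments set to 0 (so the whole field comes back as one string) and that the chosen pre/post tags never occur in the raw text. The marker characters and the name offsets_from_tagged are made up for illustration, and the offsets count Python string characters, which may differ from what ES would count for text outside the Basic Multilingual Plane.

```python
def offsets_from_tagged(tagged, pre="\u0001", post="\u0002"):
    """Strip highlight tags and return (raw_text, spans).

    Each span is {"start": ..., "length": ...}, relative to raw_text.
    The default pre/post markers are arbitrary control characters (an assumption).
    """
    raw_parts = []
    spans = []
    pos = 0       # read position in the tagged string
    raw_len = 0   # length of raw text emitted so far
    while True:
        tag_start = tagged.find(pre, pos)
        if tag_start == -1:
            raw_parts.append(tagged[pos:])
            break
        tag_end = tagged.find(post, tag_start + len(pre))
        if tag_end == -1:
            raise ValueError("unbalanced highlight tags")
        before = tagged[pos:tag_start]
        match = tagged[tag_start + len(pre):tag_end]
        raw_parts.append(before)
        raw_len += len(before)
        spans.append({"start": raw_len, "length": len(match)})
        raw_parts.append(match)
        raw_len += len(match)
        pos = tag_end + len(post)
    return "".join(raw_parts), spans
```

Fed the example string with only "someuser" wrapped in the markers, this yields the raw text back plus a single span with start 42 and length 8.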
Example query:
{
  "query": {
    "wildcard": {
      "field": {
        "value": "some*"
      }
    }
  },
  "highlight": {
    "fields": {
      "field": {
        "number_of_fragments": 0,
        "return_as": "offsets"
      }
    }
  }
}
Example result payload:
{
  "hits": {
    "hits": [
      {
        "_index": "myindex",
        "_type": "Something",
        "_id": "asd92jsdfodsfsf",
        "_score": 1,
        "_source": {
          "field": "Contains a http://link.to.somewhere/ and @someuser someone"
        },
        "highlight": {
          "field": [
            {
              "start": 42,
              "length": 8
            }
          ]
        }
      },
My imaginary return_as would denote the format:
- content would be the current behaviour
- offsets would return the start and length tuple objects
- content_offsets would return both, e.g.
"highlight": {
  "field": [
    {
      "start": 42,
      "length": 8,
      "content": "<em>someuser</em>"
    }
  ]
}
As can be seen, due to the additional data the format has to change from an array of strings to an array of objects.
This was just quickly off the top of my head. I bet I forgot to think about a ton of other information :-)
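To make the use case concrete: if offsets were available in the start/length form proposed above, combining them with existing annotations could go roughly as follows. This is only a sketch under stated assumptions: the function name render and the span tuple layout are invented here, and spans are assumed to be disjoint or properly nested (a highlight that straddles a link boundary is not handled).

```python
import html


def render(text, spans):
    """Wrap each (start, length, open_tag, close_tag) span around its text range.

    Assumes spans are disjoint or properly nested; partial overlaps are not handled.
    Text between tags is HTML-escaped, the tags themselves are emitted as given.
    """
    events = []
    for i, (start, length, open_tag, close_tag) in enumerate(spans):
        if length <= 0:
            continue
        end = start + length
        # At equal positions: closing tags (kind 0) sort before opening tags (kind 1),
        # inner spans close first, and outer spans open first.
        events.append((end, 0, length, -i, close_tag))
        events.append((start, 1, -length, i, open_tag))
    events.sort()

    out = []
    pos = 0
    for offset, _kind, _rank, _index, tag in events:
        out.append(html.escape(text[pos:offset], quote=False))
        out.append(tag)
        pos = offset
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


text = "Contains a http://link.to.somewhere/ and @someuser someone"
spans = [
    (11, 25, '<a href="http://link.to.somewhere/">', "</a>"),            # existing link
    (41, 9, '<a href="https://some.where/profile/231321131">', "</a>"),  # existing mention
    (42, 8, "<em>", "</em>"),                                            # highlight offsets
]
print(render(text, spans))
```

Here the highlight ends up nested inside the mention link, which is the part that is awkward to get right when only pre-tagged highlight strings are returned.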
+1
It's been more than a year since the issue was opened... is there any update on this?
I just cracked open Eclipse to implement it in the experimental highlighter plugin. But now I'm playing with my kids.
And patch proposed: https://gerrit.wikimedia.org/r/#/c/209956
If you are willing to use the plugin then that might be enough for you.
I think @mfn can get what he wants from it using the none fragmenter. One thing, though, is that Elasticsearch limits me to returning text as the result of the highlight request so I can't make fancy json and you'll have to break out the Splitters.
Wow, thanks for the effort.
Elasticsearch limits me to returning text as the result of the highlight request so I can't make fancy json
Sad to hear, that's a bit of a bummer though. I'm trying to deduce a possible format from the test case result, could it look like this?
0:0-5,18-22:22
{
  "start_offset": 0,
  "end_offset": 22,
  "hits": [
    {
      "start_offset": 0,
      "end_offset": 5
    },
    {
      "start_offset": 18,
      "end_offset": 22
    }
  ]
}
23:33-37:37
{
  "start_offset": 23,
  "end_offset": 37,
  "hits": [
    {
      "start_offset": 33,
      "end_offset": 37
    }
  ]
}
I understand it closely resembles the Snippet class, but some remarks:
- start_offset + length = end_offset instead of the negative approach of end_offset - start_offset = length, i.e. I'd prefer having the length instead of the end, simply because most string APIs I work with work with lengths
- content, i.e. what was highlighted, doesn't seem to be available (see https://github.com/elastic/elasticsearch/issues/5736#issuecomment-98389774 )
Well, nothing really of value to add I guess, still glad someone's igniting the fire :-)
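If the plugin output really has the shape guessed above (snippet start, then comma-separated hit ranges, then snippet end, all colon-separated), turning it into length-based objects on the client is a few lines. Note that this format is inferred from the test case in the patch, not from any documented contract, and the name parse_snippet is made up.

```python
def parse_snippet(encoded):
    """Parse 'start:hs-he,hs-he:end' into offsets with lengths instead of ends.

    The input format is a guess based on the plugin's test output (an assumption).
    """
    start_str, hits_str, end_str = encoded.split(":")
    start, end = int(start_str), int(end_str)
    hits = []
    for part in filter(None, hits_str.split(",")):
        hit_start, hit_end = (int(x) for x in part.split("-"))
        hits.append({"start_offset": hit_start, "length": hit_end - hit_start})
    return {"start_offset": start, "length": end - start, "hits": hits}
```

For "0:0-5,18-22:22" this gives a snippet of length 22 with hits of length 5 and 4.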
+1. How can I use this? Is it not fixed yet?
I wonder why this issue has remained open for more than 3 years. @javanna, is there any milestone for this helpful feature, bro?
Highlighting is just not at the top of anyone's list. And highlighting is complicated because there are 4 highlighters so every feature needs to be implemented four times. One day, someone is going to start really caring about highlighting again and will probably go and do this. As you can see from the issue history, highlighting used to be a big deal to me, now I have other things I spend my days working on.
Any updates on this issue?
+1 this would be very helpful
cc @elastic/es-search-aggs
+1 any updates? We want to avoid running our own custom highlighter on top of ES results to avoid an increase in time complexity. Hence, would be great if ES can provide it out of the box.
We talked about it in #29631 but didn't come to a good conclusion. I think we'll spend more time on this soon though.
Any updates on this? Our users could profit a lot from knowing where in a document their hits were generated. We'll probably look into implementing our own solution otherwise, but that would be a bit unfortunate, since it seems like all the info is already there on the Elastic side.
Any updates?
A possible workaround for this could be to decorate the text with a text marker, which can later be removed. We use this approach for storing metadata on a per-word basis and it has so far been decent.
Any updates on implementing this as a feature in Elasticsearch itself? I'm working with the plugin, but this means I always have to set up ES manually...
Any updates on this? Can we bump this up in priority?
+1
It would be nice to have this feature. In our particular use case, we will need postprocessing based on searching the snippet text in the full property field (avoiding possible repetitions, mismatches, etc.). This is especially 'frustrating' when the documentation itself shows that this info is available and even used to calculate the highlighted words. It would be nice to have an optional config parameter to just return this offset info too.
The last part of this highlighting documentation section:
For our example, we have a single passage with the following properties (showing only a subset of the properties here):
Passage:
    startOffset: 147
    endOffset: 189
    score: 3.7158387
    matchStarts: [159, 164]
    matchEnds: [163, 167]
    numMatches: 2
Notice how a passage has a score, calculated using the BM25 scoring formula adapted for passages. Scores allow us to choose the best scoring passages if there are more passages available than the requested by the user number_of_fragments. Scores also let us to sort passages by order: "score" if requested by the user.
As the final step, the unified highlighter will extract from the field’s text a string corresponding to each passage:
"I'll be the only fox in the world for you."
and will format with the tags <em> and </em> all matches in this string using the passages’s matchStarts and matchEnds information:
I'll be the <em>only</em> <em>fox</em> in the world for you.
This kind of formatted strings are the final result of the highlighter returned to the user.
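The formatting step the documentation describes can be reproduced from the passage properties alone, which shows that the offsets would be enough for a client to rebuild the usual output. This is a sketch, not the highlighter's actual code; format_passage is a made-up name, and since the full field text is not quoted here, the example pads the sentence with 147 placeholder characters so the offsets line up.

```python
def format_passage(field_text, start_offset, end_offset, match_starts, match_ends,
                   pre_tag="<em>", post_tag="</em>"):
    """Extract field_text[start_offset:end_offset] and wrap every match in tags."""
    out = []
    pos = start_offset
    for m_start, m_end in zip(match_starts, match_ends):
        out.append(field_text[pos:m_start])                          # text before the match
        out.append(pre_tag + field_text[m_start:m_end] + post_tag)   # the match itself
        pos = m_end
    out.append(field_text[pos:end_offset])                           # rest of the passage
    return "".join(out)


sentence = "I'll be the only fox in the world for you."
field_text = "x" * 147 + sentence  # placeholder padding (assumption), not the real field
print(format_passage(field_text, 147, 189, [159, 164], [163, 167]))
# prints: I'll be the <em>only</em> <em>fox</em> in the world for you.
```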
Anyway, thank you for the ES ecosystem. It is just a suggestion for an a priori small change that leads to an old (2014) and very useful feature for many people, as seen in all the previous comments.
+1 Even SQLite can do this.
+1 This feature would be very helpful for all use cases where you want to find the exact highlighted text. In my case, for example, I need to find the preceding and succeeding n tokens around each highlight. This would be significantly easier if offsets were returned, instead of me re-parsing with complex logic to find the tags (especially hard for a match phrase query with multiple tokens and multiple matches in a single document, also because of this bug: https://github.com/elastic/elasticsearch/issues/29561).
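To illustrate why offsets would simplify this: with a start and length in hand, the surrounding tokens fall out of two slices. The sketch below assumes plain whitespace tokenization, which will not match what an ES analyzer produces in general, and context_tokens is a made-up helper name.

```python
import re


def context_tokens(text, start, length, n):
    """Return up to n whitespace-delimited tokens before and after a highlight.

    Whitespace splitting is an assumption; a real analyzer may tokenize differently.
    """
    before = re.findall(r"\S+", text[:start])
    after = re.findall(r"\S+", text[start + length:])
    preceding = before[-n:] if n > 0 else []
    return preceding, after[:n]
```

For the sentence "I'll be the only fox in the world for you." and the highlight "only" (start 12, length 4), asking for two tokens gives "be", "the" before and "fox", "in" after.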
+1
+1
Currently the highlighting result is returned in between pre-tag and post-tag in a context. Is it possible to provide an api that returns the offset of this highlighted term in the field of the document so more flexible operation can be done?