[Feature Request] : get the offset of a highlighted field

chongzhe commented 10 years ago

Currently the highlighting result is returned between the pre-tag and post-tag within a context. Is it possible to provide an API that returns the offset of the highlighted term in the document's field, so that more flexible operations are possible?

mfn commented 9 years ago

I would be interested in this too, if it's technically possible.

My use case would be: the string I'm indexing/analyzing/highlighting has additional annotation data that I already use to mark up other elements, and when I receive the highlighting information I have to combine both.

An example:

String I'm indexing:
Contains a http://link.to.somewhere/ and @someuser someone
Due to my own annotation data, this gets turned into:
Contains a <a href="http://link.to.somewhere/">http://link.to.somewhere/</a> and <a href="https://some.where/profile/231321131">@someuser</a> someone
When I mix this with the highlighted result from ES, e.g. when searching for some*, I have to merge the two appropriately.

Having the raw offsets, relative to my indexed data, would help. It is possible to do this on the client by using special highlight tags and parsing that information out, but it would be easier if ES already sent it.
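The client-side approach mentioned here could look roughly like the following. This is only a sketch under a few assumptions: the request uses `number_of_fragments: 0` so the whole field comes back as one highlighted string, the tags set through `pre_tags`/`post_tags` never occur in the indexed text itself, and the encoder leaves the text unescaped. The function name `extract_offsets` is made up for illustration.

```python
def extract_offsets(highlighted, pre_tag="<em>", post_tag="</em>"):
    """Strip highlight tags and return (plain_text, [(start, length), ...]).

    Offsets are relative to the tag-free text, i.e. the original field
    value when the whole field was returned as a single fragment.
    """
    plain_parts = []
    offsets = []
    plain_len = 0  # length of the tag-free text collected so far
    cursor = 0     # read position inside the highlighted string
    while True:
        open_at = highlighted.find(pre_tag, cursor)
        if open_at == -1:
            break
        close_at = highlighted.find(post_tag, open_at + len(pre_tag))
        if close_at == -1:
            break  # unbalanced tag: keep the remainder as plain text
        before = highlighted[cursor:open_at]
        match = highlighted[open_at + len(pre_tag):close_at]
        plain_parts.append(before)
        plain_len += len(before)
        offsets.append((plain_len, len(match)))
        plain_parts.append(match)
        plain_len += len(match)
        cursor = close_at + len(post_tag)
    plain_parts.append(highlighted[cursor:])
    return "".join(plain_parts), offsets


highlighted = ("Contains a http://link.to.somewhere/ and "
               "@<em>someuser</em> <em>someone</em>")
plain, offsets = extract_offsets(highlighted)
print(plain)
print(offsets)
```

Note that the offsets count Python string characters, which may differ from the offsets used inside ES for text containing characters outside the Basic Multilingual Plane.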

Example query:

{
  "query": {
    "wildcard": {
      "field": {
        "value": "some*"
      }
    }   
  },
  "highlight": {
    "fields": {
      "field": {
        "number_of_fragments": 0,
        "return_as": "offsets"
      }
    }
  }
}

Example result payload:

{
   "hits": {
      "hits": [
         {
            "_index": "myindex",
            "_type": "Something",
            "_id": "asd92jsdfodsfsf",
            "_score": 1,
            "_source": {
               "field": "Contains a http://link.to.somewhere/ and @someuser someone"
            },
            "highlight": {
               "field": [
                  {
                    "start": 42,
                    "length": 8
                  }
               ]
            }
         }
      ]
   }
}

My imaginary return_as would denote the format:

missing or "content" would be the current behaviour
"offsets" would return the start and length tuple objects
"content_offsets" would return both, e.g.

"highlight": {
   "field": [
      {
        "start": 42,
        "length": 8,
        "content": "<em>someuser</em>"
      }
   ]
}

As can be seen, due to the additional data the format has to change from an array of strings to an array of objects.
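To illustrate why offsets would help with the annotation use case described above, here is a rough sketch of how a client might merge its own annotation spans with highlight spans once both are available as start/length pairs. Everything here is hypothetical: `render_spans` and the `(start, length, open_tag, close_tag)` tuple layout are invented for the example, spans are assumed to be properly nested or disjoint, and no HTML escaping is done.

```python
def render_spans(text, spans):
    """Wrap each (start, length, open_tag, close_tag) span of text in its tags.

    Spans must be disjoint or properly nested; partially overlapping
    spans would produce badly nested markup.
    """
    inserts = []
    for start, length, open_tag, close_tag in spans:
        end = start + length
        # Sort key: position first; at the same position close tags (0)
        # come before open tags (1). Outer (longer) spans open first and
        # inner (shorter) spans close first.
        inserts.append((start, 1, -length, open_tag))
        inserts.append((end, 0, length, close_tag))
    inserts.sort()

    out = []
    cursor = 0
    for pos, _, _, tag in inserts:
        out.append(text[cursor:pos])
        out.append(tag)
        cursor = pos
    out.append(text[cursor:])
    return "".join(out)


text = "Contains a http://link.to.somewhere/ and @someuser someone"
spans = [
    # my own annotations
    (11, 25, '<a href="http://link.to.somewhere/">', "</a>"),
    (41, 9, '<a href="https://some.where/profile/231321131">', "</a>"),
    # highlight offsets as they might be returned for some*
    (42, 8, "<em>", "</em>"),
    (51, 7, "<em>", "</em>"),
]
print(render_spans(text, spans))
```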

This was just quickly off the top of my head. I bet I forgot to think about a ton of other information :-)

arcodergh commented 9 years ago

+1

chongzhe commented 9 years ago

It's been more than a year since the issue was opened... is there any update on this?

nik9000 commented 9 years ago

I just cracked open eclipse to implement it in the experimental highlighter plugin. But now I'm playing with my kids.

nik9000 commented 9 years ago

And patch proposed: https://gerrit.wikimedia.org/r/#/c/209956

If you are willing to use the plugin then that might be enough for you.

I think @mfn can get what he wants from it using the none fragmenter. One thing, though, is that Elasticsearch limits me to returning text as the result of the highlight request so I can't make fancy json and you'll have to break out the Splitters.

mfn commented 9 years ago

Wow, thanks for the effort.

Elasticsearch limits me to returning text as the result of the highlight request so I can't make fancy json

Sad to hear, that's a bit of a bummer though. I'm trying to deduce a possible format from the test case result, could it look like this?

0:0-5,18-22:22

{
  "start_offset": 0,
  "end_offset": 22,
  "hits": [
    {
      "start_offset": 0,
      "end_offset": 5
    },
    {
      "start_offset": 18,
      "end_offset": 22
    }
  ]
}

23:33-37:37

{
  "start_offset": 23,
  "end_offset": 37,
  "hits": [
    {
      "start_offset": 33,
      "end_offset": 37
    }
  ]
}
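If the text format really is `start:hitStart-hitEnd,...:end`, as guessed in this comment from the test case, parsing it on the client might look like the sketch below. The layout is only that guess and is not confirmed against the plugin; `parse_snippet` is a made-up name.

```python
def parse_snippet(encoded):
    """Parse a string like '0:0-5,18-22:22' into a dict.

    Assumed layout: '<snippet start>:<hit start>-<hit end>,...:<snippet end>'.
    """
    start, hits_part, end = encoded.split(":")
    hits = []
    for hit in hits_part.split(","):
        if not hit:
            continue  # tolerate a snippet without any hits
        hit_start, hit_end = (int(x) for x in hit.split("-"))
        hits.append({"start_offset": hit_start, "end_offset": hit_end})
    return {
        "start_offset": int(start),
        "end_offset": int(end),
        "hits": hits,
    }


print(parse_snippet("0:0-5,18-22:22"))
print(parse_snippet("23:33-37:37"))
```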

I understand it closely resembles the Snippet class, but some remarks:

I've worked with highlighting and always found it easier to work with the positive approach, i.e. start_offset + length = end_offset, instead of the negative approach of end_offset - start_offset = length. In other words, I prefer having the length instead of the end, simply because most string APIs I work with take lengths.
I think it would also be necessary to know which content was actually highlighted, which doesn't seem to be available (see https://github.com/elastic/elasticsearch/issues/5736#issuecomment-98389774 )

Well, nothing really of value to add I guess, still glad someone's igniting the fire :-)

icode commented 7 years ago

+1 How can this be used? Is it not fixed yet?

hugo53 commented 7 years ago

Wonder why this issue has remained open for more than 3 years. @javanna Any milestone for this helpful feature bro?

nik9000 commented 7 years ago

Highlighting is just not at the top of anyone's list. And highlighting is complicated because there are 4 highlighters so every feature needs to be implemented four times. One day, someone is going to start really caring about highlighting again and will probably go and do this. As you can see from the issue history, highlighting used to be a big deal to me, now I have other things I spend my days working on.

tahirahmad2030 commented 6 years ago

Any updates on this issue?

sebasao commented 6 years ago

+1 this would be very helpful

jimczi commented 6 years ago

cc @elastic/es-search-aggs

ankitsul commented 6 years ago

+1 any updates? We want to avoid running our own custom highlighter on top of ES results to avoid an increase in time complexity. Hence, would be great if ES can provide it out of the box.

nik9000 commented 6 years ago

We talked about it in #29631 but didn't come to a good conclusion. I think we'll spend more time on this soon though.

stefanobranco commented 5 years ago

Any updates on this? Our users could profit a lot from knowing where in a document their hits were generated. We'll probably look into implementing our own solution otherwise, but that's a bit unfortunate, since it seems like all the info is already there on the Elastic side.

fhaase2 commented 4 years ago

Any updates?

dinamic commented 4 years ago

A possible workaround could be to decorate the text with a text marker, which can later be removed. We use this approach for storing metadata on a per-word basis and it has so far been decent.

helmersl commented 4 years ago

Any updates on implementing this as a feature in elasticsearch itself? I'm working with the plugin but this means I always have to set up ES manually...

TheGreatestAlan commented 4 years ago

Any updates on this? Can we bump this up in priority?

jsteggink commented 4 years ago

+1

byronvoorbach commented 4 years ago

+1

iamsinghrajat commented 4 years ago

+1

kazykenov commented 4 years ago

+1

DutchDave commented 3 years ago

+1

Artuur-Oerlemans commented 3 years ago

+1

Vineeth-fw commented 3 years ago

+1

NMcCloud commented 3 years ago

+1

kerryjj commented 3 years ago

+1

iremsha commented 3 years ago

+1

ontorder commented 3 years ago

+1

sebpretzer commented 3 years ago

+1

b-zurg commented 3 years ago

+1

rajamurugesan commented 3 years ago

+1

d33vil commented 3 years ago

+1

gcy0926 commented 3 years ago

+1

izogfif commented 2 years ago

+1

edloginova commented 2 years ago

+1

peter-lang-dealogic commented 2 years ago

+1

chkalch commented 2 years ago

+1

yankarinRG commented 2 years ago

+1

amansrivastava17 commented 2 years ago

+1

shierote commented 1 year ago

+1

blenzi commented 1 year ago

+1

jiange17 commented 1 year ago

+1

miroslav-chandler commented 1 year ago

+1

evicentepred commented 1 year ago

+1

It would be nice to have this feature. In our particular use case, we will need postprocessing based on searching the snippet text in the full property field (avoiding possible repetitions, mismatches, etc.). This is especially 'frustrating' when the documentation itself shows that this info is available and even used to calculate the highlighted words. It would be nice to have an optional config parameter to just return this offset info too.

The last part of this highlighting documentation section:

For our example, we have a single passage with the following properties (showing only a subset of the properties here):

Passage:
    startOffset: 147
    endOffset: 189
    score: 3.7158387
    matchStarts: [159, 164]
    matchEnds: [163, 167]
    numMatches: 2
Notice how a passage has a score, calculated using the BM25 scoring formula adapted for passages. Scores allow us to choose the best scoring passages if there are more passages available than the requested by the user number_of_fragments. Scores also let us to sort passages by order: "score" if requested by the user.

As the final step, the unified highlighter will extract from the field’s text a string corresponding to each passage:

"I'll be the only fox in the world for you."
and will format with the tags <em> and </em> all matches in this string using the passages’s matchStarts and matchEnds information:

I'll be the <em>only</em> <em>fox</em> in the world for you.
This kind of formatted strings are the final result of the highlighter returned to the user.
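The final formatting step quoted above can be mimicked in a few lines, which shows how little is needed once the offsets are known. This is only an illustrative sketch and not the highlighter's actual code: `format_passage` is a made-up helper, and the field text is padded with filler because only the passage itself is quoted here.

```python
def format_passage(field_text, start_offset, end_offset,
                   match_starts, match_ends,
                   pre_tag="<em>", post_tag="</em>"):
    """Cut one passage out of field_text and wrap every match in tags.

    All offsets are absolute positions within field_text.
    """
    out = []
    cursor = start_offset
    for m_start, m_end in zip(match_starts, match_ends):
        out.append(field_text[cursor:m_start])  # text before the match
        out.append(pre_tag + field_text[m_start:m_end] + post_tag)
        cursor = m_end
    out.append(field_text[cursor:end_offset])   # text after the last match
    return "".join(out)


# Filler stands in for the 147 characters that precede the passage.
field_text = "x" * 147 + "I'll be the only fox in the world for you."
print(format_passage(field_text, 147, 189, [159, 164], [163, 167]))
```

With the passage properties from the documentation this reproduces the formatted string shown there, so returning `matchStarts` and `matchEnds` alongside the snippet would carry the same information in a machine-readable form.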

Anyway, thank you for the ES ecosystem. It is just a suggestion for an a priori small change that leads to an old (2014) and very useful feature for many people, as seen in all the previous comments.

legistek commented 1 year ago

+1 Even SQLite can do this.

davidpetrov commented 1 year ago

+1 This feature would be very helpful for all use cases where you want to find the exact highlighted text. In my case, for example, I need to find the preceding and succeeding n tokens around each highlight. This would be significantly easier if offsets were returned, instead of me re-parsing with complex logic to find the tags (which is especially hard for match phrase queries with multiple tokens and multiple matches in a single document, also because of this bug: https://github.com/elastic/elasticsearch/issues/29561).
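With offsets available, the context lookup described here becomes short. The sketch below assumes plain whitespace tokenization, which will not match the tokens of most ES analyzers, and a highlight that starts and ends on token boundaries; `context_tokens` is a hypothetical helper.

```python
def context_tokens(text, start, end, n):
    """Return the n whitespace tokens before and after text[start:end]."""
    # Guard n == 0 explicitly: a slice of [-0:] would return every token.
    before = text[:start].split()[-n:] if n > 0 else []
    after = text[end:].split()[:n]
    return before, after


text = "I'll be the only fox in the world for you."
# Highlight "fox" at characters 17..20
print(context_tokens(text, 17, 20, 2))
```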

RaulKite commented 7 months ago

+1

floschne commented 6 months ago

+1

elastic / elasticsearch

[Feature Request] : get the offset of a highlighted field #5736