aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Cannot reconstruct position in original sentence from Chunk.start #101

Open vlejd opened 7 years ago

vlejd commented 7 years ago

I want to highlight all entities in raw text. I use:

    from polyglot.text import Text
    text = "Some not-so-well formated, text -with John Smith."
    entities = Text(text).entities
    smith = entities[0]

This correctly finds John Smith, but smith.start is 11. How am I supposed to translate that into a position in the original text? For clean text it is the index of the token where the mention starts. For text containing other non-letter characters it is something strange.
Maybe change it to the index of the first character of the mention in the original sentence?

ramonankersmit commented 6 years ago

Any progress on this issue? I'm also looking for a correct way of retrieving the start and stop index of an entity within the original "raw" string. Any suggestions on how to do that?

ramonankersmit commented 6 years ago

I think I have found an answer (maybe not the best one, but it works for now):

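    from polyglot.text import Text

    # text1 is the original raw string
    ptext1 = Text(text1)
    prevIndex = 0  # where the search for the next entity starts
    for sent in ptext1.sentences:
        for entity in sent.entities:
            print(entity.tag, entity, entity.start, entity.end)
            # entity[0] is the first token of the entity; find it in the raw text
            currentIndex = ptext1.index(entity[0], prevIndex)
            print('startindex={}, endindex={}'.format(currentIndex, currentIndex+len(entity[0])))
            prevIndex = currentIndex+len(entity[0])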

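This gives the character index at which each entity starts in the original string. Note that the end index is the end of the entity's first token only (entity[0]), so for a multi-token entity such as John Smith the remaining tokens still have to be located to get the full span.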
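
Since the snippet above measures the end from entity[0] alone, here is a minimal sketch of the same search extended to every token of the entity. It uses only plain Python strings, so it does not depend on polyglot; the helper name entity_char_span and its search_from parameter are made up for illustration, and it assumes each token appears verbatim in the raw text and that a Chunk can be passed as a list of token strings.

```python
def entity_char_span(text, tokens, search_from=0):
    """Return (start, end) character offsets of `tokens` in `text`.

    `tokens` is the entity as a list of strings; `end` is exclusive,
    so text[start:end] is the mention as it appears in the raw text.
    """
    start = text.index(tokens[0], search_from)
    end = start + len(tokens[0])
    for token in tokens[1:]:
        # Each later token is searched from the end of the previous one,
        # which skips whatever whitespace or punctuation separates them.
        end = text.index(token, end) + len(token)
    return start, end


text = "Some not-so-well formated, text -with John Smith."
start, end = entity_char_span(text, ["John", "Smith"])
print(start, end, repr(text[start:end]))  # 38 48 'John Smith'
```

As in the loop above, passing the end of the previous entity as search_from keeps repeated names from all resolving to the first occurrence. str.index matches substrings, so a token like "John" can still hit inside "Johnson"; treat this as a starting point rather than a robust aligner.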