explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Access to named entities index at token level #38

Closed capucincapucine closed 4 years ago

capucincapucine commented 4 years ago

I was wondering whether it is possible to access named entity indices at the token level. For example, for "Barack Obama was born in Hawaii.": NE = Barack Obama, NE_start: 0, NE_end: 2. I'm working on a project and need the start and end index of each named entity in a given sentence. spaCy provides entity indices at the token level (but does not provide named entity recognition at the sentence level), while Stanza provides named entity recognition at the sentence level (but does not provide entity indices at the token level), so I'm not happy with either of them. I was able to work my way through using the id attribute of token objects in Stanza, but I'm stuck when a named entity is made up of more than one token. Thank you in advance.

ines commented 4 years ago

If you're using spacy-stanza, the named entities predicted by the Stanza model are translated to spaCy's data structures. So entity spans are reflected in doc.ents and at the token level, just like in spaCy. If you need the token start and end (ent.end is exclusive, so a two-token entity at the start of the doc gives 0 and 2), you could do:

print([(ent.text, ent.start, ent.end) for ent in doc.ents])
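For anyone who wants to stay with per-token tags instead of doc.ents, here is a minimal sketch of how multi-token entities could be grouped into start/end token indices. It assumes BIOES-style tags (B-, I-, E-, S-, O), which is the scheme Stanza uses for its token-level ner attribute. The function name bioes_to_spans is hypothetical, and the tokens and tags below are hand-written for illustration, not actual model output.

```python
def bioes_to_spans(tokens, tags):
    """Group per-token BIOES tags into (text, label, start, end) tuples.

    start is inclusive and end is exclusive, the same convention as
    spaCy's ent.start / ent.end. Hypothetical helper, not part of any library.
    """
    spans = []
    start = None  # token index where the currently open entity began
    label = None  # label of the currently open entity
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            spans.append((tokens[i], kind, i, i + 1))
            start = None
        elif prefix == "B":
            # Beginning of a multi-token entity
            start, label = i, kind
        elif prefix == "E" and start is not None and kind == label:
            # End of a multi-token entity: emit the whole span
            spans.append((" ".join(tokens[start:i + 1]), kind, start, i + 1))
            start = None
        elif prefix == "O":
            # Outside any entity: drop any unfinished span
            start = None
        # An "I" tag simply continues the open entity (no label check in this sketch)
    return spans


# Hand-written example input (assumed tags, for illustration only)
tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
tags = ["B-PERSON", "E-PERSON", "O", "O", "O", "S-GPE", "O"]
spans = bioes_to_spans(tokens, tags)
print(spans)
```

The indices here are relative to the token list you pass in, so feeding one sentence at a time gives sentence-relative positions.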