explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Extra spaces cause token misalignment #30

Closed by maxmealy 4 years ago

maxmealy commented 4 years ago

If there are multiple whitespace characters between tokens, the tokenizer raises a warning and the entities are not set on the doc. It looks like Stanza does not treat the extra whitespace as a token, whereas spaCy does.

import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang='en')
nlp = StanzaLanguage(snlp)
# Note the two spaces between each pair of words.
text = "There  are  two  spaces  between  these  words"
doc = nlp(text)
# UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
# Words: ['There', 'are', 'two', 'spaces', 'between', 'these', 'words']
# Entities: [('two', 'CARDINAL', 12, 15)]
print(len(doc.ents))  # 0
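
A possible workaround, sketched here with the standard library only, is to collapse runs of whitespace before passing the text to the pipeline. The helper name `collapse_whitespace` is hypothetical and not part of spacy-stanza, and the snippet assumes the mismatch arises because the entity offsets (12, 15) index the original double-spaced text while the tokens end up separated by single spaces. It does not run the pipeline, so whether `doc.ents` is then populated is not verified here.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim the ends.

    Hypothetical helper for illustration; not part of spacy-stanza.
    """
    return re.sub(r"\s+", " ", text).strip()


text = "There  are  two  spaces  between  these  words"
clean = collapse_whitespace(text)

# The offsets (12, 15) from the warning index the original text.
print(repr(text[12:15]))   # 'two'
# In the single-spaced text the same offsets cut across token boundaries.
print(repr(clean[12:15]))  # 'o s'
# After normalising, "two" starts at offset 10.
print(clean.find("two"))   # 10
```

Calling `nlp(clean)` instead of `nlp(text)` would be the intended use. Any character offsets obtained that way refer to the cleaned string, not to the original input.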