vitojph opened this issue 4 years ago
The issue is the multi-word token expansion of `des` to `de les`, which throws off the character-based entity spans. A spaCy `Doc` is only able to represent one layer of token segmentation (not both `des` and `de les` in the same `Doc`), so to prioritize the POS tags and dependency annotation, the `Doc` returned here modifies the original text to use the expanded tokens instead of the original words. (To be clear, this goes against spaCy's normal non-destructive tokenization principle, but it makes things simpler for the purposes of this wrapper.)
The entity annotation returned by stanza is based on character offsets in the original text, which can't be aligned with the expanded tokens, at least not without a lot of effort.
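To illustrate the mismatch described above, here is a minimal sketch (with a made-up French sentence, not taken from the thread) of why character offsets computed on the original text stop lining up once a multi-word token is expanded:

```python
# Hypothetical example of the offset mismatch caused by multi-word expansion.
original = "Je parle des amis."     # stanza's NER offsets refer to this text
expanded = "Je parle de les amis."  # the Doc's text after "des" -> "de les"

# An entity over "amis" in the original text, as character offsets:
start = original.index("amis")
end = start + len("amis")
assert original[start:end] == "amis"

# The same offsets no longer point at "amis" in the expanded text, because
# the expansion shifted everything after "des" by three characters:
assert expanded[start:end] != "amis"
```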
We've added some more informative warnings in #27, which should be in the next release (v0.2.3, I think).
Hey, I got the same error message when working with spaCy/spacy_stanza/CoreNLP and I found a possible solution. I'll post it here since this is the first result when googling the error.
The problem between stanza/CoreNLP and spaCy is the mismatch in tokenization; it's really difficult to map the different tokenizations onto each other. The trick is to run the stanza tokenization first (via `CoreNLPClient`) and extract the words and the start of each sentence (when working with documents containing several sentences).
Then you can create a spaCy `Doc` object and pass it to the spaCy pipeline like this: `nlp(Doc(nlp.vocab, words=words, sent_starts=sent_starts, ents=entities))`
I haven't tried this yet, but I think you can also extract the entities from the stanza/CoreNLP result and pass them to the `Doc` object (see above). You have to create the `Span`s for the entities yourself, though.
Edit: Alternatively, you can create rules for the spaCy tokenizer, but that would be really tedious.
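The approach above can be sketched as follows. This is a minimal illustration, assuming `words` and `sent_starts` have already been extracted from a `CoreNLPClient` result (the example tokens and entity indices here are made up):

```python
import spacy
from spacy.tokens import Doc, Span

# Stand-ins for the word list and sentence starts extracted from stanza/CoreNLP:
words = ["Barack", "Obama", "was", "born", "in", "Hawaii", ".",
         "He", "was", "president", "."]
sent_starts = [True, False, False, False, False, False, False,
               True, False, False, False]

nlp = spacy.blank("en")  # or whichever language you are working with

# Build a Doc directly from the pre-tokenized words, bypassing spaCy's tokenizer:
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)

# Entities from the stanza/CoreNLP result have to be rebuilt as token-based
# Spans on this Doc (token indices, not character offsets):
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 5, 6, label="GPE")]

print([(ent.text, ent.label_) for ent in doc.ents])
# In spaCy v3 the pipeline also accepts a pre-built Doc, e.g. processed = nlp(doc)
```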
My solution above works for most languages (German, English, etc.), but when using a language that spaCy does not have a vocab for, it refuses to run named entity recognition and other processing steps (see issue #82).
I found another workaround that seems to work just fine: use `CoreNLPClient` to tokenize as described before, but this time just join the words and call the pipeline like this:
nlp = spacy_stanza.load_pipeline("xx", lang=self.lang,
                                 processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True)
result = nlp(" ".join(words))
Indeed. `du` -> `de le` in French, `del` -> `de el` in Spanish, etc.!
This needs a workaround for Arabic; it still occasionally fails with all the "workarounds" mentioned in the issues.
Hi everyone,
I just found a problem when trying to analyze a French sentence. When I run the following code:
I get this error:
Analyzing the same text with the default French model in spaCy, I get almost the same tokens: take a look at the final full stop.
Is anyone having the same issues?