Offset misalignment in NER using the Stanza tokenizer for French

vitojph commented 4 years ago

Hi everyone,

I just found a problem when trying to analyze a French sentence. When I run the following code:

import stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="fr", verbose=False)
stanzanlp = StanzaLanguage(snlp)

text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
doc = stanzanlp(text)

I get this warning:

/home/victor/miniconda3/envs/nlp/lib/python3.7/site-packages/ipykernel_launcher.py:4: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ["C'", 'est', "l'", 'un', 'de', 'les', 'grands', 'messages', 'passés', 'par', 'Bruno', 'Le', 'Maire', ',', 'ce', 'matin', 'sur', 'RTL.']
Entities: [('Bruno Le Maire', 'PER', 42, 56), ('RTL.', 'ORG', 71, 75)]
  after removing the cwd from sys.path.

Analyzing the same text with the default French model in spaCy, I get almost the same tokens, but take a look at the final period, which spaCy splits off from RTL:

doc = spacynlp(text)

for token in doc:
    print(token.text, token.idx)

for ent in doc.ents:
    print(ent.text, ent.label_)

C' 0
est 2
l' 6
un 8
des 11
grands 15
messages 22
passés 31
par 38
Bruno 42
Le 48
Maire 51
, 56
ce 58
matin 61
sur 67
RTL 71
. 74
Bruno Le Maire PER
RTL ORG

Is anyone having the same issues?

adrianeboyd commented 4 years ago

The issue is the multi-word token expansion of "des" to "de les", which throws off the character-based entity spans. A spaCy Doc is only able to represent one layer of token segmentation (not both "des" and "de les" in the same Doc), so to prioritize the POS tags and dependency annotation, the Doc returned here modifies the original text to use the expanded tokens instead of the original words. (To be clear, this goes against spaCy's normal non-destructive tokenization principles, but it makes things simpler for the purposes of this wrapper.)

The entity annotation returned by stanza is based on character offsets in the original text, which can't be aligned with the expanded tokens, at least not without a lot of effort.

We've added some more informative warnings in #27, which should be in the next release (v0.2.3, I think).
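
To make the misalignment concrete, here is a minimal pure-Python sketch that uses only the sentence, the words and the entity offsets from the warning above. It assumes the rewritten Doc text differs from the original only in "des" becoming "de les"; the replace call merely mimics that effect and is not the wrapper's actual code, and token_spans is a hypothetical helper introduced for illustration.

```python
text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."

# Words and entities copied from the warning in the original report.
words = ["C'", "est", "l'", "un", "de", "les", "grands", "messages", "passés",
         "par", "Bruno", "Le", "Maire", ",", "ce", "matin", "sur", "RTL."]
entities = [("Bruno Le Maire", "PER", 42, 56), ("RTL.", "ORG", 71, 75)]

# The entity offsets index into the ORIGINAL text, so slicing it recovers them.
original_slices = [text[start:end] for _, _, start, end in entities]

# Assumption: mimic the expanded Doc text by rewriting "des" as "de les".
expanded = text.replace(" des ", " de les ", 1)
shifted_slices = [expanded[start:end] for _, _, start, end in entities]


def token_spans(doc_text, tokens):
    """Hypothetical helper: (start, end) of each token, scanning left to right."""
    spans, pos = [], 0
    for tok in tokens:
        start = doc_text.index(tok, pos)  # ValueError if a token is missing
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


spans = token_spans(expanded, words)
starts = {s for s, _ in spans}
ends = {e for _, e in spans}

# An entity can only be attached if both offsets fall on token boundaries.
aligned = [start in starts and end in ends for _, _, start, end in entities]

print(original_slices)
print(shifted_slices)
print(aligned)
```

Under these assumptions, every character after "des" shifts by three positions, so the original offsets cut through tokens in the expanded text and neither entity lands on a token boundary.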

bablf commented 2 years ago

Hey, I got the same error message when working with spaCy/spacy_stanza/CoreNLP, and I found a possible solution. I'm posting it here since this is the first result when googling the error.

The problem between stanza/CoreNLP and spaCy is the mismatch in tokenization. It's really difficult to map the different tokenizations onto each other. The trick is to run the stanza tokenization first (via CoreNLPClient) and extract the words and, for documents containing several sentences, the start of each sentence.

Then you can create a spaCy Doc object and pass it to the spaCy pipeline like this:

nlp(Doc(nlp.vocab, words=words, sent_starts=sent_starts, ents=entities))

I haven't tried this yet but I think you can also extract the entities from the stanza/CoreNLP result and pass them to the Doc object (see above). But you have to create the Spans for the Entities yourself.

Edit: Alternatively you can create rules for the spaCy-tokenizer but that would be really tedious.
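
The step of creating the entity annotation yourself can be sketched without any NLP library. The example below is a hedged illustration, not the method used by spacy-stanza or CoreNLP: token_offsets and char_ents_to_iob are hypothetical helpers, and the token list is an assumed surface tokenization (with "des" left unexpanded and "RTL." kept as one token, to match the entity offsets in the warning). It converts character-offset entities into one IOB tag per token.

```python
def token_offsets(text, tokens):
    """Hypothetical helper: locate each surface token, scanning left to right."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # ValueError if a token is not in the text
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


def char_ents_to_iob(spans, entities):
    """Hypothetical helper: map (text, label, start, end) entities to IOB tags."""
    tags = ["O"] * len(spans)
    for _ent_text, label, ent_start, ent_end in entities:
        covered = [i for i, (s, e) in enumerate(spans)
                   if s >= ent_start and e <= ent_end]
        # Skip entities whose offsets do not coincide with token boundaries.
        if (not covered
                or spans[covered[0]][0] != ent_start
                or spans[covered[-1]][1] != ent_end):
            continue
        tags[covered[0]] = "B-" + label
        for i in covered[1:]:
            tags[i] = "I-" + label
    return tags


text = "C'est l'un des grands messages passés par Bruno Le Maire, ce matin sur RTL."
# Assumed surface tokens: "des" is NOT expanded, "RTL." stays one token.
tokens = ["C'", "est", "l'", "un", "des", "grands", "messages", "passés", "par",
          "Bruno", "Le", "Maire", ",", "ce", "matin", "sur", "RTL."]
entities = [("Bruno Le Maire", "PER", 42, 56), ("RTL.", "ORG", 71, 75)]

iob_tags = char_ents_to_iob(token_offsets(text, tokens), entities)
print(list(zip(tokens, iob_tags)))
```

As far as I know, the spaCy v3 Doc constructor accepts such a list of IOB strings (one per word) for its ents argument, so iob_tags could be passed as ents= together with words=tokens. Note that the alignment only works here because the tokens match the original text; with the expanded words ("de", "les") the token search would fail.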

bablf commented 2 years ago

My solution above works for most languages (German, English, etc.), but for a language that spaCy has no vocab for, named entity recognition and the other processing steps do not seem to run (see issue #82).

I found another workaround that seems to work just fine. Use CoreNLPClient to tokenize as described before, but this time just join the words and then call the Pipeline like this:

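import spacy_stanza
# self.lang is the language code held by the enclosing class (e.g. "fr")
nlp = spacy_stanza.load_pipeline("xx", lang=self.lang,
                                 processors='tokenize, pos, lemma, depparse, ner',
                                 use_gpu=True)
# words: the tokens obtained from CoreNLPClient, as described before
result = nlp(" ".join(words))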
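
One consequence of joining the words with single spaces is that any character offsets in the result refer to the joined string, not to the original text. The sketch below is only an illustration of that bookkeeping (joined_offsets is a hypothetical helper, not part of spacy_stanza): it returns the joined text together with each word's position in it.

```python
def joined_offsets(words):
    """Hypothetical helper: space-join words and return each word's (start, end)."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1  # +1 for the single space added by " ".join
    return " ".join(words), spans


words = ["un", "de", "les", "grands", "messages"]
joined, spans = joined_offsets(words)

# Every span slices back to exactly the word it was computed for.
for word, (start, end) in zip(words, spans):
    print(word, start, end, joined[start:end] == word)
```

If the original offsets are needed later, the spans returned here would have to be mapped back to the source text separately.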

AlexanderPoone commented 1 year ago

Indeed. The same happens with du -> de le in French, del -> de el in Spanish, etc.

AlexanderPoone commented 1 year ago

This needs a workaround for Arabic. It still occasionally fails with all of the workarounds mentioned in the issues.

explosion / spacy-stanza

Offset misalignment in NER using the Stanza tokenizer for French #32