explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

NER & Parsing not working for new language #82

Closed. bablf closed this issue 2 years ago.

bablf commented 2 years ago

I am currently trying to import Stanza's NER and dependency parsing for Arabic into spaCy. As mentioned in a different issue, there seems to be a problem with "mwt" when importing the named entities into the spaCy object. The Arabic pipeline is no different, and I have to deal with the same problem.

To deal with this I thought of the following workaround:

nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                  use_gpu=True, tokenize_pretokenized=True)
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)

So far everything works fine.

But once I call the nlp pipeline, the returned object has no entities and has_annotation is also False. There are no error messages, so I don't know what I am doing wrong. It seems like the stanza pipeline is not even called, and it isn't because of tokenize_pretokenized either.

Are error messages just missing, and is this the same problem as #32?

Minimal working example (without entities). Translation is: "I am hungry. I am going home."

import stanza
import spacy_stanza
from spacy.tokens import Doc

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                  use_gpu=True, tokenize_pretokenized=True)
words = ['انا', 'جائع', '.', 'أنا', 'ذاهب', 'إلى', 'المنزل']
sent_starts = [True, False, False, True, False, False, False]
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
nlp(doc).has_annotation("DEP")
bablf commented 2 years ago

tokenize_pretokenized and calling the pipeline with a Doc turned out to be the issue; see the quoted issue for the solution. The only thing that still did not work was sentence splitting. Sentence splitting also works, though, if you follow this input format and set the tokenize_pretokenized option (see the sketch after the example):

'This is token.ization done my way!\nSentence split, too!'
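As a minimal sketch of that format (an illustration, not from the original comment): with tokenize_pretokenized=True, whitespace separates tokens and newlines separate sentences, so the words and sent_starts from the example above can be turned into a single string:

import stanza
import spacy_stanza

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                 tokenize_pretokenized=True)

words = ['انا', 'جائع', '.', 'أنا', 'ذاهب', 'إلى', 'المنزل']
sent_starts = [True, False, False, True, False, False, False]

# Build the pretokenized input: spaces between tokens, a newline before each new sentence.
text = ""
for word, is_start in zip(words, sent_starts):
    if text:
        text += "\n" if is_start else " "
    text += word

doc = nlp(text)
print([sent.text for sent in doc.sents])    # two sentences
print(doc.has_annotation("DEP"), doc.ents)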

adrianeboyd commented 2 years ago

In case someone comes across this in the future:

The issue is that the whole stanza pipeline is integrated as the tokenizer in the spacy pipeline (which is a bit unexpected) and you're not running the tokenizer when you call:

doc = Doc(nlp.vocab, words=words)
doc = nlp(doc)
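One way to see this (an illustrative check, not part of the original answer): in spacy-stanza the whole stanza pipeline lives inside nlp.tokenizer and no further spaCy components are added, so calling nlp on a Doc you built yourself has nothing left to run:

import stanza
import spacy_stanza

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                 tokenize_pretokenized=True)

# The stanza pipeline is wrapped inside the spaCy tokenizer ...
print(type(nlp.tokenizer))
# ... and there are no downstream spaCy components that could add parses or entities.
print(nlp.pipe_names)   # expected to be an empty list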

Starting with a text, doc = nlp(text) does this:

doc = nlp.make_doc(text)
doc = nlp(doc)

With tokenize_pretokenized=True (which splits tokens on whitespace instead of running tokenize and mwt) and tokens from another source, you would want this to run the stanza pipeline on the tokens:

doc = nlp(" ".join(words))