explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

NER & Parsing not working for new language #82

Closed. bablf closed this issue 2 years ago.

bablf commented 2 years ago

I am currently trying to import Stanza's NER and dependency parsing for Arabic into spaCy. As mentioned in a different issue, there seems to be a problem with "mwt" when importing the named entities into the spaCy object. The Arabic pipeline is no different, and I have to deal with the same problem.

To deal with this I thought of the following workaround:

nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                  use_gpu=True, tokenize_pretokenized=True)
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)

So far everything works fine.

But once I call the nlp pipeline, the returned object has no entities and has_annotation is also False. There are no error messages, so I don't know what I am doing wrong. It seems like the stanza pipeline is not even called, and it isn't because of tokenize_pretokenized either.

Are error messages just missing, and is this the same problem as #32?

Minimal working example (without entities). Translation is: "I am hungry. I am going home."

import stanza
import spacy_stanza
from spacy.tokens import Doc

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                  use_gpu=True, tokenize_pretokenized=True)
words = ['انا', 'جائع', '.', 'أنا', 'ذاهب', 'إلى', 'المنزل']
sent_starts = [True, False, False, True, False, False, False]
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
nlp(doc).has_annotation("DEP")
bablf commented 2 years ago

tokenize_pretokenized and calling the pipeline with a Doc turned out to be the issue; see the quoted issue for the solution. The only thing that still did not work was sentence splitting. Sentence splitting also works, though, if you follow this input format and set the tokenize_pretokenized option (see the sketch after the example):

'This is token.ization done my way!\nSentence split, too!'
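As a minimal sketch of that format (an illustration, not from the original comment): with tokenize_pretokenized=True, whitespace separates tokens and newlines separate sentences, so the words and sent_starts from the example above can be turned into a single string:

import stanza
import spacy_stanza

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                 tokenize_pretokenized=True)

words = ['انا', 'جائع', '.', 'أنا', 'ذاهب', 'إلى', 'المنزل']
sent_starts = [True, False, False, True, False, False, False]

# Build the pretokenized input: spaces between tokens, a newline before each new sentence.
text = ""
for word, is_start in zip(words, sent_starts):
    if text:
        text += "\n" if is_start else " "
    text += word

doc = nlp(text)
print([sent.text for sent in doc.sents])    # two sentences
print(doc.has_annotation("DEP"), doc.ents)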

adrianeboyd commented 2 years ago

In case someone comes across this in the future:

The issue is that the whole stanza pipeline is integrated as the tokenizer in the spacy pipeline (which is a bit unexpected) and you're not running the tokenizer when you call:

doc = Doc(nlp.vocab, words=words)
doc = nlp(doc)
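One way to see this (an illustrative check, not part of the original answer): in spacy-stanza the whole stanza pipeline lives inside nlp.tokenizer and no further spaCy components are added, so calling nlp on a Doc you built yourself has nothing left to run:

import stanza
import spacy_stanza

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors='tokenize, pos, lemma, depparse, ner',
                                 tokenize_pretokenized=True)

# The stanza pipeline is wrapped inside the spaCy tokenizer ...
print(type(nlp.tokenizer))
# ... and there are no downstream spaCy components that could add parses or entities.
print(nlp.pipe_names)   # expected to be an empty list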

Starting with a text, doc = nlp(text) does this:

doc = nlp.make_doc(text)
doc = nlp(doc)

With tokenize_pretokenized=True (which splits tokens on whitespace instead of running tokenize and mwt) and tokens from another source, you would want this to run the stanza pipeline on the tokens:

doc = nlp(" ".join(words))