`tokenize_pretokenized` and calling the pipeline with a `Doc` seemed to be the issue; see the quoted issue for the solution. Only sentence splitting did not work at first, but it too works if you follow this input format and add the `tokenize_pretokenized` option:
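For example, a minimal sketch of that setup (the `spacy_stanza.load_pipeline` call and the Arabic example words here are assumptions, shown only to illustrate the format):

```python
import spacy_stanza

# tokenize_pretokenized=True makes stanza split tokens on whitespace
# instead of running its tokenize/mwt processors.
nlp = spacy_stanza.load_pipeline("ar", tokenize_pretokenized=True)

# Tokens from another source, joined with spaces.
words = ["أنا", "جائع", "."]
doc = nlp(" ".join(words))
for sent in doc.sents:
    print([t.text for t in sent])
```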
In case someone comes across this in the future:
The issue is that the whole stanza pipeline is integrated as the tokenizer in the spaCy pipeline (which is a bit unexpected), and you're not running the tokenizer when you call:

```python
doc = Doc(nlp.vocab, words=words)
doc = nlp(doc)
```
Starting with a text, `doc = nlp(text)` does this:

```python
doc = nlp.make_doc(text)
doc = nlp(doc)
```
With `tokenize_pretokenized=True` (which splits tokens on whitespace instead of running `tokenize` and `mwt`) and tokens from another source, you would want this to run the stanza pipeline on the tokens:

```python
doc = nlp(" ".join(words))
```
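To illustrate the difference (a sketch; the `has_annotation` check assumes spaCy v3):

```python
from spacy.tokens import Doc

# Passing a pre-built Doc skips the tokenizer, i.e. the whole
# stanza pipeline, so no sentence or entity annotation is set.
doc = nlp(Doc(nlp.vocab, words=words))
print(doc.has_annotation("SENT_START"))  # False

# Joining the tokens runs the stanza pipeline; with
# tokenize_pretokenized=True the tokens are kept as-is.
doc = nlp(" ".join(words))
print(doc.has_annotation("SENT_START"))  # True
```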
I am currently trying to import stanza's NER and dependency parsing for Arabic into spaCy. As mentioned in a different issue, there seems to be a problem with `mwt` when importing the named entities into the spaCy object. The Arabic pipeline is no exception, and I have to deal with the same problem.
To deal with this, I thought of the following workaround:
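In essence (a minimal sketch; the real tokens come from another source):

```python
from spacy.tokens import Doc

# Tokens obtained from another source (illustrative).
words = ["أنا", "جائع", "."]

# Build the spaCy Doc directly from the pretokenized words.
doc = Doc(nlp.vocab, words=words)
```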
So far everything works fine.
But once I call the nlp pipeline, the returned object has no entities, and `has_annotation` is also `False`. There are no error messages, so I don't know what I am doing wrong, but it seems like the stanza pipeline is not even called. It isn't because of `tokenize_pretokenized` either. Are there just missing error messages, and is this the same problem as #32?
Minimal working example (without entities); the translation is: "I am hungry. I am going home."
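The example below is a sketch matching that description; the `load_pipeline` call and its defaults are assumptions, and the Arabic tokens are chosen to match the given translation:

```python
import spacy_stanza
from spacy.tokens import Doc

# Assumed setup: Arabic spacy-stanza pipeline
# (stanza.download("ar") may be needed beforehand).
nlp = spacy_stanza.load_pipeline("ar")

# "I am hungry. I am going home."
words = ["أنا", "جائع", ".", "أنا", "ذاهب", "إلى", "البيت", "."]

doc = Doc(nlp.vocab, words=words)
doc = nlp(doc)

# Both stay empty/False: nlp(doc) skips the tokenizer, which is
# where the whole stanza pipeline lives.
print(doc.ents)
print(doc.has_annotation("ENT_IOB"))
```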