explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Is Spacy given priority over Stanford for same language model? #3

Closed hammad26 closed 5 years ago

hammad26 commented 5 years ago

Using stanfordnlp directly, the lemma results for an input are:

import stanfordnlp

# Lemmatize with the plain stanfordnlp pipeline
snlp_en = stanfordnlp.Pipeline(lang="en")
doc = snlp_en("He was a better batsman")
for sentence in doc.sentences:
    for token in sentence.tokens:
        for word in token.words:
            print(word.text, "\t\t", word.lemma)
He          he
was         be
a           a
better      better
batsman     batsman

Now, using the latest wrapper provided by spaCy, spacy-stanfordnlp, I get the following results.

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

# Wrap the same stanfordnlp pipeline in the spaCy interface
snlp_en = stanfordnlp.Pipeline(lang="en")
nlp_en = StanfordNLPLanguage(snlp_en)
doc = nlp_en("He was a better batsman")
for token in doc:
    print(token.text, "\t\t", token.lemma_)
He          -PRON-
was         be
a           a
better      well
batsman     batsman

So it looks like spaCy is given priority (you can see this for the words "he" and "better"). When a language has models in both spaCy and Stanford, how will the results be produced? Can you provide full details on how the linguistic features are affected in this case?

ines commented 5 years ago

Thanks for the report! Even though the Doc object is constructed from an array that includes the lemmas, those lemmas seem to get overwritten internally by the English lookup table (which ships with the language data). So this is probably a bug in spaCy.

Maybe this wrapper should have an option to disable spaCy's underlying language data. It's nice if you want to use stuff like token.like_num, but it can also cause side effects like this. If the wrapper constructed a blank Language class instead, this wouldn't happen.
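For illustration, a rough sketch of that distinction (the variable names are just examples, not the wrapper's actual code): the base Language class carries no English-specific lookup tables, while the English subclass ships the lemma lookups and rules that can override externally supplied lemmas.

from spacy.language import Language
from spacy.lang.en import English

blank_nlp = Language()  # base class: no English lemma lookups or rules
en_nlp = English()      # ships English language data, including lemma lookups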

Edit: I've added a separate step that sets the lemmas last. It turns out they were automatically overwritten when the POS tags were added, based on spaCy's lemma rules, so we now set them afterwards to prevent this and keep the predicted lemmas.
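As a rough sketch of the idea (not the actual spacy-stanfordnlp code; the tags and lemmas here are just example values): build the Doc, set the POS tags first, then write the predicted lemmas last so spaCy's rule-based lemmatization can't overwrite them.

from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
words = ["He", "was", "a", "better", "batsman"]
tags = ["PRP", "VBD", "DT", "JJR", "NN"]         # example fine-grained tags
lemmas = ["he", "be", "a", "better", "batsman"]  # lemmas predicted by StanfordNLP

doc = Doc(nlp.vocab, words=words)
for token, tag in zip(doc, tags):
    token.tag_ = tag      # setting the tag can trigger spaCy's lemma rules
for token, lemma in zip(doc, lemmas):
    token.lemma_ = lemma  # setting the lemmas last keeps the predicted values

print([(t.text, t.lemma_) for t in doc])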

When a language has models in both spaCy and Stanford, how will the results be produced?

If you're using this wrapper, you're not using spaCy's models, so you won't see any of spaCy's predictions. The pipeline will also be empty, so none of spaCy's components that predict something will be run (and they couldn't be run, because no model weights are loaded).
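A quick way to check this (assuming nlp_en is the StanfordNLPLanguage object from the snippet above):

print(nlp_en.pipe_names)  # expected to be an empty list: no spaCy components run

All annotations on the Doc come from the StanfordNLP pipeline; spaCy only supplies the Doc/Token data structures (plus the language data mentioned above).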