explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License
723 stars 59 forks source link

Fallback to upos instead of to feats #77

Closed BramVanroy closed 2 years ago

BramVanroy commented 2 years ago

This PR makes sure that the fallback for tag_ in spacy-stanza is the same as in spaCy. Currently this is dealt with differently in spacy-stanza compared to spaCy. In spaCy, the attributeruler copies pos to tag in case tag is not defined. However, in spacy_stanza, the morphological features are used instead. That is not desirable, I think. As a user you would expect the same behavior in spaCy and spacy_stanza.

More info and examples in the linked issue https://github.com/explosion/spacy-stanza/issues/76

For reference, spacy-udpipe does not have a fallback, although if you agree with the current PR I may ask TakeLab whether they want to streamline their behavior as well.

Warning: this PR does result in a breaking change, potentially resulting in different Token.tag_ values than before

Added a test for Spanish to illustrate the issue, which currently does not have xpos.

closes #76

BramVanroy commented 2 years ago

The test that is failing is not due to this PR. It fails from the English language dependency labels. Probably due to new stanza models in newer versions. The right labels on stanza 1.2.3 are

['discourse', 'root', 'punct', 'nsubj', 'cop', 'det', 'root', 'punct']

AFAICT stanza models are linked to the stanza version and models may change over patches, e.g. a 1.2.3 model may be different from 1.2.2. I can't test ALL different stanza models, so I am not sure how to deal with this. Pin a very specific stanza version in setup? I'm not a big fan of that.

Maybe we can just use the test data from their repo? Or from a sentence from the UD treebanks.

adrianeboyd commented 2 years ago

Try merging upstream/master into this?

adrianeboyd commented 2 years ago

Sorry, one more thing (and I can do it if you want): can you format the tests with black?

adrianeboyd commented 2 years ago

The tests are more of a sanity check than real unit tests (which is also part of why they're not packaged with the module like in our other packages). Updates here are rare enough that I'll continue to go with the lazy option of just updating the tests as necessary.

BramVanroy commented 2 years ago

That makes sense! Most functionality comes down to copying stanza attributes into spaCy objects. Useful, of course, but not a lot that can go wrong with that.

adrianeboyd commented 2 years ago

Thanks for the PR!