explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License
722 stars 60 forks source link

Streamline behavior when xpos/tag is None #76

Closed BramVanroy closed 2 years ago

BramVanroy commented 2 years ago

We found that the behavior of spacy_stanza is different than the default spaCy behavior.

spaCy

As described here:

The attribute_ruler maps token.tag to token.pos if there is no morphologizer ... and copies token.pos to token.tag if there is no tagger

That seems a sensible approach. In the Spanish model es_core_news_sm, that means that token.pos_ == token.tag_.

spacy-stanza

The behavior in spacy-stanza is different, however, and does not feel right. Here, when xpos is not available (tag_), feats (morph) is used. That is unexpected and different from spaCy. Because the Spanish stanza model es produces None for xpos (tag_), all xpos values become feats (morph) instead of upos (pos) as is the case in spaCy.

Here is an example:

import spacy_stanza

if __name__ == '__main__':
    text = "Además se le pediría a las empresas interesadas en prestar el servicio que se hagan cargo de la señalización y la cartelería que contiene información para los usuarios."

    nlp = nlp = spacy_stanza.load_pipeline("es")
    doc = nlp(text)
    for token in doc:
        print(token.text, token.pos_, token.tag_, token.morph)

The offending line that does that is here.

https://github.com/explosion/spacy-stanza/blob/a87c723a1f45b456c3488e13afb0090362016bf9/spacy_stanza/tokenizer.py#L146

My guess is that this is an oversight and not a deliberate choice. If so, I can do a quick PR and change the line to

tags.append(token.xpos or token.upos or "")
adrianeboyd commented 2 years ago

I don't think the fallback to feats was an oversight because v2 frequently had combined tag+morph tags stored in tag, so this was intended to be similar. And to be clear, spacy itself never copies pos to tag as a default. It's only a convenience thing added to the provided trained pipelines.

But given the morphologizer support in v3 I think it would make sense to change this.

BramVanroy commented 2 years ago

Sure, I don't know the implementation choices that were taken.

I do not expect copying to be the default indeed, but my question was mainly with the discrepancy between what spaCy does if Token.tag is None (use pos) vs what spacy-stanza does (use feats). A colleague was experiencing "strange behavior that they could not explain" (which might not be that strange after all if that was the implementation goal anyway), so here I am. 😄