Closed BramVanroy closed 2 years ago
I don't think the fallback to feats
was an oversight because v2 frequently had combined tag+morph tags stored in tag, so this was intended to be similar. And to be clear, spacy
itself never copies pos
to tag
as a default. It's only a convenience thing added to the provided trained pipelines.
But given the morphologizer support in v3 I think it would make sense to change this.
Sure, I don't know the implementation choices that were taken.
I do not expect copying to be the default indeed, but my question was mainly with the discrepancy between what spaCy does if Token.tag is None (use pos) vs what spacy-stanza does (use feats). A colleague was experiencing "strange behavior that they could not explain" (which might not be that strange after all if that was the implementation goal anyway), so here I am. 😄
We found that the behavior of
spacy_stanza
is different than the default spaCy behavior.spaCy
As described here:
That seems a sensible approach. In the Spanish model
es_core_news_sm
, that means thattoken.pos_ == token.tag_
.spacy-stanza
The behavior in
spacy-stanza
is different, however, and does not feel right. Here, whenxpos
is not available (tag_
),feats
(morph
) is used. That is unexpected and different from spaCy. Because the Spanish stanza modeles
producesNone
forxpos
(tag_
), allxpos
values becomefeats
(morph
) instead ofupos
(pos
) as is the case in spaCy.Here is an example:
The offending line that does that is here.
https://github.com/explosion/spacy-stanza/blob/a87c723a1f45b456c3488e13afb0090362016bf9/spacy_stanza/tokenizer.py#L146
My guess is that this is an oversight and not a deliberate choice. If so, I can do a quick PR and change the line to