explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

ValueError: [E167] Unknown morphological feature: 'Person' for Polish #46

Closed mpsota closed 4 years ago

mpsota commented 4 years ago

I've successfully run the spacy-stanza example for English. However, I can't get it working with Polish:

import stanza
from spacy_stanza import StanzaLanguage
stanza.download('pl')
snlp = stanza.Pipeline(lang='pl')
nlp = StanzaLanguage(snlp)
doc = nlp('Proste zdanie') # "Simple sentence"

The above works, but many other sentences fail:

doc = nlp('To jest błąd') # "This is an error"
Traceback (most recent call last):
..s/spacy_stanza/language.py", line 205, in __call__
    doc = Doc(self.vocab, words=words, spaces=spaces).from_array(attrs, array)
  File "doc.pyx", line 830, in spacy.tokens.doc.Doc.from_array
  File "morphology.pyx", line 286, in spacy.morphology.Morphology.assign_tag
  File "morphology.pyx", line 315, in spacy.morphology.Morphology.assign_tag_id
  File "morphology.pyx", line 203, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

Is this because there is no "NER" processor for Polish in Stanza? Is there an easy fix to make it work?

adrianeboyd commented 4 years ago

Hmm, this is a problem with the default tag map for Polish. It wasn't validated properly because we never trained a model using this tag set. The XPOS tags from the stanza model trigger some of the invalid mappings in an intermediate step, even though those values end up being overridden with the XPOS and UPOS tags from the stanza model anyway.

I think the simplest workaround is to modify the tag map in the model to remove all the mappings:

import stanza
from spacy_stanza import StanzaLanguage
snlp = stanza.Pipeline(lang='pl')
nlp = StanzaLanguage(snlp)

# remove all mappings
for tag, attrs in nlp.vocab.morphology.tag_map.items():
    nlp.vocab.morphology.tag_map[tag] = {}

doc = nlp('To jest błąd')

You can also make changes in the default tag map (in spacy/lang/pl/tag_map.py) and install spacy from source, but that is probably more work than the solution above.

If you do want to fix the tag map for spacy v2, you need to know that it requires a slightly unusual encoding of Person values (as the strings one/two/three instead of 1/2/3), but this restriction is going to be removed in spacy v3, so it's not worth putting much effort into it here. I'll try to validate all the tag maps for the next patch release of v2.3 so people don't run into weird behavior like this.
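
For anyone who does want to try the re-encoding, here is a minimal sketch of the idea on a plain dictionary. It is not taken from spaCy or spacy-stanza: the helper name `encode_person_for_v2`, the toy tag names, and the exact key and value spelling ("Person", lowercase one/two/three) are assumptions for illustration, so check them against the real tag map in spacy/lang/pl/tag_map.py before relying on it.

```python
# Hypothetical helper, not part of spaCy or spacy-stanza.
# Assumption: Person values appear under the key "Person" as 1/2/3
# (string or int) and should be spelled out as one/two/three.
PERSON_SPELLED_OUT = {"1": "one", "2": "two", "3": "three"}


def encode_person_for_v2(tag_map):
    """Return a copy of tag_map with Person values 1/2/3 spelled out."""
    fixed = {}
    for tag, attrs in tag_map.items():
        new_attrs = dict(attrs)  # copy so the input is left untouched
        if "Person" in new_attrs:
            key = str(new_attrs["Person"])
            # Values that are not 1/2/3 are kept unchanged.
            new_attrs["Person"] = PERSON_SPELLED_OUT.get(key, new_attrs["Person"])
        fixed[tag] = new_attrs
    return fixed


# Toy entries for illustration only; the real Polish tag map differs.
toy_tag_map = {
    "tag_a": {"pos": "VERB", "Person": "3"},
    "tag_b": {"pos": "NOUN"},
}
fixed = encode_person_for_v2(toy_tag_map)
print(fixed["tag_a"]["Person"])
```

The function returns a new dictionary rather than editing in place, so the original mapping stays available for comparison.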

mpsota commented 4 years ago

Thank you. I agree that fixing the tag map is not worth the effort; the workaround is fine for me!