TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Error running model for Romanian #10

Closed lucky-bai closed 4 years ago

lucky-bai commented 4 years ago

When running spacy-udpipe for Romanian, I get the following error:

Traceback (most recent call last):
  File "a.py", line 4, in <module>
    doc = nlp("Ce mai faci?")
  File "/scratch/gobi1/bai/bai-conda/lib/python3.7/site-packages/spacy/language.py", line 427, in __call__
    doc = self.make_doc(text)
  File "/scratch/gobi1/bai/bai-conda/lib/python3.7/site-packages/spacy/language.py", line 453, in make_doc
    return self.tokenizer(text)
  File "/scratch/gobi1/bai/bai-conda/lib/python3.7/site-packages/spacy_udpipe/language.py", line 148, in __call__
    spaces=spaces).from_array(attrs, array)
  File "doc.pyx", line 806, in spacy.tokens.doc.Doc.from_array
  File "morphology.pyx", line 283, in spacy.morphology.Morphology.assign_tag
  File "morphology.pyx", line 312, in spacy.morphology.Morphology.assign_tag_id
  File "morphology.pyx", line 200, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Case' (8245304235865630608). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

Code to reproduce:

import spacy_udpipe

nlp = spacy_udpipe.load('ro')
doc = nlp("Ce mai faci?")
for token in doc:
    print(token.text, token.pos_)

The model is working for most languages, but Polish fails with the same error.

I am using the latest version of spacy (2.2.3) and spacy-udpipe (0.1.0).

asajatovic commented 4 years ago

Thanks for reporting this. After some digging through the code, I am confident this happens because of the way the tag maps for Romanian and Polish are defined. For the code snippet you provided, the morphological feature "Case" is extracted from "Pw3--r", the XPOS (language-specific part-of-speech tag) of the word Ce. As "Case" is not among the supported FEATURES of the Morphology class, an exception is raised. The same problem occurs for the word Ce with the features "Person" and "PronType", and for the word faci, whose XPOS "Vmip2s" maps to "Person", which again is not in FEATURES. You can access the xpostag attribute if you process the text using the 'raw' UDPipe model (nlp.udpipe(text)).

Since this library is only a wrapper for the UDPipe models and as tag maps are specific to each language, to solve the issue(s), I suggest you update the tag maps for the problematic languages. A good start would be https://spacy.io/usage/adding-languages#tag-map and making sure the tag map features are compliant with the ones defined in spaCy. :smile:

lucky-bai commented 4 years ago

Thanks, I'll take a look. In the meantime, is there anything that can be done in this project so that it at least fails more gracefully? I'm only looking to use it as a part-of-speech tagger and don't need to extract the case markings, but it fails to run at all. Maybe it would be better to ignore unrecognized morphological features rather than crashing.

asajatovic commented 4 years ago

I've enabled a quick fix in #11. After some discussion, I am fairly confident this should remain in a separate branch (as the underlying issue is in spaCy). For now, you can use pip install git+https://github.com/TakeLab/spacy-udpipe.git@feature/soft-morph-fail to install the quick-fix version.
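
For illustration, the "ignore rather than crash" idea discussed above can be sketched in plain Python. This is not the code from #11 and not spaCy's API: the function name, the SUPPORTED set, and the feature values in the example are all hypothetical, chosen only to show the filtering step.

```python
# Illustrative sketch only: not the code from #11 and not spaCy's API.
# SUPPORTED is a tiny, hypothetical stand-in for spaCy's FEATURES list.
SUPPORTED = {"Number_plur", "Tense_pres", "VerbForm_fin"}


def drop_unknown_features(props, supported=SUPPORTED):
    """Split a tag map entry into (kept, dropped).

    `kept` holds the POS entry plus every recognized feature;
    `dropped` lists the keys that were not recognized.
    """
    kept, dropped = {}, []
    for key, value in props.items():
        if key == "POS" or key in supported:
            kept[key] = value
        else:
            dropped.append(key)
    return kept, dropped


# Feature values here are made up; only the keys come from the discussion.
kept, dropped = drop_unknown_features(
    {"POS": "PRON", "Case": "Nom", "Person": "three", "PronType": "Int"}
)
print(kept)     # the POS entry survives
print(dropped)  # keys that would otherwise trigger E167
```

With a filter like this, POS tagging keeps working and only the unrecognized morphological information is lost.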

rahonalab commented 4 years ago

Hi! I don't know whether this is related, but I cannot print out morphological features for Italian. I have tried both the standard isdt model and the vit model.

I have also tried tag_map:

>>> nlp = spacy_udpipe.load("it")
>>> for token in nlp("Il bello di questo mestiere è che ti fa crescere."): nlp.vocab.morphology.tag_map[token.tag_]
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'RD'

The function works with other languages, for instance English:

>>> nlp = spacy_udpipe.load("en")
>>> for token in nlp("Dogs are friendly."): nlp.vocab.morphology.tag_map[token.tag_]
... 
{74: 92, 'Number_plur': True}
{74: 100, 'Tense_pres': True, 'VerbForm_fin': True}
{74: 84, 'Degree_pos': True}
{74: 97, 'PunctType_peri': True}

but fails for others too, for instance, Croatian:

>>> nlp = spacy_udpipe.load("hr")
>>> for token in nlp("Magdalena već godinama radi u Državnom Restauratorskom Zavodu."): nlp.vocab.morphology.tag_map[token.tag_]
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Npfsn'

I am using the latest version of spacy (2.2.3) and spacy-udpipe (0.1.0), both with the soft-morph-fail fix and without.

asajatovic commented 4 years ago

@rahonalab Hi! The reason it does not work is the tag map for the Italian language. As for Croatian, a tag map does not yet exist in spaCy. Both issues are inherent to spaCy, and if you want to use morphological features, the tag map for a specific language should be updated in the spaCy repo. For more details see https://spacy.io/usage/adding-languages#tag-map. All of this will be documented with some workarounds in a new spacy-udpipe release which is currently WIP. :smile: Edit: You can now install the latest package version (with the mentioned update ^) directly from the master branch!

rahonalab commented 4 years ago

Hello @asajatovic and hvala :pray: for your quick response :-) As far as I understand, the two Italian models as well as the Croatian one don't have the morphological features, right? The link you sent me explains how to add the tag map to an existing model, so I'd probably have to write the whole set of morphological features for Italian to get it to work. But I thought there was already a set of morphological features, since the KeyError contains something...

asajatovic commented 4 years ago

@rahonalab You are welcome! :) You are right, morphological features already exist for Italian; however, spaCy recently changed the (language-agnostic) values in morphological FEATURES. The keys for TAG_MAP from tag_map.py should map exactly from and to morphological FEATURES. Regarding Italian, you should ideally only update the TAG_MAP, whereas for Croatian it can only be done from scratch (no existing TAG_MAP). Also, the TAG_MAP for a specific language is and should be independent of any model for the same language.
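
As a rough aid for updating a tag map, a check like the following could list the entries whose features would not be recognized. It assumes feature names take the `Name_value` form seen in the English output earlier in the thread (for example 'Number_plur'). SUPPORTED and TAG_MAP below are small made-up samples, not the real contents of spaCy's FEATURES or of any language's tag_map.py, and audit_tag_map is a hypothetical helper.

```python
# Illustrative sketch: SUPPORTED and TAG_MAP are small, made-up samples,
# not the real contents of spaCy's FEATURES or of any tag_map.py.
SUPPORTED = {"Number_plur", "Gender_fem", "Poss_yes", "PronType_prs"}

TAG_MAP = {
    "AP": {"POS": "DET", "Number": "Plur", "Gender": "Fem"},
    # The tag is taken from the thread; the feature values are invented.
    "Pw3--r": {"POS": "PRON", "Case": "Dir", "Person": "three"},
}


def audit_tag_map(tag_map, supported):
    """Map each tag to the feature names that are not in `supported`."""
    problems = {}
    for tag, props in tag_map.items():
        unknown = []
        for key, value in props.items():
            if key == "POS":
                continue
            # Assumed naming scheme: "Name_value", with the value lowercased.
            name = key if value is True else f"{key}_{str(value).lower()}"
            if name not in supported:
                unknown.append(name)
        if unknown:
            problems[tag] = unknown
    return problems


problems = audit_tag_map(TAG_MAP, SUPPORTED)
print(problems)
```

Entries reported by such a check are the ones to revisit in the language's TAG_MAP.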

rahonalab commented 4 years ago

Thank you, now I start to understand something :-) The Italian tag_map which is currently employed in the UD model has numbers in place of POS:XPOS

nlp.vocab.morphology.tag_map
{'AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs': {74: 90},

whereas the Italian spacy 2.2.4 has:

(/usr/local/lib/python3.7/site-packages/spacy/lang/it)

TAG_MAP = {
    "AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs": {POS: DET},

I saw your workaround to stop importing the 'wrong' TAG_MAP:

nlp = spacy_udpipe.load("it",ignore_tag_map=True)

Why don't you include an option to automatically import the tagmap from spacy?

asajatovic commented 4 years ago

If available, a language-specific TAG_MAP is automatically loaded for every spacy-udpipe and spaCy language model. Keep in mind that the TAG_MAP is defined in spaCy, specifically for each language, and is loaded only from spaCy.

The workaround is simply there to enable proper POS tagging by ignoring morphological features if they are outdated (in other words, if the TAG_MAP values don't exactly match FEATURES values).

I hope this clears the confusion! :)

Edit: Regarding the numbers in place of XPOS:POS, that is fine as this also happens when you load a 'pure' spaCy model.

asajatovic commented 4 years ago

Closing this issue as it is fixed in #12.