Closed lucky-bai closed 4 years ago
Thanks for reporting this. After some code digging, I am confident this happens because of the way the tag maps for Romanian and Polish are defined. For the code snippet you provided, a morphology feature "Case" is extracted from "Pw3--r", the XPOS (language-specific part-of-speech tag) of the word Ce. As "Case" is not in the supported FEATURES for the Morphology class (see this and this), an exception occurs. The same problem happens again for the word Ce with the features "Person" and "PronType". An equivalent thing occurs for the word faci, whose XPOS value "Vmip2s" maps to "Person", which again is not in FEATURES (link). You can access the xpostag attribute if you process the text using the 'raw' UDPipe model (nlp.udpipe(text)).
Since this library is only a wrapper for the UDPipe models and as tag maps are specific to each language, to solve the issue(s), I suggest you update the tag maps for the problematic languages. A good start would be https://spacy.io/usage/adding-languages#tag-map and making sure the tag map features are compliant with the ones defined in spaCy. :smile:
Thanks, I'll take a look. In the meantime, is there nothing that can be done on this project, at least fail more gracefully? For me, I'm only looking to use it as a part-of-speech tagger and I don't need to extract the case markings, but it fails to run at all. Maybe it would be better to ignore unrecognized morphological features rather than crashing.
I've enabled a quick fix in #11. After some discussion, I am fairly confident this should remain in a separate branch (as the underlying issue is in spaCy). For now, you can use
pip install git+https://github.com/TakeLab/spacy-udpipe.git@feature/soft-morph-fail
to install the quick-fix version.
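For readers wondering what "failing softly" could look like, here is a minimal sketch of the idea suggested above: drop unrecognized morphological features instead of raising. This is not the code from #11; the function name `drop_unsupported` and the contents of `SUPPORTED_FEATURES` are hypothetical stand-ins, and the real set of accepted features lives in spaCy.

```python
# Hypothetical stand-in for the feature names the Morphology class accepts.
# The real list is defined in spaCy and is not reproduced here.
SUPPORTED_FEATURES = {"Number_plur", "Tense_pres", "VerbForm_fin", "Degree_pos"}


def drop_unsupported(features, supported=SUPPORTED_FEATURES):
    """Split a feature dict into (kept, dropped) instead of raising on unknown keys."""
    kept = {key: value for key, value in features.items() if key in supported}
    # key=str keeps sorting stable even if integer keys are mixed in.
    dropped = sorted((key for key in features if key not in supported), key=str)
    return kept, dropped


# Placeholder values; only the keys matter for this illustration.
kept, dropped = drop_unsupported({"Number_plur": True, "Case": True, "Person": True})
print(kept)
print(dropped)
```

The dropped names are returned rather than silently discarded, so a caller could log them and still continue with POS tagging.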
Hi! I don't know whether this is related, but I cannot print out morphological features for Italian. I have tried both the standard isdt model and the vit model.
I have also tried tag_map:
>>> nlp = spacy_udpipe.load("it")
>>> for token in nlp("Il bello di questo mestiere è che ti fa crescere."): nlp.vocab.morphology.tag_map[token.tag_]
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'RD'
The function works with other languages, for instance English:
>>> nlp = spacy_udpipe.load("en")
>>> for token in nlp("Dogs are friendly."): nlp.vocab.morphology.tag_map[token.tag_]
...
{74: 92, 'Number_plur': True}
{74: 100, 'Tense_pres': True, 'VerbForm_fin': True}
{74: 84, 'Degree_pos': True}
{74: 97, 'PunctType_peri': True}
but fails for others too, for instance, Croatian:
>>> nlp = spacy_udpipe.load("hr")
>>> for token in nlp("Magdalena već godinama radi u Državnom Restauratorskom Zavodu."): nlp.vocab.morphology.tag_map[token.tag_]
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'Npfsn'
I am using the latest version of spacy (2.2.3) and spacy-udpipe (0.1.0), both with the soft-morph-fail fix and without.
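As a side note, the KeyError above comes from indexing the tag map with a tag it does not contain. A minimal sketch of a lookup that tolerates missing tags, assuming the tag map behaves like a regular dict (which the KeyError suggests); `features_for` and `toy_tag_map` are hypothetical names used only for illustration:

```python
def features_for(tag, tag_map):
    """Return the tag map entry for `tag`, or an empty dict if the tag is unknown."""
    return tag_map.get(tag, {})


# Toy stand-in for nlp.vocab.morphology.tag_map; the real mapping comes from spaCy.
toy_tag_map = {"NNS": {"Number_plur": True}}

print(features_for("NNS", toy_tag_map))  # known tag: its entry is returned
print(features_for("RD", toy_tag_map))   # unknown tag: empty dict, no KeyError
```

This only avoids the crash in the loop; it does not make the missing features available.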
@rahonalab Hi! It does not work because of the tag map for the Italian language (link).
Regarding the tag map for Croatian in spaCy, it doesn't yet exist.
Both are inherently related to spaCy and if you want to use morphological features, the tag map for a specific language should be updated in the spaCy repo. For more details see https://spacy.io/usage/adding-languages#tag-map.
All of this will be documented with some workarounds in a new spacy-udpipe release which is currently WIP. :smile:
Edit: You can now install the latest package version (with the mentioned update ^) directly from the master branch!
Hello @asajatovic and hvala :pray: for your quick response :-) As far as I understand, the two Italian models as well as the Croatian one don't have the morphological features, right? The link you sent me explains how to add the tag map to an existing model, so I'd probably have to write the whole set of morphological features for Italian to get it to work. But I thought there was already a set of morphological features, since the KeyError contains something...
@rahonalab You are welcome! :)
You are right, there already exist morphological features for Italian; however, spaCy recently changed the (language-agnostic) values in morphological FEATURES. The keys for TAG_MAP from tag_map.py should map exactly from and to morphological FEATURES. Regarding Italian, you should ideally only update the TAG_MAP, whereas for Croatian it can only be done from scratch (no existing TAG_MAP).
Also, the TAG_MAP for a specific language is and should be independent of any model for the same language.
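To find which entries of a tag map need updating, one could audit it against the supported features. The sketch below is only an illustration: `find_unsupported` is a hypothetical helper, the toy tag map reuses the Romanian XPOS tags mentioned earlier in this thread with placeholder values, and the `supported` set is an assumption rather than spaCy's actual FEATURES.

```python
def find_unsupported(tag_map, supported):
    """Map each tag to the sorted list of its feature keys not present in `supported`."""
    problems = {}
    for tag, features in tag_map.items():
        bad = sorted((key for key in features if key not in supported), key=str)
        if bad:
            problems[tag] = bad
    return problems


# Toy data: the keys mirror the features named in this thread, values are placeholders.
toy_tag_map = {
    "Pw3--r": {"pos": True, "Case": True, "Person": True, "PronType": True},
    "Vmip2s": {"pos": True, "Person": True},
    "OK_TAG": {"pos": True, "Number_plur": True},
}
supported = {"pos", "Number_plur"}  # assumed subset, not the real FEATURES

report = find_unsupported(toy_tag_map, supported)
for tag, bad in report.items():
    print(tag, bad)
```

Tags that are absent from the report only use supported features and would not need changes.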
Thank you, now I start to understand something :-) The Italian tag_map which is currently employed in the UD model has numbers in place of POS:XPOS
nlp.vocab.morphology.tag_map
{'AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs': {74: 90},
whereas the Italian spacy 2.2.4 has:
(/usr/local/lib/python3.7/site-packages/spacy/lang/it)
TAG_MAP = {
"AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs": {POS: DET},
I saw your workaround to stop importing the 'wrong' TAG_MAP:
nlp = spacy_udpipe.load("it", ignore_tag_map=True)
Why don't you include an option to automatically import the tagmap from spacy?
If available, a language-specific TAG_MAP is automatically loaded for every spacy-udpipe and spaCy language model. Keep in mind that TAG_MAP is defined in spaCy, specifically for each language, and is loaded only from spaCy.
The workaround is simply there to enable proper POS tagging by ignoring morphological features if they are outdated (in other words, if the TAG_MAP values don't exactly match FEATURES values).
I hope this clears the confusion! :)
Edit: Regarding the numbers in place of XPOS:POS, that is fine as this also happens when you load a 'pure' spaCy model.
Closing this issue as it is fixed in #12.
When running spacy-udpipe for Romanian, I get the following error:
Code to reproduce:
The model is working for most languages, but Polish fails with the same error.
I am using the latest version of spacy (2.2.3) and spacy-udpipe (0.1.0).