buriy / spacy-ru

Russian language models for spaCy
MIT License

Incompatible with spacy v2.2.3? #18

Closed mcswell closed 4 years ago

mcswell commented 4 years ago

I have spacy v2.1.9 installed on one machine, and 2.2.3 (the current latest version) on another. I installed spacy-ru on both, but it only runs well on the 2.1.9 machine. On the 2.2.3 machine, when I do the doc=nlp(s) step (with s=Russian text), I get the error

doc=nlp(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 435, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "pipes.pyx", line 397, in spacy.pipeline.pipes.Tagger.__call__
File "pipes.pyx", line 442, in spacy.pipeline.pipes.Tagger.set_annotations
File "morphology.pyx", line 312, in spacy.morphology.Morphology.assign_tag_id
File "morphology.pyx", line 200, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218). 
This can happen if the tagger was trained with a different set of morphological features. 
If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

I guess I could build spacy-ru from source and maybe this would solve the problem, but I'm not sure I'm up to that. What I did instead was to uninstall version 2.2.3 of spacy, and install version 2.1.9 in its place, so now spacy-ru works on both machines.

But I'd rather be using the current version of spacy, which I use for a couple other languages as well. (Even better, I'd like spacy-ru to be immune to version changes in spacy, but I suppose that's asking a bit much :-).)

Is there a (simple) way to make spacy-ru compatible with v2.2 of spacy?

buriy commented 4 years ago

This is a bug in spaCy: it doesn't accept the numeric feature values in the SynTagRus dataset used for training ("Person=1", "Person=2", "Person=3"). I have a version with these values renamed (to "Person=first" etc.) that works correctly with the 2.2 branch. I'll prepare and upload it early next week. You can also make this change in the dataset and train the model yourself in several hours (see the Makefile). The only reason for the delay is that I'm preparing a version with vectors properly integrated, which should improve the resulting POS and DEP quality a little.
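The dataset change described above could look roughly like the sketch below. It assumes the training data is in CoNLL-U format (ten tab-separated columns, with morphological features in the sixth). Only "Person=first" is named in this thread; the names "second" and "third", and both helper functions, are hypothetical and are not taken from the spacy-ru Makefile.

```python
# Hypothetical mapping: only "first" appears in the thread; the rest is assumed.
PERSON_NAMES = {"1": "first", "2": "second", "3": "third"}


def rename_person(feats: str) -> str:
    """Rewrite numeric Person values in a CoNLL-U FEATS string."""
    if feats == "_":  # "_" marks an empty FEATS column
        return feats
    parts = []
    for feat in feats.split("|"):
        name, sep, value = feat.partition("=")
        if name == "Person" and value in PERSON_NAMES:
            value = PERSON_NAMES[value]
        parts.append(name + sep + value)
    return "|".join(parts)


def rewrite_conllu_line(line: str) -> str:
    """Apply rename_person to the FEATS column of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # blank lines and comment lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 10:  # only touch well-formed token lines
        cols[5] = rename_person(cols[5])
    return "\t".join(cols) + newline
```

Running rewrite_conllu_line over every line of the training, development and test files before conversion would apply the rename consistently; the model then has to be retrained on the rewritten data.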

buriy commented 4 years ago

And btw the latest version is 2.2.4 https://pypi.org/project/spacy/#history :)

mcswell commented 4 years ago

Thank you for the quick reply! I don't have a GPU (at least not one that works for ML), so I guess I'll wait until next week.

And I wish I could speak Russian like you do English :-)

mcswell commented 4 years ago

On a related topic: I notice that spacy-ru (at least the version I have) converts things that I think are acronyms into their lower case equivalents. For example, СССР becomes ссср. I think acronyms should remain upper case for downstream processing--at least I wouldn't expect the English 'NASA' to be returned by an English lemmatizer as 'nasa'. Of course I don't know Russian...

When I've worked with other languages and I want to avoid lower casing acronyms, I do a regex search that looks for upper case letters after the first letter (since an upper case first letter could just be due to sentence capitalization).

A similar issue happens with tokens that contain token-internal numbers; often these are chemical names, like H2O or O2 (the '2' might be subscripted using something like LaTeX, or it could just be the Unicode subscript '2', U+2082).

The regex (using the regex library, not the re library) to match an upper case letter or digit is:

rxAcronym = regex.compile(r"[\p{Uppercase_letter}\p{Digit}]")

(No | is needed inside the brackets; within a character class it would match a literal | character.) I then search all but the first character of a token. If the search does not find an upper case letter or a digit, I go ahead and lower-case the token:

if not rxAcronym.search(sToken[1:]):
    sToken = sToken.lower()
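For readers without the third-party regex library, the same heuristic can be sketched with the standard library alone. This swaps the regex search for str.isupper() and str.isdigit(), which may not classify every character exactly as the Unicode properties above do; the function name is hypothetical.

```python
def lowercase_unless_acronym(token: str) -> str:
    """Lower-case a token unless it looks like an acronym or formula.

    A token is left alone when any character after the first is an
    upper-case letter or a digit. The first character is skipped because
    it may be capitalized only due to sentence-initial position.
    """
    if any(ch.isupper() or ch.isdigit() for ch in token[1:]):
        return token
    return token.lower()


if __name__ == "__main__":
    for tok in ["СССР", "Москва", "NASA", "H2O", "The"]:
        print(tok, "->", lowercase_unless_acronym(tok))
```

Note that a capitalized proper noun such as Москва is still lower-cased by this rule, since only characters after the first are inspected.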

Would it make sense to do this for Russian? I suspect it would be done in the spacy-ru code in the file lemmatizer.py, probably in one or more of the places where that code now contains string.lower()

Mike Maxwell


buriy commented 4 years ago

Oh, you're right. spaCy does some re-capitalization of lemmas, so I will need to do the same in the Russian version. Thanks for pointing it out; somehow I missed it completely. Please note that in spaCy this behavior is inconsistent and depends on whether the POS tagger was used, etc. How it works: each token has a shape attribute (token.shape_), with values such as Xxx, XXX, xxx and so on, which is then used to restore the capitalization. Words with unusual internal capitalization, like spaCy, are very rare; their lemmas will be re-capitalized according to whatever the shape shows for them.
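As a rough illustration of the idea, restoring a lemma's capitalization from a shape string might look like the sketch below. This is not spaCy's or spacy-ru's actual code: word_shape here is a simplified stand-in for spaCy's own shape feature, and both function names and the two rules in restore_case are assumptions made for the example.

```python
def word_shape(text: str) -> str:
    """Simplified shape: X = upper-case, x = lower-case, d = digit."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation and other characters are kept
    return "".join(out)


def restore_case(lemma: str, shape: str) -> str:
    """Re-capitalize a lower-cased lemma according to the token's shape."""
    letters = [c for c in shape if c in "Xx"]
    # All letters upper-case (two or more): treat as an acronym.
    if len(letters) >= 2 and all(c == "X" for c in letters):
        return lemma.upper()
    # Initial capital only: capitalize the first character of the lemma.
    if shape.startswith("X"):
        return lemma[:1].upper() + lemma[1:]
    return lemma


if __name__ == "__main__":
    for token, lemma in [("СССР", "ссср"), ("Москвы", "москва"), ("шли", "идти")]:
        print(token, "->", restore_case(lemma, word_shape(token)))
```

A rule like this also capitalizes the lemma of an ordinary sentence-initial word, which is one way the inconsistency mentioned above can arise.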

lexmosolov commented 4 years ago

Will "ru2" work well with version 2.3.0?

mcswell commented 4 years ago

I've installed spacy v2.3.0:

>>> spacy.__version__
'2.3.0'

When I load the existing version of ru2 using nlp = spacy.load(<localFile>), I get this warning:

Model 'ru_model' (0.2) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.0).

And when I try to use nlp(<RussianSentence>), I get the error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/spacy/language.py", line 446, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "pipes.pyx", line 398, in spacy.pipeline.pipes.Tagger.__call__
File "pipes.pyx", line 443, in spacy.pipeline.pipes.Tagger.set_annotations
File "morphology.pyx", line 315, in spacy.morphology.Morphology.assign_tag_id
File "morphology.pyx", line 203, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218).
This can happen if the tagger was trained with a different set of morphological features.
If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

So it looks like the answer is no.

buriy commented 4 years ago

We'll have a version for Spacy 2.2 and Spacy 2.3 on Monday.

gonzagazzz commented 4 years ago

Looking forward to the 2.3 support!

buriy commented 4 years ago

I've just published a SynTagRus-based POS & DEP model for 2.3. A NER model and MIT-licensed POS & DEP models will be published in several days. https://github.com/buriy/spacy-ru/releases/tag/v2.3_pre1

How to use it: unpack into your project root folder, then

import ru2_syntagrus
ru2_syntagrus.load_ru2('path_to/ru2_syntagrus')

Or you could just use spacy.load('path_to/ru2_syntagrus/'), but then the lemmas will be a bit worse.