Closed mcswell closed 4 years ago
This is a bug in spaCy: it doesn't allow numerical feature values, which the SynTagRus dataset used for training contains ("Person=1", "Person=2", "Person=3"). I have a version with this tag changed (to "Person=first" etc.) that works correctly with the 2.2 branch. I'll prepare and upload it early next week. You can also make this change in the dataset and train it yourself in several hours (see Makefile). I'm just preparing a version with vectors properly integrated, which should improve the resulting POS and DEP quality a little bit.
And btw the latest version is 2.2.4 https://pypi.org/project/spacy/#history :)
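For anyone who wants to try the dataset change mentioned above before the fixed model is uploaded, here is a minimal sketch. It assumes the training data is in CoNLL-U form (ten tab-separated columns, FEATS in the sixth). The helper name `rewrite_person` is hypothetical, and only "first" appears in this thread; "second" and "third" are guesses at the rest of the mapping.

```python
# Assumed mapping: only "first" is confirmed in the thread;
# "second" and "third" are illustrative guesses.
PERSON_NAMES = {"1": "first", "2": "second", "3": "third"}


def rewrite_person(line: str) -> str:
    """Rewrite Person=1/2/3 in the FEATS column of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # blank lines and comment lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        return line  # not a standard 10-column token line; leave it alone
    feats = []
    for feat in fields[5].split("|"):
        name, sep, value = feat.partition("=")
        if name == "Person" and value in PERSON_NAMES:
            value = PERSON_NAMES[value]
        feats.append(name + sep + value)
    fields[5] = "|".join(feats)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Applying it to every line of the training files and then retraining would be the next step; whether the Makefile expects exactly this input format is something to check in the repository.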
Thank you for the quick reply! I don't have a GPU (at least not one that works for ML), so I guess I'll wait until next week.
And I wish I could speak Russian like you do English :-)
On a related topic: I notice that spacy-ru (at least the version I have) converts things that I think are acronyms into their lower case equivalents. For example, СССР becomes ссср. I think acronyms should remain upper case for downstream processing--at least I wouldn't expect the English 'NASA' to be returned by an English lemmatizer as 'nasa'. Of course I don't know Russian...
When I've worked with other languages and I want to avoid lower casing acronyms, I do a regex search that looks for upper case letters after the first letter (since an upper case first letter could just be due to sentence capitalization).
A similar issue happens with tokens that contain token-internal numbers; often these are chemical names, like H2O or O2 (the '2' might be subscripted using something like LaTeX, or it could just be the Unicode subscript '2', U+2082).
The regex (using the regex library, not the re library) to match an upper case letter or digit is:
rxAcronym = regex.compile(r"[\p{Uppercase_letter}\p{Digit}]")
(no '|' inside the brackets, since within a character class it would match a literal pipe character). I do a search on all but the first character of a token; if the search does not find an upper case letter or a digit, then I go ahead and lower-case the token:
if not rxAcronym.search(sToken[1:]): sToken = sToken.lower()
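The same check can be sketched without the third-party regex library, using only Python's built-in string methods. This is a swapped-in equivalent of the idea, not the code I actually run, and the function name `lower_unless_acronym` is made up for illustration.

```python
def lower_unless_acronym(token: str) -> str:
    """Lower-case a token unless a character after the first one is
    an upper case letter or a digit (likely acronym or formula)."""
    # Skip the first character: it may be upper case only because
    # the token starts a sentence.
    if any(ch.isupper() or ch.isdigit() for ch in token[1:]):
        return token
    return token.lower()
```

With this, СССР, NASA and H2O are left alone, while a sentence-initial word such as Москва is lower-cased. Python's `str.isdigit()` also returns True for the Unicode subscript '2' (U+2082), so O₂ is preserved as well.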
Would it make sense to do this for Russian? I suspect it would be done in the spacy-ru code in the file lemmatizer.py, probably in one or more of the places where that code now contains string.lower()
Mike Maxwell
Oh, you're right. spaCy has some re-capitalization for the lemmas, so I will need to do the same in the Russian version. Thanks for pointing this out; somehow I missed it completely. Please note that in spaCy this behavior is inconsistent and depends on whether the POS tagger was used, etc. How it works: each token has a shape flag (token.shape), which can be Xxx, XXX, xxx and so on, and which is then used to restore the capitalization. Only very rare words are capitalized like spaCy; their lemmas will be updated to whatever the shape shows for them.
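To illustrate the shape idea described above, here is a rough sketch. It is not spaCy's implementation: `simple_shape` and `recapitalize` are hypothetical helpers, the shape is computed without any truncation of repeated characters, and only the all-upper-case case (XXX) is restored, since a single leading capital may just be sentence capitalization.

```python
def simple_shape(text: str) -> str:
    """Map upper case letters to 'X', lower case to 'x', digits to 'd'."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation and other characters are kept
    return "".join(out)


def recapitalize(lemma: str, original: str) -> str:
    """Upper-case the lemma if the original token was all upper case."""
    letters = [c for c in simple_shape(original) if c in "Xx"]
    # Require at least two letters so a one-letter token is not
    # treated as an acronym.
    if len(letters) > 1 and all(c == "X" for c in letters):
        return lemma.upper()
    return lemma
```

So a lemma "ссср" produced for the token СССР would be turned back into СССР, while the lemma for a capitalized sentence-initial word stays lower case.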
Will "ru2" work well with version 2.3.0?
I've installed spacy v2.3.0:
>>> spacy.__version__
'2.3.0'
When I load the existing version of ru2 using
nlp = spacy.load(<localFile>)
I get a warning that
Model 'ru_model' (0.2) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.0).
And when I try to use nlp(<RussianSentence>), I get the error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/spacy/language.py", line 446, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "pipes.pyx", line 398, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 443, in spacy.pipeline.pipes.Tagger.set_annotations
  File "morphology.pyx", line 315, in spacy.morphology.Morphology.assign_tag_id
  File "morphology.pyx", line 203, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218). This can happen if the tagger was trained with a different set of morphological features. If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate
So it looks like the answer is no.
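A quick way to spot this kind of mismatch before running any text through the pipeline is to compare the major.minor part of the installed version with the one the model says it requires. The helper below is a hypothetical sketch of that comparison, not how spaCy itself performs its compatibility check.

```python
def same_minor(installed: str, required: str) -> bool:
    """Return True if two version strings share the same major.minor part."""
    installed_parts = installed.lstrip("v").split(".")[:2]
    required_parts = required.lstrip("v").split(".")[:2]
    return installed_parts == required_parts
```

For the case above, `same_minor(spacy.__version__, "2.1")` would be False with 2.3.0 installed, which matches the warning.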
We'll have a version for Spacy 2.2 and Spacy 2.3 on Monday.
Looking forward to the 2.3 support!
I've just published a SynTagRus-based POS & DEP model for 2.3; a NER model and an MIT-licensed POS & DEP model will be published in several days. https://github.com/buriy/spacy-ru/releases/tag/v2.3_pre1
How to use it: unpack into your project root folder, then
import ru2_syntagrus
ru2_syntagrus.load_ru2('path_to/ru2_syntagrus')
Or you could just use spacy.load('path_to/ru2_syntagrus/'), but then lemmas will be a bit worse.
I have spacy v2.1.9 installed on one machine, and 2.2.3 (the current latest version) on another. I installed spacy-ru on both, but it only runs well on the 2.1.9 machine. On the 2.2.3 machine, when I do the doc=nlp(s) step (with s=Russian text), I get an error.
I guess I could build spacy-ru from source and maybe this would solve the problem, but I'm not sure I'm up to that. What I did instead was to uninstall version 2.2.3 of spacy, and install version 2.1.9 in its place, so now spacy-ru works on both machines.
But I'd rather be using the current version of spacy, which I use for a couple other languages as well. (Even better, I'd like spacy-ru to be immune to version changes in spacy, but I suppose that's asking a bit much :-).)
Is there a (simple) way to make spacy-ru compatible with v2.2 of spacy?