explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Russian pos tagging/lemmatization/morphological analysis fails with diacritics #12530

Open mtak- opened 1 year ago

mtak- commented 1 year ago

It seems that while spaCy's tokenizer supports combining diacritics, lemmatization, morphological analysis, and POS tagging are incorrect when they are present.

How to reproduce the behaviour

import ru_core_news_lg

nlp = ru_core_news_lg.load()
doc = nlp('Я ви́жу му́жа и жену́')
print(doc[-1].pos_)    # PROPN (incorrect; it's just a noun)
print(doc[-1].lemma_)  # жену́ (incorrect; should be жена)
print(doc[-1].morph)   # prints nothing, which is obviously incorrect

If the text is changed to remove the diacritics, all is well:

import re

from spacy.lang.char_classes import COMBINING_DIACRITICS

diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
doc = nlp(diacritics_re.sub('', 'Я ви́жу му́жа и жену́'))

print(doc[-1].pos_)    # NOUN
print(doc[-1].lemma_)  # жена
print(doc[-1].morph)   # Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing

pymorphy3/pymorphy2 don't handle diacritics

It seems pymorphy3/pymorphy2 don't handle diacritics either, so perhaps the diacritics should be removed before parse is called:

import re

from spacy.lang.char_classes import COMBINING_DIACRITICS

diacritics_re = re.compile(f'[{COMBINING_DIACRITICS}]')
# token is the spaCy Token about to be lemmatized
text = diacritics_re.sub('', token.text)
adrianeboyd commented 1 year ago

Thanks for the note, we'll take a look!

adrianeboyd commented 1 year ago

The suggestion for the lemmatizer is included in #12554.

For the poor tagging etc. from the statistical models on tokens with diacritics, I think the best option would be to configure custom NORM, PREFIX, and SUFFIX features for ru and uk that strip diacritics. If you wanted to try this out with the current spacy release (v3.5), you could use a custom language to customize these methods, which are called lex_attr_getters in the language defaults:

https://spacy.io/usage/linguistic-features#language-subclass

The defaults would be extended similar to this:

https://github.com/explosion/spaCy/blob/8e6a3d58d8fa092eede0fe323441b2aaa3c2042e/spacy/lang/ru/__init__.py#L13-L23

mtak- commented 1 year ago

Wonderful! Thank you for the quick PR and suggestions.

I'm a noob when it comes to spaCy; I'm using it to generate tags on Anki flashcards for studying Russian. But if I understand you correctly, the model I use would need to be trained with diacritics. Is that correct (e.g. ru_core_news_lg will not work)?

I ask because I tried making a custom language and the results were still unsatisfactory (even with a patch similar to #12554).

import re

import spacy
from spacy import attrs
from spacy.lang.char_classes import COMBINING_DIACRITICS
from spacy.lang.ru import Russian

import ru_core_news_lg

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

def norm(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())

def prefix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[:1]

def suffix(s: str) -> str:
    return DIACRITICS_RE.sub('', s.lower())[-3:]

# Copy the stock getters so the shared LEX_ATTRS dict isn't mutated in place.
ATTR_GETTERS = dict(spacy.lang.ru.LEX_ATTRS)
ATTR_GETTERS.update({
    attrs.NORM: norm,
    attrs.PREFIX: prefix,
    attrs.SUFFIX: suffix,
})

class CustomRussianDefaults(Russian.Defaults):
    lex_attr_getters = ATTR_GETTERS

@spacy.registry.languages("custom_ru")
class CustomRussian(Russian):
    lang = "custom_ru"
    Defaults = CustomRussianDefaults

nlp = ru_core_news_lg.load()
# omitted the patching of _pymorphy_lemmatize
nlp.lang = 'custom_ru'

Test

>>> nlp('Я ви́жу му́жа и жену́')[-1].morph
Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing
>>> nlp('Я вижу мужа и жену')[-1].morph
Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing

The Animacy for жену́ comes out as inanimate when diacritics are present, which is incorrect.

adrianeboyd commented 1 year ago

The language and language defaults really need to be set before the pipeline is loaded at all, but you can test this a bit by modifying the pipeline on the fly instead. (A few things may already be cached, so it might not work 100%.)

nlp = spacy.load("ru_core_news_lg")
nlp.vocab.lex_attr_getters.update(...)

A cleaner version would basically make a copy of ru_core_news_lg where the lang setting in the [nlp] block of config.cfg is edited to custom_ru. But with the above you should be able to test most things out. And keep in mind that the statistical models will still make mistakes, especially in ambiguous cases.
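
For example, the on-the-fly update might look like this (an illustrative sketch only; the choice of NORM, PREFIX, and SUFFIX follows the earlier suggestion, and the exact getter implementations are assumptions):

import re

import spacy
from spacy import attrs
from spacy.lang.char_classes import COMBINING_DIACRITICS

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

def strip_diacritics(s: str) -> str:
    # Remove combining marks such as the stress accent in "жену́".
    return DIACRITICS_RE.sub('', s)

nlp = spacy.load("ru_core_news_lg")
# Patch the getters on the already-loaded vocab. Lexemes that are already
# cached may keep their old attribute values, so this is only for quick tests.
nlp.vocab.lex_attr_getters.update({
    attrs.NORM: lambda s: strip_diacritics(s.lower()),
    attrs.PREFIX: lambda s: strip_diacritics(s)[:1],
    attrs.SUFFIX: lambda s: strip_diacritics(s)[-3:],
})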

Vuizur commented 6 months ago

I had the same problem and discovered at least a workaround: one can create two docs, one from the original stressed text and one from the text with diacritics removed. That way you can iterate through the docs in parallel, taking the correct (stressed) text from doc 1 and the grammatical information from doc 2, as in the sketch below.

It's half as fast, but it does work.
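
A minimal sketch of that workaround (assuming the two texts tokenize identically, which should hold when only combining diacritics are removed, since they attach to the preceding letter):

import re

import spacy
from spacy.lang.char_classes import COMBINING_DIACRITICS

DIACRITICS_RE = re.compile(f'[{COMBINING_DIACRITICS}]')

nlp = spacy.load("ru_core_news_lg")

stressed = 'Я ви́жу му́жа и жену́'
plain = DIACRITICS_RE.sub('', stressed)

doc_stressed = nlp(stressed)
doc_plain = nlp(plain)

# Walk both docs in lockstep: surface form from the stressed doc,
# grammatical analysis from the diacritic-free doc.
assert len(doc_stressed) == len(doc_plain)
for tok_s, tok_p in zip(doc_stressed, doc_plain):
    print(tok_s.text, tok_p.lemma_, tok_p.pos_, tok_p.morph)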