clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

[SR] Incorrect lemmatization #33

Closed kateabr closed 1 year ago

kateabr commented 2 years ago

Describe the bug The pipeline for the Serbian language consistently produces incorrect results when processing certain words, for example "mnom".

To Reproduce The snippet below can be used to reproduce this behavior; the output it generates on my computer is also attached. An additional comparison between the original text and its transliteration was added to ensure that transliteration is not the cause.

import cyrtranslit
import classla

classla.download('sr')
sr_nlp = classla.Pipeline('sr')

print("check started\n- - - - -")

# Annotate the sentence twice: once from the Latin original and once from a
# Cyrillic version transliterated to Latin, to rule out transliteration as the cause.
lat = list(sr_nlp("Za mnom, čitaoče moj, i samo za mnom, a ja ću ti pokazati takvu ljubav!").iter_tokens())
cyr = list(sr_nlp(cyrtranslit.to_latin("За мном, читаоче мој, и само за мном, а ја ћу ти показати такву љубав!", "sr")).iter_tokens())
for lat_token, cyr_token in zip(lat, cyr):
    lat_dict = lat_token.to_dict()[0]
    cyr_dict = cyr_token.to_dict()[0]
    # Compare the two annotations key by key, in both directions.
    for lat_key in lat_dict:
        if lat_key not in cyr_dict:
            print(f"missing key: {lat_key}")
            break
        if lat_dict[lat_key] != cyr_dict[lat_key]:
            print(f"mismatching values for key: {lat_key}")
            break
    for cyr_key in cyr_dict:
        if cyr_key not in lat_dict:
            print(f"missing key: {cyr_key}")
            break
        if cyr_dict[cyr_key] != lat_dict[cyr_key]:
            print(f"mismatching values for key: {cyr_key}")
            break
    print(lat_token.pretty_print())

print("- - - - -\ncheck finished")

Produced output:

check started
- - - - -
<Token id=1;words=[<Word id=1;text=Za;lemma=za;upos=ADP;xpos=Si;feats=Case=Ins;head=2;deprel=case>]>
<Token id=2;words=[<Word id=2;text=mnom;lemma=mna;upos=PRON;xpos=Px--si;feats=Case=Ins|Number=Sing|Person=3|PronType=Prs;head=4;deprel=obl>]>
<Token id=3;words=[<Word id=3;text=,;lemma=,;upos=PUNCT;xpos=Z;head=2;deprel=punct>]>
<Token id=4;words=[<Word id=4;text=čitaoče;lemma=čitati;upos=VERB;xpos=Vmr1s;feats=Number=Sing|Person=1|Tense=Past|VerbForm=Fin;head=0;deprel=root>]>
<Token id=5;words=[<Word id=5;text=moj;lemma=moj;upos=DET;xpos=Ps1msn;feats=Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs;head=10;deprel=det>]>
<Token id=6;words=[<Word id=6;text=,;lemma=,;upos=PUNCT;xpos=Z;head=10;deprel=punct>]>
<Token id=7;words=[<Word id=7;text=i;lemma=i;upos=CCONJ;xpos=Cc;head=8;deprel=discourse>]>
<Token id=8;words=[<Word id=8;text=samo;lemma=samo;upos=ADV;xpos=Rgp;feats=Degree=Pos;head=10;deprel=advmod>]>
<Token id=9;words=[<Word id=9;text=za;lemma=za;upos=ADP;xpos=Si;feats=Case=Ins;head=10;deprel=case>]>
<Token id=10;words=[<Word id=10;text=mnom;lemma=mna;upos=PRON;xpos=Px--si;feats=Case=Ins|Number=Sing|Person=3|PronType=Prs;head=4;deprel=obl>]>
<Token id=11;words=[<Word id=11;text=,;lemma=,;upos=PUNCT;xpos=Z;head=16;deprel=punct>]>
<Token id=12;words=[<Word id=12;text=a;lemma=a;upos=CCONJ;xpos=Cc;head=16;deprel=cc>]>
<Token id=13;words=[<Word id=13;text=ja;lemma=ja;upos=PRON;xpos=Pp1-sn;feats=Case=Nom|Number=Sing|Person=1|PronType=Prs;head=16;deprel=nsubj>]>
<Token id=14;words=[<Word id=14;text=ću;lemma=hteti;upos=AUX;xpos=Var1s;feats=Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin;head=16;deprel=aux>]>
<Token id=15;words=[<Word id=15;text=ti;lemma=taj;upos=DET;xpos=Pd-mpn;feats=Case=Nom|Gender=Masc|Number=Plur|PronType=Dem;head=16;deprel=nsubj>]>
<Token id=16;words=[<Word id=16;text=pokazati;lemma=pokazati;upos=VERB;xpos=Vmn;feats=VerbForm=Inf;head=4;deprel=conj>]>
<Token id=17;words=[<Word id=17;text=takvu;lemma=takav;upos=DET;xpos=Pd-fsa;feats=Case=Acc|Gender=Fem|Number=Sing|PronType=Dem;head=18;deprel=det>]>
<Token id=18;words=[<Word id=18;text=ljubav;lemma=ljubav;upos=NOUN;xpos=Ncfsa;feats=Case=Acc|Gender=Fem|Number=Sing;head=16;deprel=obj>]>
<Token id=19;words=[<Word id=19;text=!;lemma=!;upos=PUNCT;xpos=Z;head=4;deprel=punct>]>
- - - - -
check finished

Expected behavior In the "mnom" token, the lemma should be "ja" and the person should be 1; instead, the pipeline produces: <Token id=2;words=[<Word id=2;text=mnom;lemma=mna;upos=PRON;xpos=Px--si;feats=Case=Ins|Number=Sing|Person=3|PronType=Prs;head=4;deprel=obl>]>

Environment:

Additional context Text is processed sentence by sentence; perhaps this is a matter of the context not being rich enough? The files are in .xml format (xml version="1.0", encoding="utf-8") and are loaded and parsed with lxml's parse method. The full version of the code that I use to annotate parallel texts can be found here.
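For reference, a minimal sketch of the sentence-by-sentence processing (the file name "parallel.xml" and the "seg" element are placeholders here, not the actual layout used by my annotation code):

import classla
from lxml import etree

# Assumes the standard 'sr' models are already downloaded via classla.download('sr').
sr_nlp = classla.Pipeline('sr')

# Placeholder file name and element tag; the real annotation code linked
# above works on parallel texts with a different layout.
tree = etree.parse("parallel.xml")
for seg in tree.iter("seg"):
    if seg.text and seg.text.strip():
        doc = sr_nlp(seg.text.strip())
        for token in doc.iter_tokens():
            print(token.pretty_print())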

nljubesi commented 2 years ago

Hey, a very short answer: there seems to be a problem with the Serbian standard model, and it will obviously need to be retrained. The non-standard model, however, seems OK. I know that the Croatian standard and non-standard models are also OK, but they prefer ijekavian lemmatization.

My suggestion would be to investigate the Serbian non-standard model.
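If it helps, the non-standard models are selected with the type argument (assuming the interface described in the classla README), e.g.:

import classla

# Download and load the non-standard (web/social media) Serbian models;
# the type='nonstandard' argument follows the classla README.
classla.download('sr', type='nonstandard')
sr_nlp_ns = classla.Pipeline('sr', type='nonstandard')

doc = sr_nlp_ns("Za mnom, čitaoče moj, i samo za mnom, a ja ću ti pokazati takvu ljubav!")
for token in doc.iter_tokens():
    print(token.pretty_print())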

We will retrain all models in the near future and will properly document the data used in the training process. This should all happen before the end of the year.

nljubesi commented 2 years ago

After some more analysis, I have concluded that there is sadly nothing wrong with the Serbian standard model itself. It is just very poor at direct speech, since it is based on newspaper data only (there is no occurrence of "mnom" in the entire training data). In your example, it all went wrong when the tagger failed to predict the "Pp1-si" tag for "mnom", which then pushed the lemmatizer to come up with the best possible lemma for the wrong tag.
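The cascade is easy to see if you print the predicted tag next to the lemma for each word (a small check on top of your snippet, using the same standard pipeline):

import classla

# Assumes the standard 'sr' models are already downloaded via classla.download('sr').
sr_nlp = classla.Pipeline('sr')

# For "mnom" the tagger outputs Px--si rather than Pp1-si, and the
# lemmatizer then produces the best lemma it can for that wrong tag
# ("mna" instead of "ja").
doc = sr_nlp("Za mnom, čitaoče moj, i samo za mnom, a ja ću ti pokazati takvu ljubav!")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.xpos, word.lemma)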

The upcoming standard model will be somewhat more similar to the current non-standard model, in that normalised Twitter data will also be used to make it more robust. This is what has already been done for Croatian.

For now, as already said, your best bet is to use the "nonstandard" model. Do let us know if you run into any problems with its output, especially ones of this scale!

And thanks again for reporting!