clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

Lexicon construction process during tagger training propagates training data noise into the lemmatizer #36

Closed. nljubesi closed this issue 1 year ago.

nljubesi commented 2 years ago

When using JANES training data together with SST data, I experienced a drastic (10-point) drop in lemmatization quality.

Inspecting the reasons, I realised that during tagger training on the combination of JANES and SST data, all the noise from the JANES data (function words sometimes lemmatized in title case) was included in the lexicon and then rather heavily applied during lemmatization through the lexicon lookup. My assumption is that entries such as na Na ADP are added to the lexicon after the clean entries, and that during lookup they are somehow preferred over the earlier na na ADP entries.

We need to discuss how the lexicon is expanded from training data so that it does not soak up all the noise that can always be expected there. One possible solution might be to ensure that the lookup procedure prefers earlier entries over later ones.
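For illustration only, here is a minimal sketch of such a first-seen-wins lookup; this is not classla's actual implementation, and the (word, upos) -> lemma mapping and the build_lexicon helper are assumptions made just for this example:

```python
# Minimal sketch of a first-seen-wins lexicon lookup.
# NOTE: this is an illustration, not classla's actual data structure;
# the (word, upos) -> lemma mapping is an assumption for this example.

def build_lexicon(entries):
    """entries: iterable of (word, lemma, upos); earlier entries are trusted more."""
    lexicon = {}
    for word, lemma, upos in entries:
        key = (word.lower(), upos)
        # Keep only the first lemma seen for each (word, upos) pair,
        # so noisy later entries from training data cannot override it.
        lexicon.setdefault(key, lemma)
    return lexicon

entries = [
    ("na", "na", "ADP"),   # clean entry, e.g. from the inflectional lexicon
    ("na", "Na", "ADP"),   # noisy title-case entry picked up from JANES
]
lexicon = build_lexicon(entries)
print(lexicon[("na", "ADP")])  # -> "na"
```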

nljubesi commented 2 years ago

I tried dealing with the issue by simply ensuring that the (misnamed) method load_influectial_lexicon loads just the first hypothesis, which made the lemmatizer jump 5 points up, with another 5 points still to be recovered.

For some tokens it seems that the mislemmatized entries (jaz Jaz PRON) actually sit at a higher position in the inflectional lexicon than the correctly lemmatized ones (jaz jaz PRON). We need a meeting on how exactly we populate the lexicon in the POS model. We might only need to ensure that lexicon entries are ordered by 1. inflectional lexicon occurrence and 2. frequency in the training data, but this is to be checked.

lkrsnik commented 1 year ago

Did you use --inflectional_lexicon_path when training the POS model? And if so, did you include the Sloleks lexicon from CLARIN?

The way classla works when generating the internal inflectional lexicon is that it uses the lemma with the highest inflectional lexicon occurrence from the CSV. All lemmas coming only from the training data should be behind everything from the inflectional lexicon. That is why I am surprised that you are getting jaz Jaz PRON, since this is not in the inflectional lexicon at all, whereas jaz jaz PRON is. I might have to look into this further, but I would need the data you used and the exact commands you ran for training in order to reproduce the error.

We might only need to assure that lexicon entries are ordered by 1. inflectional lexicon occurrence, 2. frequency in the training data, but this is to be checked.

Point 1 is already implemented, and words from the inflectional lexicon have priority over those from the training data. Point 2 is so far not implemented, but it is a nice idea and maybe should be.
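As a rough sketch of the ordering being discussed (again an illustration, not the code in classla; the candidate tuples, counts, and the pick_lemma helper are invented for the example): candidates could be ranked first by inflectional-lexicon occurrence and only then by training-data frequency, so a training-data-only form like Jaz can never outrank jaz, which is attested in Sloleks.

```python
# Illustration of ranking lemma candidates by
# (1) inflectional-lexicon occurrence, then (2) training-data frequency.
# The candidates and counts below are invented for this example.

def pick_lemma(candidates):
    """candidates: list of (lemma, lexicon_count, train_freq)."""
    # Lexicon-attested lemmas always win; training-data frequency is
    # only used as a tie-breaker among equally attested candidates.
    best = max(candidates, key=lambda c: (c[1], c[2]))
    return best[0]

candidates_for_jaz_PRON = [
    ("jaz", 1, 120),  # attested in the inflectional lexicon
    ("Jaz", 0, 3),    # noise that only occurs in the training data
]
print(pick_lemma(candidates_for_jaz_PRON))  # -> "jaz"
```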

nljubesi commented 1 year ago

While training the tagger we used

python -m classla.models.tagger --save_dir classla-spoken/models/pos/ --save_name baseline+janes --wordvec_file ~/data/clarin.si-embed/embed.sl-token.ft.sg.vec.xz --train_file classla-spoken/conllu/sst+janes-train.conllu --eval_file classla-spoken/conllu/sst-test.conllu --gold_file classla-spoken/conllu/sst-test.conllu --mode train --shorthand sl_ssj --output_file classla-spoken/out-temp/sst-test.baseline+janes.pos.conllu --inflectional_lexicon_path ~/data/morphlex/Sloleks2.0.MTE/sloleks_clarin_2.0-en.ud.tbl

and while training the lemmatizer we used

python -m classla.models.lemmatizer --model_dir classla-spoken/models/lemma/ --model_file baseline+janes --train_file classla-spoken/conllu/sst-train.conllu --eval_file classla-spoken/out/sst-test.baseline+janes.pos.conllu --output_file classla-spoken/out-temp/sst-test.baseline+janes.lemma.conllu --gold_file classla-spoken/conllu/sst-test.conllu --mode train --num_epoch 30 --decay_epoch 20 --pos --pos_model_path classla-spoken/models/pos/baseline+janes

The noise from the JANES training data, as described, ended up in the lexicon and, moreover, became very prominent, lowering lemmatiser performance by 10 points. This is why we used the --pos_model_path classla-spoken/models/pos/baseline POS model in the end, which brought lemmatisation performance back to the expected level.

The data, as well as full documentation on training and evaluation for the examples above, is available in this repo: https://github.com/clarinsi/classla-spoken.

nljubesi commented 1 year ago

This issue seems to have been resolved by the recent fixes. Testing via https://github.com/clarinsi/classla-spoken has shown that the old JANES training data can no longer negatively impact lemmatisation.