clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

The inflectional lexicon is being populated with corpus data with predicted XPOS during training #11

Closed nljubesi closed 1 year ago

nljubesi commented 3 years ago

We populate the lemmatizer model's lexicon with data from both an inflectional lexicon and the training corpus. For the corpus data, however, we use the predicted XPOS tags, which makes sense because we want the training data to resemble the production environment.

Given the high level of control that is being added for Slovenian, which breaks our simple greedy approach to everything, I guess the easiest solution is not to include corpus training data in the model's lexicon when an inflectional lexicon is present. When no inflectional lexicon is available, I would keep adding the corpus training data to the model's lexicon.
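
To make the proposal concrete, here is a minimal sketch of the rule in Python. It is not the actual classla code: the function name `build_lemmatizer_lexicon`, the `(form, xpos, lemma)` entry layout, and the example tags are assumptions made only for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical data layout, not taken from the classla code base:
# the lexicon maps (word form, XPOS tag) to a lemma.
Entry = Tuple[str, str, str]          # (form, xpos, lemma)
Lexicon = Dict[Tuple[str, str], str]  # (form, xpos) -> lemma


def build_lemmatizer_lexicon(
    corpus_entries: Iterable[Entry],
    inflectional_entries: Optional[Iterable[Entry]] = None,
) -> Lexicon:
    """Build the lemmatizer lexicon following the rule proposed above.

    If an inflectional lexicon is given, only its entries are used and the
    corpus entries (which carry predicted XPOS tags) are left out.
    If no inflectional lexicon is given, the corpus entries are used.
    """
    lexicon: Lexicon = {}
    if inflectional_entries is not None:
        for form, xpos, lemma in inflectional_entries:
            lexicon[(form, xpos)] = lemma
        return lexicon
    for form, xpos, lemma in corpus_entries:
        lexicon[(form, xpos)] = lemma
    return lexicon


# Illustrative entries only; the second corpus entry stands in for a
# mispredicted XPOS tag that would otherwise end up in the lexicon.
corpus = [
    ("hiše", "Ncfsg", "hiša"),
    ("hiše", "Ncmsn", "hiša"),
    ("lepa", "Agpfsn", "lep"),
]
inflectional = [("hiše", "Ncfsg", "hiša")]

with_lexicon = build_lemmatizer_lexicon(corpus, inflectional)
without_lexicon = build_lemmatizer_lexicon(corpus)
```

With this rule, `with_lexicon` contains only the inflectional entry, so a wrongly predicted tag from the corpus never reaches the model's lexicon, while `without_lexicon` falls back to the corpus data as before.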