The inflectional lexicon is being populated with corpus data with predicted XPOS during training

We populate the lemmatizer model's lexicon with data from both an inflectional lexicon and the training corpus. However, the corpus entries carry predicted XPOS tags, which makes sense because we want the training data to resemble the production environment.

Given the high level of control being added for Slovenian, which breaks our simple greedy approach to everything, I guess the easiest option is not to include corpus training data in the model's lexicon when an inflectional lexicon is present. When no inflectional lexicon is present, I would keep adding the corpus training data to the model's lexicon.
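A minimal sketch of the proposed behaviour, assuming the model's lexicon is a plain mapping from (word form, XPOS) to lemma. The function `build_model_lexicon`, its parameters, and the example entries are hypothetical names for illustration, not the actual classla code:

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]  # (word form, XPOS tag)


def build_model_lexicon(
    corpus_entries: Iterable[Tuple[str, str, str]],
    inflectional_lexicon: Optional[Dict[Key, str]] = None,
) -> Dict[Key, str]:
    """Build the lemmatizer's lexicon (hypothetical helper).

    corpus_entries yields (form, predicted_xpos, lemma) triples taken from
    the training corpus. If an inflectional lexicon is supplied, the corpus
    entries are ignored, so predicted XPOS tags never enter the lexicon.
    """
    if inflectional_lexicon is not None:
        # Inflectional lexicon present: use it alone.
        return dict(inflectional_lexicon)

    # No inflectional lexicon: fall back to the corpus data, as before.
    lexicon: Dict[Key, str] = {}
    for form, xpos, lemma in corpus_entries:
        # Keep the first lemma seen for each (form, XPOS) pair.
        lexicon.setdefault((form, xpos), lemma)
    return lexicon


# Illustrative data only: the second corpus entry has a mispredicted XPOS.
corpus = [("mize", "Ncfsg", "miza"), ("mize", "Ncmsg", "miza")]
inflectional = {("mize", "Ncfsg"): "miza"}

with_lexicon = build_model_lexicon(corpus, inflectional)
without_lexicon = build_model_lexicon(corpus)
```

With the inflectional lexicon passed in, the mispredicted `("mize", "Ncmsg")` key stays out of the result; without it, both corpus entries are kept, which matches the current behaviour.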

clarinsi/classla, issue #11