We populate the lemmatizer model's lexicon with data from both an inflectional lexicon and the training corpus. For the corpus entries we use the predicted XPOS tags, which makes sense because we want the training data to resemble the production environment.
Given the high level of control that is being added for Slovenian, which breaks our simple greedy approach to everything, I think the easiest option is not to include corpus training data in the model's lexicon when an inflectional lexicon is present. When there is no inflectional lexicon, I would keep adding the corpus training data to the model's lexicon.
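To make the proposal concrete, here is a minimal sketch of the rule in Python. It is not the actual lemmatizer code: the function name `build_lexicon`, the `(form, xpos, lemma)` entry layout, and the "first entry wins" handling of duplicate keys are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical entry layout: (surface form, XPOS tag, lemma).
Entry = Tuple[str, str, str]
# Hypothetical lexicon layout: (surface form, XPOS tag) -> lemma.
Lexicon = Dict[Tuple[str, str], str]


def build_lexicon(
    corpus_entries: List[Entry],
    inflectional_entries: Optional[List[Entry]] = None,
) -> Lexicon:
    """Build the model lexicon following the rule proposed above.

    If an inflectional lexicon is present (non-empty), only its entries
    are used. Otherwise the lexicon is built from the training corpus,
    whose XPOS tags are assumed to be the predicted ones.
    """
    # Pick exactly one source; the two are never merged in this sketch.
    source = inflectional_entries if inflectional_entries else corpus_entries

    lexicon: Lexicon = {}
    for form, xpos, lemma in source:
        # Assumption: on duplicate keys the first entry wins.
        lexicon.setdefault((form, xpos), lemma)
    return lexicon
```

For example, calling `build_lexicon(corpus, inflectional)` with a non-empty `inflectional` list ignores `corpus` entirely, while `build_lexicon(corpus)` falls back to the corpus entries.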