Inaccuracy related to capitalization

adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

MIT License

Hi @adbar,

While using simplemma, my team has found several odd cases related to capitalization.

When words are fully uppercase, the lemmatization doesn't seem correct:

>>> import simplemma
>>> simplemma.text_lemmatizer("TRAINING INSTRUCTIONS", "en")
['train', 'INSTRUCTIONS']

I have no idea why TRAINING is in the dictionary but INSTRUCTIONS is not, and I don't think TRAINING should be there. We might also want to adjust the dictionary lookup strategy so that it tries the fully lowercased form of the word.
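To make the idea concrete, here is a minimal sketch of a lowercase fallback. It does not touch simplemma's internals: `TOY_DICTIONARY`, `lookup_with_lowercase_fallback` and `lemmatize_tokens` are hypothetical names, and the dictionary entries are made up for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy dictionary (surface form -> lemma); NOT simplemma's real data.
TOY_DICTIONARY: Dict[str, str] = {
    "training": "train",
    "instructions": "instruction",
}


def lookup_with_lowercase_fallback(token: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the lemma for `token`: exact form first, then fully lowercased."""
    # 1. Exact match first, so case-sensitive entries keep working.
    if token in dictionary:
        return dictionary[token]
    # 2. Fall back to the fully lowercased form, so "INSTRUCTIONS" can match.
    lowered = token.lower()
    if lowered != token and lowered in dictionary:
        return dictionary[lowered]
    # 3. No match: the caller decides what to do with the token.
    return None


def lemmatize_tokens(tokens: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Lemmatize each token, keeping unknown tokens unchanged."""
    return [lookup_with_lowercase_fallback(t, dictionary) or t for t in tokens]
```

With this toy data, `lemmatize_tokens(["TRAINING", "INSTRUCTIONS"], TOY_DICTIONARY)` returns `['train', 'instruction']`, while an unknown token such as `"XYZ"` is returned as-is.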

I can open a PR, since the change itself is easy. But I have no idea how you test whether changes to the strategies improve or worsen simplemma's accuracy. It would be good to have that documented so that anyone working on the library can run the tests.
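I don't know what the existing evaluation looks like, so the following is only a sketch of the kind of check I have in mind: score each lookup strategy against a list of (token, expected lemma) pairs and compare the results. Everything here is an assumption for illustration: the names `lemma_accuracy`, `exact_lookup`, `fallback_lookup`, `TOY_DICT` and `GOLD_PAIRS` are hypothetical, and the data is a made-up toy sample, not a real benchmark.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical toy dictionary and gold pairs; NOT real simplemma data or results.
TOY_DICT: Dict[str, str] = {"instructions": "instruction", "cats": "cat"}
GOLD_PAIRS = [
    ("instructions", "instruction"),
    ("INSTRUCTIONS", "instruction"),
    ("cats", "cat"),
    ("Cats", "cat"),
]


def exact_lookup(token: str) -> str:
    """Baseline strategy: exact dictionary match, else return the token unchanged."""
    return TOY_DICT.get(token, token)


def fallback_lookup(token: str) -> str:
    """Candidate strategy: exact match, then lowercased match, else the token."""
    return TOY_DICT.get(token, TOY_DICT.get(token.lower(), token))


def lemma_accuracy(lemmatize: Callable[[str], str], gold: Iterable[Tuple[str, str]]) -> float:
    """Fraction of gold pairs for which `lemmatize(token)` equals the expected lemma."""
    pairs = list(gold)
    if not pairs:
        raise ValueError("gold data is empty")
    correct = sum(1 for token, lemma in pairs if lemmatize(token) == lemma)
    return correct / len(pairs)


if __name__ == "__main__":
    print("exact:", lemma_accuracy(exact_lookup, GOLD_PAIRS))
    print("fallback:", lemma_accuracy(fallback_lookup, GOLD_PAIRS))
```

On this toy sample the exact strategy gets half of the pairs right and the fallback gets all of them, but a real comparison would of course need proper annotated data.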

Wdyt?

Inaccuracy related to capitalization #93