adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html
MIT License

Inaccuracy related to capitalization #93

Open juanjoDiaz opened 1 year ago

juanjoDiaz commented 1 year ago

Hi @adbar,

Using simplemma, my team has found multiple odd cases related to capitalization.

When words are fully capitalized, the lemmatization doesn't seem correct:

>>> import simplemma
>>> simplemma.text_lemmatizer("TRAINING INSTRUCTIONS", "en")
['train', 'INSTRUCTIONS']

I have no idea why TRAINING is in the dictionary but INSTRUCTIONS is not. I don't think TRAINING should be there. We might also want to adjust the dictionary lookup strategy so that it tries the fully lowercased word.
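To make the suggested lookup change concrete, here is a minimal sketch of a lowercase fallback. It does not use simplemma's internals: `TOY_LEMMAS` and `lookup_with_lowercase_fallback` are hypothetical names, and the dictionary is a stand-in for the real language data.

```python
# Toy dictionary standing in for the real language data (hypothetical entries).
TOY_LEMMAS = {"training": "train", "instructions": "instruction"}


def lookup_with_lowercase_fallback(token, lemmas=TOY_LEMMAS):
    """Look up the token as written, then fall back to its lowercased form.

    This is an illustrative helper, not simplemma's actual lookup code.
    """
    if token in lemmas:
        return lemmas[token]
    lowered = token.lower()
    if lowered in lemmas:
        return lemmas[lowered]
    # Unknown tokens (e.g. acronyms) are returned unchanged.
    return token


print([lookup_with_lowercase_fallback(t) for t in "TRAINING INSTRUCTIONS".split()])
```

With this fallback both all-caps words resolve through their lowercase entries, while a token missing from the dictionary, such as an acronym, comes back as-is.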

I can open a PR since the change is easy. But I have no idea how you test whether changes to the strategies improve or worsen simplemma's accuracy. It would be good to document that so anyone working on the library can run the tests.

Wdyt?

adbar commented 1 year ago

Hi @juanjoDiaz, thanks for the feedback, that's odd indeed.

Words written in all caps are currently left untouched in case they are acronyms (e.g. BRICS). That said, a token longer than some threshold x is most probably not an acronym and can be lowercased if the language is in BETTER_LOWER. Long acronyms are rare in English, so we need to decide on a length limit. I'd say 6 or 7: https://en.wiktionary.org/wiki/Category:English_acronyms
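The length heuristic described above could be sketched roughly as follows. This is only an illustration under assumptions: `normalize_all_caps` and `MAX_ACRONYM_LENGTH` are hypothetical names, the limit of 6 is one of the two values floated here, and the `BETTER_LOWER` set below is a placeholder rather than the library's actual contents.

```python
# Assumed length limit; the thread suggests 6 or 7 without settling on one.
MAX_ACRONYM_LENGTH = 6

# Placeholder contents; the real BETTER_LOWER in the library may differ.
BETTER_LOWER = {"en"}


def normalize_all_caps(token, lang, max_len=MAX_ACRONYM_LENGTH):
    """Lowercase an all-caps token that is too long to be a likely acronym.

    Hypothetical helper: shorter all-caps tokens are kept as-is because they
    may be acronyms, and languages outside BETTER_LOWER are never touched.
    """
    if lang in BETTER_LOWER and token.isupper() and len(token) > max_len:
        return token.lower()
    return token


for word in ("TRAINING", "INSTRUCTIONS", "BRICS"):
    print(word, "->", normalize_all_caps(word, "en"))
```

The normalized token would then go through the regular dictionary lookup, so short acronyms like BRICS stay protected while longer all-caps words are lemmatized like their lowercase forms.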