adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html
MIT License

Greedy option seems inconsistent #97

Open dysby opened 1 year ago

dysby commented 1 year ago

Hi, I'm using version 0.9.1 of your library.

I found inconsistent behavior when using the greedy option. See the example below, where I expected the lemmatized versions of the text to be equal when the greedy option is forced.

>>> from simplemma import text_lemmatizer
>>> text_lemmatizer("fire crew", lang="en")
['fire', 'crow']
>>> text_lemmatizer("fire crews", lang="en", greedy=True)
['fire', 'crew']
>>> text_lemmatizer(" ".join(text_lemmatizer("fire crews", lang="en", greedy=True)), lang="en")
['fire', 'crow']
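
The round-trip comparison above can be wrapped in a small helper so that it is easy to repeat on other inputs. This is only a sketch: `is_stable` and `toy_lemmatizer` are hypothetical names, and the toy dictionary is a stand-in that mimics the outputs reported here rather than simplemma's real language data.

```python
from typing import Callable, List

def is_stable(lemmatize: Callable[[str], List[str]], text: str) -> bool:
    """Return True if lemmatizing the lemmatized text again changes nothing."""
    first_pass = lemmatize(text)
    second_pass = lemmatize(" ".join(first_pass))
    return first_pass == second_pass

# Hypothetical stand-in that mimics the reported outputs (not simplemma data).
TOY_LEMMAS = {"crews": "crew", "crew": "crow"}

def toy_lemmatizer(text: str) -> List[str]:
    """Look each whitespace-separated token up once in the toy dictionary."""
    return [TOY_LEMMAS.get(token, token) for token in text.split()]

if __name__ == "__main__":
    # Unstable: the second pass turns "crew" into "crow".
    print(is_stable(toy_lemmatizer, "fire crews"))
    # Stable: neither token has an entry in the toy dictionary.
    print(is_stable(toy_lemmatizer, "fire crow"))
```

With the library installed, the same helper should accept something like `lambda t: text_lemmatizer(t, lang="en", greedy=True)` in place of the toy function.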

Thanks,

adbar commented 1 year ago

Hi @dysby, good catch!

My guess would be that the results are cached internally, which affects the results of text_lemmatizer(). In any case it is worth looking further into the issue.

dysby commented 1 year ago

I think it has to do with the minimum word length at simplemma.py#L495 in the latest release (0.9.1).

I'm not sure whether the more recent code does the same.
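
To make the minimum-word-length hypothesis concrete, here is a toy model of how such a threshold could produce the outputs reported above. It is not simplemma's actual code: the function `toy_lemmatize`, the dictionary entries, and the threshold value of 5 are all assumptions chosen only to reproduce the pattern.

```python
# Hypothetical sketch, NOT simplemma's implementation.
TOY_DICT = {"crews": "crew", "crew": "crow"}  # made-up entries
ASSUMED_MIN_LENGTH = 5  # assumed threshold, chosen for illustration

def toy_lemmatize(token: str, greedy: bool = False,
                  min_length: int = ASSUMED_MIN_LENGTH) -> str:
    """Look a token up once; in greedy mode, keep following dictionary
    entries, but only while the current candidate is long enough."""
    candidate = TOY_DICT.get(token, token)
    if not greedy:
        return candidate
    # Greedy mode: further lookups are skipped for short candidates.
    while len(candidate) >= min_length and candidate in TOY_DICT:
        candidate = TOY_DICT[candidate]
    return candidate

if __name__ == "__main__":
    print(toy_lemmatize("crew"))                               # plain lookup
    print(toy_lemmatize("crews", greedy=True))                 # stops at the short candidate
    print(toy_lemmatize("crews", greedy=True, min_length=0))   # follows the full chain
```

In this model, greedy mode stops at "crew" because the candidate falls below the assumed threshold, while a plain lookup of "crew" still yields "crow". Whether the real code behaves this way would need to be checked against the line referenced above.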