explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.87k stars 4.38k forks

Incorrect lemma from lemmatizer #3444

Closed petro-zdebskyi closed 5 years ago

petro-zdebskyi commented 5 years ago

Right: [w.lemma_ for w in nlp('funnier')] -> ['funny']

Wrong: [w.lemma_ for w in nlp('faster')] -> ['faster']

I think the lemma for the word "faster" should be "fast".

petro-zdebskyi commented 5 years ago

This also behaves in an unexpected way: nlp('data')[0].lemma_ -> 'datum'
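For anyone who wants to check these cases side by side, here is a minimal sketch of a reproduction helper. The names `lemmas`, `report`, and `load_pipeline` are made up for illustration, and the model name `en_core_web_sm` is an assumption, since the thread does not say which model was loaded.

```python
def lemmas(nlp, text):
    """Return the lemma string of every token produced by nlp(text)."""
    return [token.lemma_ for token in nlp(text)]


def report(nlp, words):
    """Map each input word to the list of lemmas the pipeline assigns it."""
    return {word: lemmas(nlp, word) for word in words}


def load_pipeline(name="en_core_web_sm"):
    """Load a spaCy pipeline; the default model name is an assumption."""
    # Imported lazily so the helpers above can be used without spaCy installed.
    import spacy

    return spacy.load(name)
```

With spaCy and a model installed, `report(load_pipeline(), ["funnier", "faster", "data"])` collects all three cases in one dictionary. The actual lemmas depend on the spaCy version and model, so they may differ from the ones reported above.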

DuyguA commented 5 years ago

Hello, the lemma of an adverb is the adverb itself 99% of the time. However, some short adverbs do have comparative and superlative forms, as in your example, fast -> faster. I'll make a small refinement.

Data is the plural of datum in Latin, and it is treated as a plural in formally correct sentences. In everyday written and spoken English, however, it is used as a singular. Honestly, I don't know whether we should mark data as plural; doing so could cause serious practical problems.
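To make the two points above concrete, here is a toy sketch of an exception-table lemmatizer. It is not spaCy's actual lemmatizer or its data: the tables, the `lemmatize` function, and the `treat_data_as_plural` flag are all hypothetical and exist only to illustrate the trade-off.

```python
# Hypothetical exception tables for illustration; not spaCy's real data.
ADVERB_EXCEPTIONS = {
    "faster": "fast",
    "fastest": "fast",
    "harder": "hard",
    "hardest": "hard",
}
NOUN_EXCEPTIONS = {"data": "datum"}


def lemmatize(word, pos, treat_data_as_plural=False):
    """Return a lemma for `word` given a coarse part-of-speech tag."""
    word = word.lower()
    if pos == "ADV":
        # Most adverbs are their own lemma; only listed comparatives and
        # superlatives map back to the base form.
        return ADVERB_EXCEPTIONS.get(word, word)
    if pos == "NOUN":
        # "data" is kept as-is unless the caller opts into the Latin plural.
        if word == "data" and not treat_data_as_plural:
            return word
        return NOUN_EXCEPTIONS.get(word, word)
    return word


print(lemmatize("faster", "ADV"))    # fast
print(lemmatize("quickly", "ADV"))   # quickly
print(lemmatize("data", "NOUN"))     # data
print(lemmatize("data", "NOUN", treat_data_as_plural=True))  # datum
```

The flag makes the design choice explicit: mapping data to datum is defensible on grammatical grounds, but leaving it off by default matches everyday usage.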

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.