bjascob / LemmInflect

A python module for English lemmatization and inflection.
MIT License
258 stars 25 forks

Incorrect inflections of special adjectives like beautiful and handsome #6

Open OscarWang114 opened 3 years ago

OscarWang114 commented 3 years ago

Hi, thanks for building this amazing tool! Currently, it doesn't seem to handle inflections of special adjectives like beautiful and handsome correctly.

Example:

from lemminflect import getLemma, getInflection

lemma = getLemma('beautiful', upos='ADJ')
inflection1 = getInflection(lemma[0], tag='JJR')
inflection2 = getInflection(lemma[0], tag='JJS')
print(inflection1, inflection2)

This gives ('beautifuler',) and ('beautifulest',). It'd be great if lemminflect could output something like ('more', 'beautiful',) or ('more beautiful',)!

bjascob commented 3 years ago

Thanks for pointing this out.

What's happening is that the dictionary has no JJR/JJS inflection for this lemma, so it falls back to the out-of-vocabulary (OOV) rules to create one. You can see this if you do...

lemminflect.Inflections().getAllInflections(lemma[0])
{'JJ': ('beautiful',)}

Essentially, you're asking it to do something that isn't correct for English and it doesn't know that this isn't allowed, or at least isn't going to try to stop you.

I could probably add a rule to prevent it from creating an inflection if it has the base lemma but not the specific inflection (or at least log a warning). However, I'm a little concerned that there might be instances where it only has the base form, and falling back to the OOV rules for inflection allows things to work correctly for the user.
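
As a rough illustration of that guard, here is a minimal sketch written as a wrapper outside the library rather than as a change to it. The name guarded_inflection and its parameters are hypothetical. The two lookup functions are passed in as arguments so the policy stays visible; the assumption is that they behave like lemminflect's getAllInflections (lemma to a dict of tag to forms) and getInflection (lemma and tag to a tuple of forms).

```python
from typing import Callable, Dict, Tuple

Forms = Tuple[str, ...]


def guarded_inflection(
    lemma: str,
    tag: str,
    lookup_all: Callable[[str], Dict[str, Forms]],
    inflect: Callable[[str, str], Forms],
) -> Forms:
    """Return forms for `tag`, refusing rule-based guesses for known lemmas.

    Hypothetical helper, not part of lemminflect.
    """
    known = lookup_all(lemma)
    if tag in known:
        # The dictionary has this exact inflection: use it.
        return known[tag]
    if known:
        # The lemma is in the dictionary but has no entry for this tag,
        # so do not invent one with the OOV rules.
        return ()
    # The lemma is unknown altogether: let the caller's inflector guess.
    return inflect(lemma, tag)
```

With lemminflect installed, the call would presumably look like guarded_inflection('beautiful', 'JJR', lemminflect.getAllInflections, lemminflect.getInflection). The concern above still applies: this also suppresses OOV results that would have been correct for lemmas stored with only their base form.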

The right way to do this would be to have a defined list or set of rules for these exceptions and implement a lookup for them. I can look in the base NIH lexicon to see if there's anything that would help with that. If you're aware of any resource that details this behavior, let me know. I'll have to look into this some more.
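
A lookup of that kind could be sketched roughly as follows. Everything here is an assumption for illustration: the tables are tiny and hand-written (a real version would be derived from a lexicon), the names IRREGULAR, PERIPHRASTIC and inflect_adjective are hypothetical, and returning a single 'more beautiful' string is just one of the two output shapes suggested in the report.

```python
from typing import Callable, Dict, Tuple

Forms = Tuple[str, ...]

# Hypothetical exception tables; illustrative only, nowhere near complete.
IRREGULAR: Dict[str, Dict[str, Forms]] = {
    'good': {'JJR': ('better',), 'JJS': ('best',)},
    'bad': {'JJR': ('worse',), 'JJS': ('worst',)},
}
# Adjectives compared with more/most (the examples from this issue).
PERIPHRASTIC = frozenset({'beautiful', 'handsome'})
MARKERS = {'JJR': 'more', 'JJS': 'most'}


def inflect_adjective(
    lemma: str, tag: str, fallback: Callable[[str, str], Forms]
) -> Forms:
    """Consult the exception tables first, then defer to `fallback`."""
    if tag in MARKERS:
        if lemma in IRREGULAR:
            return IRREGULAR[lemma][tag]
        if lemma in PERIPHRASTIC:
            return (f'{MARKERS[tag]} {lemma}',)
    # Not a comparative/superlative request, or no exception recorded.
    return fallback(lemma, tag)
```

Here fallback stands in for the existing inflector, for example lemminflect.getInflection, so any word without a recorded exception behaves exactly as it does today.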

nihil-admirari commented 1 year ago

At least one exception is handled incorrectly:

In [1]: getInflection('little', 'JJR')
Out[1]: ('littler',)  # should be less

In [2]: getInflection('little', 'JJS')
Out[2]: ('littlest',)  # should be least

Some adjectives don't have comparative or superlative forms at all, not even more/most:

In [3]: getInflection('alphanumeric', 'JJR')
Out[3]: ('alphanumericer',)

In [4]: getInflection('alphanumeric', 'JJS')
Out[4]: ('alphanumericest',)

Simple Wiktionary has a list of them: https://simple.wiktionary.org/wiki/Category:Non-comparable_adjectives; not sure whether it's exhaustive.
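
If such a list were available locally, filtering on it could look roughly like the sketch below. The set is seeded by hand with the one example above; loading it from the Wiktionary category is left out, and the names NON_COMPARABLE and inflect_if_comparable are hypothetical. Returning an empty tuple for "no such form" is also an assumption about what callers would want.

```python
from typing import Callable, Tuple

Forms = Tuple[str, ...]

# Hypothetical, hand-seeded set; a real one would be loaded from a word list.
NON_COMPARABLE = frozenset({'alphanumeric'})
GRADED_TAGS = frozenset({'JJR', 'JJS'})


def inflect_if_comparable(
    lemma: str, tag: str, fallback: Callable[[str, str], Forms]
) -> Forms:
    """Return () for comparative/superlative requests on listed adjectives."""
    if tag in GRADED_TAGS and lemma.lower() in NON_COMPARABLE:
        # No comparative or superlative exists, not even with more/most.
        return ()
    # Everything else is handled by the caller's inflector unchanged.
    return fallback(lemma, tag)
```

As before, fallback is a stand-in for the existing inflector, so a call might pass lemminflect.getInflection in that position.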