bjascob / LemmInflect

A Python module for English lemmatization and inflection.
MIT License

Ability to find base word without knowing the POS tag #16

Closed bjascob closed 1 year ago

bjascob commented 1 year ago

Some users would like to find the base word (aka lemma) without knowing the word's part of speech. This is problematic for words like "painting", whose lemma is either "paint" (as a verb) or "painting" (as a noun). Regardless, it may be useful to simply return "paint" for uses such as neural network sentence classification.

The proposed approach is to use the dictionary to find the shortest candidate. If the word is not in the dictionary, try the OOV (out-of-vocabulary) rules for nouns and verbs and choose the shortest result.
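As a rough sketch of that proposal (not code from the library), the two lookups can be passed in as callables. The assumption here is that `dict_lookup(word)` behaves like getAllLemmas() and `oov_lookup(word, upos)` behaves like getAllLemmasOOV(), each returning a dictionary that maps a upos string to a tuple of lemmas. The name `shortest_lemma` and the alphabetical tie-break are hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# Assumed return shape of both lookups: {"NOUN": ("watch",), "VERB": ("watch",)}
LemmaDict = Dict[str, Tuple[str, ...]]


def shortest_lemma(
    word: str,
    dict_lookup: Callable[[str], LemmaDict],
    oov_lookup: Callable[[str, str], LemmaDict],
) -> Optional[str]:
    """Return the shortest candidate lemma, or None if nothing was found."""
    # Step 1: collect every lemma the dictionary knows, across all upos types.
    candidates: Set[str] = {
        lemma for lemmas in dict_lookup(word).values() for lemma in lemmas
    }

    # Step 2: only when the dictionary has nothing, try the OOV rules
    # for nouns and verbs.
    if not candidates:
        for upos in ("NOUN", "VERB"):
            for lemmas in oov_lookup(word, upos).values():
                candidates.update(lemmas)

    if not candidates:
        return None

    # Shortest wins; equal lengths are ordered alphabetically so the
    # result is at least deterministic (an arbitrary choice).
    return min(candidates, key=lambda lemma: (len(lemma), lemma))
```

With the library installed, the call would presumably be `shortest_lemma(word, getAllLemmas, getAllLemmasOOV)`.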

bjascob commented 1 year ago

There are a number of situations that have no "correct" answer and would need to be handled:

  1. getAllLemmas() could return multiple lemmas with different spellings but the same length, for different upos types, so there is no clear way to determine which to choose.
  2. getAllLemmasOOV() requires a upos type, so it would need to be called repeatedly, once for each possible lemmatizing value (NOUN, VERB, ADJ, ADV).
  3. In addition, getAllLemmasOOV() may return poorly formed words when lemmatizing with the wrong upos type. This means the function would likely inject a number of clearly incorrect words when the original word may simply have been misspelled.
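To make the first point concrete, here is a small sketch that returns every shortest candidate instead of a single one, so the caller can see when the choice is ambiguous. It assumes the input has the same shape as the dictionary returned by getAllLemmas() (upos string mapped to a tuple of lemmas). The function name and the sample data are hypothetical and hand-written, not library output.

```python
from typing import Dict, List, Tuple


def shortest_candidates(lemmas_by_upos: Dict[str, Tuple[str, ...]]) -> List[str]:
    """Return all distinct lemmas that share the minimum length, sorted."""
    # Flatten the per-upos tuples into one set of distinct spellings.
    distinct = {lemma for lemmas in lemmas_by_upos.values() for lemma in lemmas}
    if not distinct:
        return []
    min_len = min(len(lemma) for lemma in distinct)
    return sorted(lemma for lemma in distinct if len(lemma) == min_len)


# Hand-written illustration for the word "saw": a noun reading and a verb
# reading with different spellings but the same length.
ambiguous = shortest_candidates({"NOUN": ("saw",), "VERB": ("see",)})
if len(ambiguous) > 1:
    print("no single shortest lemma:", ambiguous)
```

More than one entry in the result means the "shortest word" rule alone cannot decide.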

Due to the possibility of completely incorrect words when using OOV rules, it's probably best not to use that function at all when we don't know the upos.

For those who want to lemmatize without knowing the upos, it's easy enough to call getAllLemmas() and then filter the returned dictionary using whatever specific logic they wish to apply.

Given that there is already a simple function for the reasonable case of dictionary-only lookup, I won't implement a new function here.
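As one example of the kind of caller-side filtering described above, the sketch below picks a lemma by upos preference rather than by length. It assumes the dictionary has the shape returned by getAllLemmas() (upos string mapped to a tuple of lemmas). `pick_lemma` and its default priority order are hypothetical, and any upos key not listed in the priority is ignored.

```python
from typing import Dict, Optional, Sequence, Tuple


def pick_lemma(
    lemmas_by_upos: Dict[str, Tuple[str, ...]],
    upos_priority: Sequence[str] = ("VERB", "NOUN", "ADJ", "ADV"),
) -> Optional[str]:
    """Return the first lemma of the highest-priority upos that is present."""
    for upos in upos_priority:
        lemmas = lemmas_by_upos.get(upos)
        if lemmas:
            # Take the first spelling listed for this upos.
            return lemmas[0]
    # Nothing matched: the word was not found, or only under other upos types.
    return None
```

With the library installed, usage would presumably look like `pick_lemma(getAllLemmas("painting"))`. Preferring VERB is what yields "paint" over "painting" in the example from the original request, and callers who want the noun reading can pass a different `upos_priority`.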