Improve stemming - Githubissues

yhamoudi commented 10 years ago

Normalize words (only verbs?).

Ex:

died -> die
producted -> product

Ezibenroc commented 10 years ago

Also nouns, with some restrictions:

products -> product
producters -> producter -/-> product

Ezibenroc commented 10 years ago

Key word: stemming. See the external links on the Wikipedia page. For instance, the Porter Stemming Algorithm has many existing implementations, but they apparently do not support irregular verbs. Be careful with this: we want to transform running into run, but keep runner as is. We also do not want to apply it to quotations or entities (United States must stay as is, not be transformed into unit state).
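To make the "do not touch quotations or entities" constraint concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `normalize`, `TOY_STEMS` and `toy_stem` are hypothetical names, the lookup table only mimics what a stemmer might output (it is not the Porter algorithm), and a real pipeline would get its entity list from a named-entity recognition step rather than a hard-coded set.

```python
import re

# Hypothetical lookup standing in for a real stemmer (NOT the Porter algorithm).
TOY_STEMS = {
    "running": "run",
    "products": "product",
    "died": "die",
    "united": "unit",    # what we want to avoid inside an entity
    "states": "state",
}

def toy_stem(word):
    """Return the toy stem of a word, or the word unchanged if unknown."""
    return TOY_STEMS.get(word.lower(), word)

def normalize(sentence, stem, entities=()):
    """Apply `stem` to plain words, leaving quotations and entities untouched."""
    # Protected spans are tried first: double-quoted text, then known entities
    # (longest first, so a longer entity wins over a shorter one it contains).
    protected = ['"[^"]*"']
    for entity in sorted(entities, key=len, reverse=True):
        protected.append(r"\b" + re.escape(entity) + r"\b")
    pattern = re.compile("(" + "|".join(protected) + ")|([A-Za-z]+)")

    def replace(match):
        if match.group(1) is not None:
            return match.group(1)      # protected span: keep as is
        return stem(match.group(2))    # ordinary word: stem it

    return pattern.sub(replace, sentence)

with_entities = normalize("He was running in the United States",
                          toy_stem, {"United States"})
without_entities = normalize("He was running in the United States", toy_stem)
quoted = normalize('She said "running wild" while running', toy_stem)
```

With the entity protected, only `running` is changed; without it, `United States` degrades to `unit state`, which is the failure described above. Words absent from the table, such as `runner`, pass through unchanged.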

Other key word: lemmatisation. It seems more appropriate than stemming: it takes the context into account, and the output of the algorithm is always a real word (whereas the stem of "computation" is "comput", which is not a real word). Drawback: it is a more difficult task.

Found two tools for lemmatisation.

LemmaGen.
NLTK

My favourite is NLTK, because it is written in Python and seems to have a larger community. You need to install the wordnet NLTK package (with nltk.download('wordnet')). Example with NLTK (from StackOverflow):

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

LemmaGen supports several European languages.

yhamoudi commented 10 years ago

Perhaps we should look at the type of each word (noun, verb, ... as output by the POS tagger of the Stanford Parser) before applying lemmatization:

verb -> ok
noun -> don't "lemmatize". (france,president,?) != (france,presidents,?)
...
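A rough sketch of this POS-gated policy, under stated assumptions: the input is a list of (word, tag) pairs using Penn Treebank tags (where verb tags start with `VB` and noun tags with `NN`), and the lemmatizer is passed in as a plain function. `lemmatize_by_pos`, `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names; the tiny dictionary only stands in for a real lemmatizer.

```python
# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"were": "be", "was": "be", "died": "die", "presidents": "president"}

def toy_lemmatize(word):
    """Return the toy lemma of a word, or the word unchanged if unknown."""
    return TOY_LEMMAS.get(word, word)

def lemmatize_by_pos(tagged_words, lemmatize, allowed_prefixes=("VB",)):
    """Lemmatize only the words whose tag starts with an allowed prefix.

    By default only verbs (VB, VBD, VBG, ...) are lemmatized, so nouns
    keep their plural form.
    """
    result = []
    for word, tag in tagged_words:
        if tag.startswith(allowed_prefixes):
            result.append(lemmatize(word))
        else:
            result.append(word)
    return result

tagged = [("Who", "WP"), ("were", "VBD"), ("the", "DT"),
          ("presidents", "NNS"), ("of", "IN"), ("France", "NNP")]

verbs_only = lemmatize_by_pos(tagged, toy_lemmatize)
verbs_and_nouns = lemmatize_by_pos(tagged, toy_lemmatize, ("VB", "NN"))
```

Here `verbs_only` keeps `presidents` while reducing `were` to `be`; passing `("VB", "NN")` also reduces the noun, which is the alternative being debated in this thread.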

Ezibenroc commented 10 years ago

The second request does not work on our UI. Moreover, the question Who were the presidents of France? gives François Hollande on WolframAlpha...

So I don't agree :)

yhamoudi commented 10 years ago

It's a problem with our UI and WolframAlpha. We can hope that some modules will be able to understand it (and that the Wikidata module will be able to remove the "s"...)

if we always lemmatize, we lose all info about plurals

Ezibenroc commented 10 years ago

Then we could also hope that some modules will be able to understand verbs in the past tense. Who was the president of France? and Who is the president of France? have different meanings (the first one is an ambiguous sentence asking for some past president of France, the second one asks for the current president). Following this idea, we should not perform any lemmatization...

yhamoudi commented 10 years ago

Perhaps... But I think it's less important for verbs, because the plural/temporal/... info contained in the verb often (always?) appears in other parts of the sentence:

plural -> a plural subject (Who were the presidents of France?)
tense -> a date (Who was the president of France in 1984?)

For plurals, for example, the info is often contained in both the verb and the subject (= redundancy). If you lemmatize one of the two, you don't lose any info. If you lemmatize both, you lose it.
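The redundancy argument can be checked on a toy example. This is only an illustration under simple assumptions: `PLURAL_FORMS` is a hypothetical table of surface forms that signal a plural, and "information is kept" is approximated as "at least one plural-marked form is still present".

```python
# Hypothetical table: surface forms that signal a plural, mapped to their lemma.
PLURAL_FORMS = {"were": "be", "presidents": "president"}

def lemmatize_some(words, targets):
    """Lemmatize only the words listed in `targets`."""
    return [PLURAL_FORMS.get(w, w) if w in targets else w for w in words]

def plural_still_visible(words):
    """True if at least one plural-marked form survives in the sentence."""
    return any(w in PLURAL_FORMS for w in words)

question = ["Who", "were", "the", "presidents", "of", "France"]

verb_only = plural_still_visible(lemmatize_some(question, {"were"}))
noun_only = plural_still_visible(lemmatize_some(question, {"presidents"}))
both = plural_still_visible(lemmatize_some(question, {"were", "presidents"}))
```

Lemmatizing either the verb or the noun alone leaves one plural cue in place; lemmatizing both removes every cue, so the question can no longer be told apart from its singular counterpart.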

Ezibenroc commented 10 years ago

Fixed https://github.com/ProjetPP/PPP-NLP-classical/pull/24

ProjetPP / PPP-QuestionParsing-Grammatical

Improve stemming #23