Closed yhamoudi closed 10 years ago
Also nouns, with some restrictions: products -> product, but producers -> producer -/-> product
Key word: stemming
See the external links on the Wikipedia page. For instance, the Porter Stemming Algorithm has a lot of existing implementations, but apparently they do not support irregular verbs.
Be careful with this: we want to transform "running" into "run", but we want to keep "runner" as is.
We do not want to apply this to quotations or to entities ("United States" must stay as is, not be transformed into "unit state").
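A minimal sketch of that restriction, assuming quotations are delimited by double quotes and that the list of entities is already known (in practice it would come from an entity recogniser). The names toy_normalizer and normalize_unprotected are hypothetical, and toy_normalizer is only a crude stand-in for a real stemmer or lemmatizer:

```python
import re

# A double-quoted span such as "to be or not to be".
QUOTED_SPAN = r'"[^"]*"'
WORD = re.compile(r"[A-Za-z]+")


def toy_normalizer(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a final "ed" or "s"."""
    word = word.lower()
    if len(word) > 4 and word.endswith("ed"):
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize_unprotected(text, normalize_word, entities=()):
    """Apply normalize_word to every word outside quotations and known entities."""
    # Longest entities first, so a longer entity wins over one of its prefixes.
    ordered = sorted(entities, key=len, reverse=True)
    alternatives = [QUOTED_SPAN] + [re.escape(entity) for entity in ordered]
    protected = re.compile("|".join(alternatives))

    def normalize_fragment(fragment):
        # Rewrite each alphabetic word; spaces and punctuation are left untouched.
        return WORD.sub(lambda m: normalize_word(m.group(0)), fragment)

    pieces = []
    position = 0
    for match in protected.finditer(text):
        # Normalize the text before the protected span, then copy the span verbatim.
        pieces.append(normalize_fragment(text[position:match.start()]))
        pieces.append(match.group(0))
        position = match.end()
    pieces.append(normalize_fragment(text[position:]))
    return "".join(pieces)


question = "Who was elected in the United States?"
print(normalize_unprotected(question, toy_normalizer))
print(normalize_unprotected(question, toy_normalizer, ["United States"]))
```

Without the entity list, the toy normalizer turns the entity into "unit state"; with it, the entity is copied through unchanged while the other words are still normalized.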
Other key word: lemmatisation. It seems more appropriate than stemming: it takes the context into account, and the output of the algorithm is always a real word (whereas the stem of "computation" is "comput", which is not a real word). Drawback: it is a more difficult task.
Found two tools for lemmatisation.
My favourite is NLTK, because it is written in Python and seems to have a larger community.
You need to install the wordnet NLTK package (with nltk.download('wordnet')).
Example with NLTK (from StackOverflow):
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'
LemmaGen supports several European languages.
Perhaps we should look at the type of each word (noun, verb, etc., as output by the POS tagger of the Stanford parser) before applying lemmatization:
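One way this could look, as a rough sketch: map each Penn Treebank tag (the tagset assumed here for the tagger output) to the one-letter part of speech that the NLTK lemmatizer accepts as its second argument, as in the 'v' of the example above. The function names penn_to_wordnet_pos and lemmatize_tagged are hypothetical, and the lemmatizer is passed in as a plain callable so the sketch does not depend on any installed corpus:

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    # Everything else is treated as a noun, which is also the lemmatizer's default.
    return "n"


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize (word, tag) pairs with a callable taking (word, pos)."""
    return [lemmatize(word, penn_to_wordnet_pos(tag)) for word, tag in tagged_words]


def identity_lemmatizer(word, pos):
    # Hypothetical placeholder that shows which part of speech each word receives.
    return "%s/%s" % (word, pos)


tagged = [("Who", "WP"), ("were", "VBD"), ("the", "DT"), ("presidents", "NNS")]
print(lemmatize_tagged(tagged, identity_lemmatizer))
```

With NLTK installed, the lemmatize method of a WordNetLemmatizer instance could be passed in place of identity_lemmatizer.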
The second request does not work on our UI.
Moreover, the question "Who were the presidents of France?" gives "François Hollande" on WolframAlpha...
So I don't agree :)
It's a problem with our UI and WolframAlpha. We can hope that some modules will be able to understand it (and that the Wikidata module will be able to remove the "s"...).
If we always lemmatize, we lose all information about plurals.
Then we could also hope that some modules will be able to understand verbs in the past tense. "Who was the president of France?" and "Who is the president of France?" have different meanings (the first is an ambiguous sentence asking for some past president of France, the second asks for the current president).
Following this idea, we should not perform any lemmatization...
Perhaps... But I think it's less important for verbs, because the plural/temporal/... information contained in the verb often (always?) appears in other parts of the sentence:
For the plural, for example, the information is often contained in both the verb and the subject (= redundancy). If you lemmatize one of the two, you don't lose any information. If you lemmatize both, you lose it.
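To make the redundancy argument concrete, here is a small sketch that lemmatizes verbs only, or verbs and common nouns together. TOY_LEMMAS is a hypothetical hand-written table standing in for a real lemmatizer, normalize_question is a hypothetical name, and the tags are assumed to follow the Penn Treebank tagset:

```python
# Hypothetical toy lemma table; a real system would call a lemmatizer instead.
TOY_LEMMAS = {
    ("were", "v"): "be",
    ("was", "v"): "be",
    ("is", "v"): "be",
    ("presidents", "n"): "president",
}


def normalize_question(tagged_words, lemmatize_verbs=True, lemmatize_nouns=False):
    """Lemmatize only the selected word classes, leaving all other words as is."""
    result = []
    for word, tag in tagged_words:
        if lemmatize_verbs and tag.startswith("VB"):
            kind = "v"
        elif lemmatize_nouns and tag in ("NN", "NNS"):
            # Proper nouns (NNP, NNPS) are skipped on purpose: entities stay as is.
            kind = "n"
        else:
            kind = None
        if kind is None:
            result.append(word)
        else:
            result.append(TOY_LEMMAS.get((word.lower(), kind), word))
    return result


question = [("Who", "WP"), ("were", "VBD"), ("the", "DT"),
            ("presidents", "NNS"), ("of", "IN"), ("France", "NNP")]

# Verbs only: the plural survives on the noun.
print(normalize_question(question))
# Verbs and nouns: no word carries the plural any more.
print(normalize_question(question, lemmatize_nouns=True))
```

In the first call the plural is still visible in "presidents" even though "were" became "be"; in the second call it is gone from the whole sentence.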
Normalize words (only verbs?).
Ex: