anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing
http://anoopkunchukuttan.github.io/indic_nlp_library/
MIT License

How can I use the morph analyser as a stemmer? #5

Closed thak123 closed 8 years ago

thak123 commented 8 years ago

Can you please tell me how I can use the existing morphological analyser to get the stems of the words given as input to the Indic NLP library?

anoopkunchukuttan commented 8 years ago

Hi,

What we have in the Indic NLP library is a word segmenter, not a true morph analyzer; i.e., the library can break a word into its component units. So you will not get a stem directly and may have to do some post-processing. I can suggest a procedure that may work.

For example, the Marathi word घरासमोरचा may be segmented as घरा समोर चा.

Given this output, there are two ways you could obtain the stem:

  1. Take the first string given by the segmenter, in this case घरा. Note that this gives you the stem rather than the root, but that may be fine for many purposes. However, this approach fails if the stem itself contains multiple morphemes.

     e.g. महेश्वराचा may be segmented as महे श्वरा चा. Taking only the first segment would be wrong in this case.

  2. A better strategy would be to compile a list of function words like समोर and चा, and then use it to remove the suffixes from the output of the segmenter. You could compile the suffix list automatically by running the segmenter on a fairly large corpus and treating the most frequent segments as the ones to remove (they are likely to be suffixes).
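The two strategies above can be sketched roughly as follows. This is only an illustration that works on already-segmented words (lists of strings); it does not call the library itself. The function names (`first_segment_stem`, `build_suffix_list`, `strip_suffixes`), the `top_k` cutoff, and the choice to count only non-initial segments are assumptions made for this sketch. The toy corpus is hand-written for illustration (the first two entries are the segmentations quoted above), not actual segmenter output.

```python
from collections import Counter


def first_segment_stem(segments):
    """Strategy 1: treat the first segment as the stem."""
    return segments[0] if segments else ""


def build_suffix_list(segmented_corpus, top_k):
    """Collect the top_k most frequent non-initial segments as likely suffixes.

    Assumption of this sketch: word-initial segments are skipped so that a
    frequent stem is not mistaken for a suffix.
    """
    counts = Counter()
    for segments in segmented_corpus:
        counts.update(segments[1:])
    return {segment for segment, _ in counts.most_common(top_k)}


def strip_suffixes(segments, suffixes):
    """Strategy 2: drop trailing segments found in the suffix list, join the rest."""
    kept = list(segments)
    # Always keep at least one segment so the result is never empty.
    while len(kept) > 1 and kept[-1] in suffixes:
        kept.pop()
    return "".join(kept)


# Hand-written toy segmentations, standing in for segmenter output on a corpus.
toy_corpus = [
    ["घरा", "समोर", "चा"],
    ["महे", "श्वरा", "चा"],
    ["घरा", "चा"],
    ["शाळे", "समोर"],
]

suffixes = build_suffix_list(toy_corpus, top_k=2)  # {"चा", "समोर"}

print(first_segment_stem(toy_corpus[1]))        # महे (the failure case of strategy 1)
print(strip_suffixes(toy_corpus[0], suffixes))  # घरा
print(strip_suffixes(toy_corpus[1], suffixes))  # महेश्वरा
```

On a real corpus the cutoff would need tuning by inspecting the frequency list, since very common stems or stem fragments can also rank highly.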

As for using the segmenter, this documentation should help:

http://nbviewer.ipython.org/url/anoopkunchukuttan.github.io/indic_nlp_library/doc/indic_nlp_examples.ipynb

Hope this helps.

~Anoop

thak123 commented 8 years ago

Thanks for the prompt reply. I'll try this and report the results.