KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Write a stemmer for Tamil in Python #236

Open tshrinivasan opened 3 days ago

tshrinivasan commented 3 days ago

We need a tool to find the root word of any given Tamil word.

For example: கம்பரிடம் -> கம்பர், சென்னையில் -> சென்னை, மரத்தின் -> மரம்

For that, we need the Tamil rules for சேர்த்து எழுதுக (joining words when writing) and பிரித்து எழுதுக (splitting words when writing).
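To make the idea concrete, here is a minimal sketch of suffix-replacement stemming in Python. It is not the existing Snowball algorithm: the `SUFFIX_RULES` table and the `stem` function are hypothetical names, and the three rules are derived only from the three examples above, so they will over-stem or miss many real words.

```python
# Illustrative (suffix, replacement) pairs, longest suffix first.
# ASSUMPTION: these rules are inferred from the three examples only;
# a real stemmer needs the full Tamil joining/splitting rules.
SUFFIX_RULES = [
    ("த்தின்", "ம்"),                      # மரத்தின் -> மரம்
    ("\u0bbf\u0b9f\u0bae\u0bcd", "\u0bcd"),  # vowel sign i + டம் -> pulli: கம்பரிடம் -> கம்பர்
    ("யில்", ""),                          # சென்னையில் -> சென்னை
]


def stem(word: str) -> str:
    """Apply the first matching rule; return the word unchanged if none match."""
    for suffix, replacement in SUFFIX_RULES:
        # Require something to remain in front of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["கம்பரிடம்", "சென்னையில்", "மரத்தின்"]:
        print(w, "->", stem(w))
```

Note that the rules work on Unicode code points, so a suffix can begin with a combining vowel sign; that is why the second rule is written with escapes.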

@rdamodharan has written a Tamil algorithm for the Snowball stemmer. The Open-Tamil Python library has a Python implementation.

But it is not perfect. Check an online demo here: https://mazko.github.io/jssnowball/

Check the Snowball source of the Tamil algorithm (the Snowball compiler generates C from it) here: https://github.com/snowballstem/snowball/blob/master/algorithms/tamil.sbl

https://github.com/rdamodharan/tamil-stemmer/blob/master/docs/stemmer.png

https://github.com/rdamodharan/tamil-stemmer/

https://mazko.github.io/jssnowball/

What do we have to do?

  1. Rewrite the stemming algorithm in Python so that it is easy to understand.
  2. Document it well.
  3. Add more rules.
  4. Use the nouns from the all_tamil_nouns repo as a base list of words that must not be stemmed further.
  5. Once the Python version is complete, port it to Snowball so that it can reach more programming languages.
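Step 4 could be sketched as a guard around whatever stemming function is eventually written. Everything here is an assumption for illustration: `load_protected_nouns`, `make_guarded_stemmer`, and `naive_stem` are hypothetical names, and the noun list is assumed to be plain text with one noun per line, which may not match the actual layout of the all_tamil_nouns repo.

```python
from typing import Callable, FrozenSet, Iterable


def load_protected_nouns(lines: Iterable[str]) -> FrozenSet[str]:
    """Build the exception set from an iterable of lines (one noun per line, ASSUMED format)."""
    return frozenset(line.strip() for line in lines if line.strip())


def make_guarded_stemmer(
    stem_fn: Callable[[str], str], protected: FrozenSet[str]
) -> Callable[[str], str]:
    """Wrap stem_fn so that known nouns are returned untouched."""

    def guarded(word: str) -> str:
        if word in protected:
            return word  # already a base noun: do not stem further
        return stem_fn(word)

    return guarded


def naive_stem(word: str) -> str:
    """Deliberately over-eager placeholder rule: strip a trailing 'ம்'."""
    return word[:-2] if word.endswith("ம்") else word


if __name__ == "__main__":
    nouns = load_protected_nouns(["மரம்\n", "சென்னை\n", "\n"])
    guarded_stem = make_guarded_stemmer(naive_stem, nouns)
    print(naive_stem("மரம்"))    # over-stems the noun
    print(guarded_stem("மரம்"))  # noun is left as-is
```

Keeping the noun check outside the rule engine means the rules can stay aggressive while the word list prevents over-stemming of known nouns.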

tshrinivasan commented 3 days ago

The current Tamil stemmer online demo is here: https://tamilpesu.us/en/stemmer/

It is based on https://github.com/rdamodharan/tamil-stemmer/