anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing
http://anoopkunchukuttan.github.io/indic_nlp_library/
MIT License
549 stars 160 forks

Transliteration not proper for few characters in Tamil #11

Open vrindaprabhu opened 7 years ago

vrindaprabhu commented 7 years ago

Please find below the code I used to transliterate from Tamil to English (ITRANS) and back.

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

lang = 'ta'
input_text = u'ஒன்றுமட்டுமல்லாது'

# Tamil -> ITRANS
itrans_text = ItransTransliterator.to_itrans(input_text, lang)
print(itrans_text)
#OUTPUT : .oऩRumaTTumallAtu

# ITRANS -> Tamil (round trip)
x = ItransTransliterator.from_itrans(itrans_text, lang)
print(x)
#OUTPUT :  ஒனறுமட்டுமல்லாது

anoopkunchukuttan commented 7 years ago

Thanks for pointing this out. The extended ITRANS standard we defined probably does not have a mapping for this character. I will check this over the weekend.
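One way to narrow down which character is affected is to compare the reported strings code point by code point. The following is a minimal sketch in Python 3 using only the standard `unicodedata` module; the three strings are copied from the report, the helper names `first_difference` and `describe` are made up for illustration, and it assumes the ITRANS output is meant to be plain ASCII.

```python
import unicodedata

def first_difference(a, b):
    """Return the index of the first differing code point, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the overlap: either identical, or one is a prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))

def describe(ch):
    """Format one character as 'U+XXXX NAME'."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))

original = "ஒன்றுமட்டுமல்லாது"       # input string from the report
itrans = ".oऩRumaTTumallAtu"          # to_itrans() output from the report
round_tripped = "ஒனறுமட்டுமல்லாது"   # from_itrans() output from the report

# Assumption: a non-ASCII character in the ITRANS string marks a letter
# that had no romanized mapping.
for ch in itrans:
    if ord(ch) > 127:
        print("non-ASCII in ITRANS output:", describe(ch))

# Show where the round-tripped string first departs from the input.
i = first_difference(original, round_tripped)
if i is not None:
    print("strings diverge at index", i)
    print("  original:     ", describe(original[i]) if i < len(original) else "<end>")
    print("  round-tripped:", describe(round_tripped[i]) if i < len(round_tripped) else "<end>")
```

Printing the code point names rather than the glyphs makes it easier to tell a missing sign from a substituted letter, since combining signs are hard to spot by eye.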

arcturusannamalai commented 7 years ago

I wonder how this transliteration compares to the open-tamil package.

Anoop, will you be publishing this package on the Python package repository (PyPI)? Also, where are the unit tests for this project? I can't seem to find them.

vrindaprabhu commented 7 years ago

The open-tamil package also has some problems handling Unicode. You have to type the text out explicitly in Tamil to get the best results. The discrepancy I faced looks like this:

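from tamil import utf8  # open-tamil

# Python 2: decode a UTF-8 byte string directly
unicode("தொ", "utf-8")
#OUTPUT : u'\u0ba4\u0bc6\u0bbe'

# Split into letters with open-tamil, then re-join and decode
tamil_letter = utf8.get_letters("தொ")
utf_tamil = ''.join(tamil_letter).decode("utf-8")
#OUTPUT : u'\u0ba4\u0bca'

I used the open-tamil package. In the two cases the letters came from different sources, i.e. different texts.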
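The two outputs differ only in how the vowel sign is stored: U+0BC6 followed by U+0BBE in one, and the single code point U+0BCA in the other. These appear to be canonically equivalent forms of the same letter, which can be checked with the standard `unicodedata` module. A minimal sketch in Python 3 (the variable names are only illustrative):

```python
import unicodedata

decomposed = "\u0ba4\u0bc6\u0bbe"  # TA + VOWEL SIGN E + VOWEL SIGN AA
composed = "\u0ba4\u0bca"          # TA + VOWEL SIGN O

# A raw comparison treats the two sequences as different strings.
print("raw equal:", decomposed == composed)

# Normalizing both sides to the same form (NFC here) makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", composed)
print("equal after NFC:", nfc_equal)

# NFD goes the other way and splits the two-part vowel sign.
print("NFD code points:", [hex(ord(ch)) for ch in unicodedata.normalize("NFD", composed)])
```

If this is the cause, normalizing the input text to one form before tokenizing or comparing should make the results independent of how the text was typed.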

arcturusannamalai commented 7 years ago

@vrindaprabhu - please create a suitable issue and we can address it. Also http://libindic.org/ has interesting code bits.

arcturusannamalai commented 7 years ago

@vrindaprabhu - I checked on Python 3 with Open-Tamil version 0.51, and I'm not seeing the issue you report. get_letters() returns just one letter as the only element of the list.

vrindaprabhu commented 7 years ago

Strange. As I mentioned, it probably depends on how "தொ" is written. I did not face the issue all the time either, only with a few particular sentences in the corpus.
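To find which sentences in a corpus are written with a different code point sequence, one option is to flag every line that changes under NFC normalization. This is a rough sketch in Python 3, standard library only; the function name `non_nfc_lines` and the two-line sample corpus are hypothetical stand-ins for the real data.

```python
import unicodedata

def non_nfc_lines(lines):
    """Yield (line_number, line) for lines that change under NFC normalization."""
    for number, line in enumerate(lines, start=1):
        if unicodedata.normalize("NFC", line) != line:
            yield number, line

# Hypothetical sample: the same letter stored in two different ways.
corpus = [
    "\u0ba4\u0bca",        # single vowel-sign code point
    "\u0ba4\u0bc6\u0bbe",  # vowel sign split into two code points
]

for number, line in non_nfc_lines(corpus):
    print(number, [hex(ord(ch)) for ch in line])
```

For a real corpus, the list could be replaced by an open file object, since the function only needs an iterable of strings.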

arcturusannamalai commented 7 years ago

@vrindaprabhu - there were Unicode normalization issues, and these are fixed in version 0.65.