Open vrindaprabhu opened 7 years ago
Thanks for pointing this out. The extended ITRANS standard we defined probably does not have a mapping for this character. I will check this over the weekend.
I wonder how this transliteration compares to the open-tamil package.
Anoop, will you be publishing this package on the Python Package Index (PyPI)? Where are the unit tests for this project? I can't seem to find them.
The open-tamil package also has some problems handling Unicode. You have to type the text out explicitly in Tamil to get the best results. The discrepancy I ran into looks like this:
unicode("தொ","utf-8")
# Output: u'\u0ba4\u0bc6\u0bbe'
from tamil import utf8  # utf8 module from the open-tamil package
tamil_letter = utf8.get_letters("தொ")
utf_tamil = ''.join(tamil_letter).decode("utf-8")
# Output: u'\u0ba4\u0bca'
I used the open-tamil package. In the two cases the letters came from different sources, i.e. different texts.
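To make the two outputs above easier to compare, here is a minimal sketch that lists the code points in each string using only Python 3's standard `unicodedata` module. It does not call open-tamil at all; the variable names and the `describe` helper are hypothetical and chosen just for illustration.

```python
import unicodedata

# The two results quoted above, written as explicit escapes so the
# code points do not depend on how the text was typed or pasted.
decomposed = "\u0ba4\u0bc6\u0bbe"   # three code points
precomposed = "\u0ba4\u0bca"        # two code points


def describe(text):
    """Return (code point, Unicode name) pairs for each character (hypothetical helper)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label, len(text), describe(text))
```

The first string spells the vowel sign as two separate marks (VOWEL SIGN E followed by VOWEL SIGN AA), while the second uses the single VOWEL SIGN O, so the strings render the same but do not compare equal.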
@vrindaprabhu - please create a suitable issue and we can address it. Also http://libindic.org/ has interesting code bits.
@vrindaprabhu - I checked on Python 3 with Open-Tamil version 0.51 and I'm not seeing the issue you report. get_letters() returns just one letter as the only element of the list.
Strange. As I mentioned, it probably depends on how "தொ" is written. I did not hit the issue every time either, only with a few particular sentences in the corpus.
@vrindaprabhu - there are Unicode normalization issues, and these are fixed in version 0.65.
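For anyone stuck on an older version, one possible workaround is to normalize the input before processing it. The sketch below uses Python 3's standard `unicodedata.normalize` with the NFC form; it is an assumption about a reasonable preprocessing step, not a description of how the fix in 0.65 is implemented, and `to_nfc` is a hypothetical helper name.

```python
import unicodedata


def to_nfc(text):
    """Compose canonically equivalent sequences into their NFC form (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0ba4\u0bc6\u0bbe"   # TA + VOWEL SIGN E + VOWEL SIGN AA
precomposed = "\u0ba4\u0bca"        # TA + VOWEL SIGN O

# After normalization both spellings compare equal.
print(to_nfc(decomposed) == precomposed)
print(to_nfc(precomposed) == precomposed)
```

Running every corpus line through such a helper before splitting it into letters should make the result independent of how the text was originally typed.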
Please find the code below for transliterating from Tamil to English.