Edge cases for Tamil Language to be handled

alvinlindstam / grapheme

A python package for grapheme aware string handling

MIT License

108 stars, 7 forks

First, thanks for taking the time to write the article on Medium (where I discovered this project) and the GitHub project that goes with it.

I was playing around with strings using text in my mother tongue (Tamil, spoken in India and beyond). The standard Python libraries have the same issues handling Tamil text that you mentioned. In a quick test, grapheme seems to work except in one or two edge cases.

About 15 years ago, when I faced the same issue, I wrote a paper for a conference and submitted code written in .NET, Perl and VB.NET to solve this problem specifically for Tamil. You can check it out here: https://venkatarangan.com/blog/2004/12/counting-letters-in-an-unicode-string/

To the best of my knowledge, these two Tamil Grantha characters, "\u0B95\u0BCD\u0BB7" (க்ஷ) and "\u0BB8\u0BCD\u0BB0\u0BC0" (ஸ்ரீ), have to be handled as exceptions for Tamil; similar ones may exist for other non-Latin scripts too. A Tamil reader sees each of them as a single letter, but each is counted as two instead of one, even in grapheme.
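To make the report concrete, here is a minimal sketch that uses only the standard library. It is not how grapheme is implemented and it is not the code from the blog post linked above; `naive_clusters` and `tamil_letters` are hypothetical helper names, and the rule "attach every Mn/Mc mark to the preceding base character" is a simplifying assumption rather than the full Unicode segmentation algorithm. It only illustrates why the two sequences come out as two units, and how treating them as exceptions brings the count back to one.

```python
import unicodedata

# The two sequences from this report, written as escapes.
KSSA = "\u0B95\u0BCD\u0BB7"        # KA + VIRAMA + SSA
SRI = "\u0BB8\u0BCD\u0BB0\u0BC0"   # SA + VIRAMA + RA + VOWEL SIGN II
EXCEPTIONS = (SRI, KSSA)           # hypothetical exception table

COMBINING = ("Mn", "Mc")           # non-spacing and spacing combining marks


def naive_clusters(text):
    """Attach each combining mark to the base character before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in COMBINING:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tamil_letters(text):
    """Same grouping, but keep the listed conjuncts together as one letter."""
    letters = []
    i = 0
    while i < len(text):
        for seq in EXCEPTIONS:
            if text.startswith(seq, i):
                letters.append(seq)
                i += len(seq)
                break
        else:
            ch = text[i]
            if letters and unicodedata.category(ch) in COMBINING:
                letters[-1] += ch
            else:
                letters.append(ch)
            i += 1
    return letters


for sample in (KSSA, SRI):
    print(len(naive_clusters(sample)), len(tamil_letters(sample)))
```

With this sketch, each sample is split into two clusters by the naive grouping (the virama closes the first cluster and the next consonant opens a new one) and into one letter once the exception table is consulted. A vowel sign that follows an exception sequence is still attached to it, because it falls through to the combining-mark branch.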

Edge cases for Tamil Language to be handled #8