Closed by venkatarangan 3 years ago
Hey, thanks a bunch for the link and the feedback. I'll read up on it later, when I have some more time.
I'm not at all familiar with Tamil. This library is built to implement the default grapheme cluster rules as defined in Unicode Standard Annex #29 (UAX #29). Would you say that we fail to do so, or that those rules are insufficient for Tamil? The annex itself says:
"This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments."
If we were to improve this handling, is there some generalized way to do it or would it require locale-aware edge case handling?
First, thanks for taking the time to write the article on Medium (where I discovered this project) and for the GitHub project that goes with it.
I was playing around with strings using text in my mother tongue (Tamil, spoken in India and beyond). The standard Python libraries have the same issues in handling Tamil text that you mentioned. On a quick try, Grapheme seems to work except in one or two edge cases.
About 15 years ago, when I faced the same issue, I wrote a paper for a conference and submitted code written in .NET, Perl and VB.NET to solve this problem specifically for Tamil. You can check it out here: https://venkatarangan.com/blog/2004/12/counting-letters-in-an-unicode-string/
To the best of my knowledge, these two Tamil Grantha sequences ("\u0B95\u0BCD\u0BB7" and "\u0BB8\u0BCD\u0BB0\u0BC0") have to be handled as exceptions for Tamil; similar ones may exist for other non-Latin scripts too. Basically, each of them should count as one letter, but it is counted as two, even in Grapheme.
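To make the edge case concrete, here is a minimal standard-library sketch. It does not call Grapheme at all. `simple_clusters` is only a rough stand-in for the default rules (it attaches every combining mark to the character before it), and `tamil_letters` together with the `TAMIL_CONJUNCTS` list are hypothetical names for one possible Tamil-specific tailoring, assuming the two sequences above are the only exceptions needed.

```python
import unicodedata

# Assumption: the two sequences from this issue; the list may not be exhaustive.
TAMIL_CONJUNCTS = ("\u0B95\u0BCD\u0BB7", "\u0BB8\u0BCD\u0BB0\u0BC0")


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def simple_clusters(text):
    """Rough approximation of default clustering: base character + following marks."""
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tamil_letters(text):
    """Same as simple_clusters, but keeps the listed conjuncts as one letter."""
    letters, i = [], 0
    while i < len(text):
        # Check for a known conjunct at this position before falling back.
        match = next((c for c in TAMIL_CONJUNCTS if text.startswith(c, i)), None)
        j = i + (len(match) if match else 1)
        # Absorb any marks that follow the base character or conjunct.
        while j < len(text) and is_mark(text[j]):
            j += 1
        letters.append(text[i:j])
        i = j
    return letters


ksha = "\u0B95\u0BCD\u0BB7"
print(len(ksha), len(simple_clusters(ksha)), len(tamil_letters(ksha)))
```

With this sketch, `len()` counts code points, `simple_clusters` splits each of the two sequences into two pieces, and `tamil_letters` reports one letter for each. A hard-coded exception list like this is the locale-specific route; whether a more general rule exists is exactly the open question in this thread.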