Ezhil-Language-Foundation / open-tamil

Open Source Tamil NLP Tools - தமிழ் இயற்கை மொழி பகுப்பாய்வு நிரல்தொகுப்பு
http://tamilpesu.us
MIT License

tamil regex character classes match unintended characters #228

Closed vanangamudi closed 3 years ago

vanangamudi commented 3 years ago

'^[சிகு]' is the intended expression for lines that start with either 'சி' or 'கு', just as in English '^[ab]' matches lines that start with either 'a' or 'b'.

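But since Unicode represents letters in some eastern languages with multiple code points, the character class is actually interpreted as '^[ச,ி,க,ு]' (commas added for clarity), because சி -> ச,ி and கு -> க,ு. The expression therefore also matches lines that start with a bare 'ச' or 'க', or with either consonant followed by any other vowel sign.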
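To make the decomposition concrete, here is a minimal sketch using only Python's standard `re` and `unicodedata` modules. The word list is made up for illustration and is not the data from the original report.

```python
import re
import unicodedata

# Each of these "letters" is stored as two code points:
# a consonant followed by a dependent vowel sign.
for letter in ("சி", "கு"):
    names = [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in letter]
    print(letter, len(letter), names)

# The character class therefore holds four separate code points
# (ச, ி, க, ு), not the two letters சி and கு.
pattern = re.compile("^[சிகு]")

# Illustrative sample words (an assumption, not the original test data).
words = ["சிலை", "குடை", "சட்டம்", "கடல்", "மரம்"]
matched = [w for w in words if pattern.search(w)]
print(matched)
```

Only the first two words actually start with சி or கு, yet சட்டம் and கடல் are matched as well because they start with the bare consonants ச and க.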

Running the expression over a few words in Python gives the following results (you can see the full results here).

Note: the expected results can be obtained with the expression '^(சி|கு)', but that only works for this specific case. Shouldn't there be a general way to write expressions that match something like சிசிசிகுகுசிகு?
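One way to generalize the alternation workaround with plain `re` is to build a non-capturing group from a list of multi-code-point letters and use it wherever a character class would otherwise go. This is only a sketch: `letter_class` is a hypothetical helper written for this example and is not part of open-tamil or the standard library.

```python
import re

def letter_class(letters):
    """Build a regex fragment matching any one of the given letters.

    Hypothetical helper: emulates a character class for letters that
    span several code points by using alternation instead of [...].
    """
    # Put longer sequences first so a short alternative cannot win
    # over a longer one that shares its prefix.
    ordered = sorted(letters, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(l) for l in ordered) + ")"

cls = letter_class(["சி", "கு"])

starts_with = re.compile("^" + cls)       # line starts with சி or கு
whole_run = re.compile("^" + cls + "+$")  # line is only சி / கு repeated

print(bool(starts_with.search("சிலை")))            # starts with சி
print(bool(starts_with.search("சட்டம்")))          # starts with bare ச
print(bool(whole_run.search("சிசிசிகுகுசிகு")))    # run of சி and கு only
```

Because the group behaves like a single unit, quantifiers such as `+` apply to whole letters rather than to individual code points.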

Regex in Tamil is not a Python issue; it is a Unicode issue.

Personal opinion: as I continue to work with Tamil Unicode, I keep thinking that we should default to the TACE16 encoding.


arcturusannamalai commented 3 years ago

A solution to this problem is possible with the tamil.regexp module, as illustrated in the tests/tamil_regexp.py test suite.