Tamil regex character classes match unintended characters

'^[சிகு]' is the intended expression for matching lines that start with either 'சி' or 'கு', just as in English '^[ab]' matches lines that start with either 'a' or 'b'.

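But since Unicode represents letters in some Eastern languages with multiple code points, the class is actually read as '^[ச,ி,க,ு]' (commas added for clarity), because சி -> ச,ி and கு -> க,ு. As a result, it also matches lines that start with a bare ச or க.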
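The decomposition described above can be checked directly. The following is a minimal sketch using only the Python standard library; the helper name `code_points` and the two sample words are illustrative assumptions, not taken from the original report.

```python
import re
import unicodedata

CI = "\u0b9a\u0bbf"  # சி = ச + ி
KU = "\u0b95\u0bc1"  # கு = க + ு

def code_points(text):
    """Return the Unicode name of every code point in text (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in text]

# Each syllable is two code points: a consonant followed by a vowel sign.
print(code_points(CI))
print(code_points(KU))

# The class below therefore holds four independent code points, not two syllables.
pattern = re.compile("^[" + CI + KU + "]")

# Sample words that begin with a bare consonant (assumed examples).
unintended = [
    "\u0b9a\u0b9f\u0bcd\u0b9f\u0bae\u0bcd",  # சட்டம், starts with ச rather than சி
    "\u0b95\u0b9f\u0bb2\u0bcd",              # கடல், starts with க rather than கு
]
for word in unintended:
    # Both words match, although neither starts with சி or கு.
    print(word, bool(pattern.match(word)))
```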

Running the expression over a few words in Python gives the following results (you can see the full results here).

Note: the expected results can be obtained with the expression '^(சி|கு)', but that only works for this specific case. Shouldn't there be a general way to write expressions that match a string such as சிசிசிகுகுசிகு?
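One possible way to generalize the alternation workaround is to build the group from a list of syllables and then use it wherever a character class would otherwise go. This is only a sketch: `syllable_group` is a hypothetical helper, not part of Python's `re` module or of open-tamil, and it may not be the solution the question is looking for.

```python
import re

CI = "\u0b9a\u0bbf"  # சி
KU = "\u0b95\u0bc1"  # கு

def syllable_group(syllables):
    """Build a non-capturing alternation that treats each syllable as one unit."""
    # Longest first, so a longer syllable is tried before any shorter prefix of it.
    ordered = sorted(syllables, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

group = syllable_group([CI, KU])
starts_with = re.compile("^" + group)        # line starts with சி or கு
only_these = re.compile("^" + group + "+$")  # line consists only of சி and கு

sample = CI * 3 + KU * 2 + CI + KU  # சிசிசிகுகுசிகு
bare = "\u0b9a\u0b9f\u0bcd\u0b9f\u0bae\u0bcd"  # சட்டம், starts with a bare ச

print(bool(only_these.match(sample)))
print(bool(starts_with.match(bare)))
```

Because the group is repeatable like any other regex atom, quantifiers such as `+` or `{2,4}` apply to whole syllables rather than to individual code points.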

Regex in Tamil is not a Python issue; it is a Unicode issue.
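Since the problem sits at the code-point level, one option is to split text into letters before comparing anything. The sketch below assumes that attaching each combining mark (Unicode general category starting with "M") to the preceding base character is a good enough approximation of a Tamil letter. `split_letters` and `starts_with_any` are hypothetical names, and conjuncts such as க்ஷ are not treated as a single unit here.

```python
import unicodedata

def split_letters(text):
    """Group each base character with the combining marks that follow it."""
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            # Vowel signs and the pulli attach to the previous base character.
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters

def starts_with_any(word, syllables):
    """Return True if the first letter of word is one of the given syllables."""
    letters = split_letters(word)
    return bool(letters) and letters[0] in set(syllables)

CI = "\u0b9a\u0bbf"  # சி
KU = "\u0b95\u0bc1"  # கு

print(split_letters("\u0b9a\u0b9f\u0bcd\u0b9f\u0bae\u0bcd"))          # letters of சட்டம்
print(starts_with_any("\u0b9a\u0bbf\u0bb1\u0bc1", [CI, KU]))          # சிறு
print(starts_with_any("\u0b9a\u0b9f\u0bcd\u0b9f\u0bae\u0bcd", [CI, KU]))  # சட்டம்
```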

Personal opinion: as I continue to work with Tamil Unicode, I keep thinking that we should default to the TACE16 encoding.


Ezhil-Language-Foundation / open-tamil

tamil regex character classes match unintended characters #228