'^[சிகு]' is intended to match lines that start with either 'சி' or 'கு', just as in English '^[ab]' matches lines that start with either 'a' or 'b'.
But Unicode represents each of these Tamil letters with multiple code points (சி -> ச,ி and கு -> க,ு), so the regex engine actually reads the class as '^[ச,ி,க,ு]' (commas added for clarity): four separate code points, any one of which can match.
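To make the split visible, here is a minimal sketch that lists the code points behind each letter using the standard `unicodedata` module. The helper name `code_points` is my own, made up for illustration.

```python
import unicodedata

def code_points(text):
    """Return (hex code point, Unicode name) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Each letter that looks like one character is stored as two code points:
# a consonant followed by a dependent vowel sign.
for letter in ("சி", "கு"):
    print(letter, len(letter), code_points(letter))
```

Python's `len()` reports 2 for each letter, which is the same view of the string that the `re` module works with.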
Running the expression over a few words in Python gives the following results (you can see the full results here).
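A rough sketch of that kind of run, using Python's built-in `re` module. The word list is my own sample picked for illustration, not the original test data.

```python
import re

pattern = re.compile(r"^[சிகு]")

# Hypothetical sample words, chosen only to illustrate the behaviour.
words = ["சிறு", "குடை", "சாவி", "கடல்", "மரம்"]

# The class matches a single code point, so any word starting with
# ச or க matches, whatever vowel sign follows.
matches = {word: bool(pattern.match(word)) for word in words}
for word, matched in matches.items():
    print(word, matched)
```

Words such as சாவி and கடல் match even though they start with neither 'சி' nor 'கு', because only the first code point is checked against the class.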
Note: the expected results can be obtained with the expression '^(சி|கு)', but that only covers this specific case. Shouldn't there be a way to write expressions that match something like சிசிசிகுகுசிகு?
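For what it's worth, one workaround in plain `re` is to treat the alternation as a single unit and put a quantifier on it. This is only a sketch of that idea, not a general fix for character classes; the names `unit`, `starts_with` and `whole_run` are made up for the example.

```python
import re

# Treat each two-code-point letter as one alternative in a non-capturing group.
unit = r"(?:சி|கு)"

starts_with = re.compile(rf"^{unit}")    # line starts with சி or கு
whole_run = re.compile(rf"^{unit}+$")    # line is made up only of சி and கு

print(bool(whole_run.match("சிசிசிகுகுசிகு")))  # True
print(bool(whole_run.match("சிகா")))            # False: கா is not கு
print(bool(starts_with.match("கடல்")))          # False: bare க is not கு
```

This works, but every "letter class" has to be spelled out as an alternation by hand, which is the inconvenience the note above is pointing at.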
Regex in Tamil is not a Python issue; it is a Unicode issue.
Personal opinion: as I continue to work with Tamil Unicode, I keep thinking that we should default to the TACE16 encoding.