Richienb / char-regex

A regex to match any full character, considering weird character ranges.
MIT License

Add support for Telugu #1

Closed: otacke closed this issue 2 years ago

otacke commented 2 years ago

Hi!

I think I might have a small contribution to make, but I am not entirely sure that I understand the whole Unicode grapheme system.

Your regular expression works like a charm for emojis and many other graphemes that consist of multiple code points, but I think the Telugu Unicode block is missing. If I am not mistaken, all the code points that are drawn on a dotted circle in the code charts are combining code points that attach to a base character within a grapheme. At least, adding this block

const comboTelugu = "\\u0c00-\\u0c03\\u0c3e-\\u0c44\\u0c46-\\u0c48\\u0c4a-\\u0c4d\\u0c56-\\u0c56\\u0c62-\\u0c63"

to the comboRange seems to split Telugu strings correctly.

Would that be something worth a pull request?
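
To illustrate the proposal, here is a minimal sketch of how a combining-mark range can keep Telugu marks attached to their base character. It is not the actual char-regex source: the pattern `teluguChar` and the helper `splitTelugu` are hypothetical names made up for this example, and the pattern only knows about the Telugu ranges quoted above.

```javascript
// Telugu combining ranges, quoted from the comment above.
const comboTelugu = "\\u0c00-\\u0c03\\u0c3e-\\u0c44\\u0c46-\\u0c48\\u0c4a-\\u0c4d\\u0c56-\\u0c56\\u0c62-\\u0c63"

// Hypothetical, simplified pattern (an assumption, not the library's regex):
// one base code point followed by any number of Telugu combining marks,
// or a run of stray combining marks that has no base in front of it.
const teluguChar = new RegExp(`[^${comboTelugu}][${comboTelugu}]*|[${comboTelugu}]+`, "gu")

// Hypothetical helper: split a string into base-plus-marks clusters.
function splitTelugu(string) {
	return string.match(teluguChar) || []
}

// "తెలుగు" (Telugu): three consonants, each followed by one vowel sign.
const word = "\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"

console.log([...word].length) // → 6 (individual code points)
console.log(splitTelugu(word).length) // → 3 (one cluster per consonant plus vowel sign)
```

Spreading the string splits it into six code points, while the sketch groups each consonant with its vowel sign. A real implementation would merge the Telugu ranges into the existing combining range rather than use a separate pattern.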

Richienb commented 2 years ago

Makes sense, ensure you add tests as well.

otacke commented 2 years ago

Excellent. I know someone who writes Telugu. I'll ask for a complete Telugu alphabet to create a test case.

otacke commented 2 years ago

Okay, I have the test running, based on a programmatically generated array of all the code point combinations that I was told could exist, and everything is working fine for Telugu symbols whose grapheme clusters consist of two Unicode code points. I still have to figure out how the regular expression should handle symbols that consist of three Unicode code points (these overlap with the ones that consist of two). I won't find the time before next week, however.
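
A test of the kind described above could be sketched as follows. This is an assumption-laden illustration, not the test from the pull request: the pattern `teluguChar`, the helpers `range` and `isSingleMatch`, and the choice of sample code points (consonants U+0C15 to U+0C39 without the unassigned U+0C29, the dependent vowel signs, and the anusvara U+0C02 as a third code point) are all made up for this example. The thread does not say which three-code-point symbols were meant.

```javascript
const comboTelugu = "\\u0c00-\\u0c03\\u0c3e-\\u0c44\\u0c46-\\u0c48\\u0c4a-\\u0c4d\\u0c56-\\u0c56\\u0c62-\\u0c63"

// Hypothetical simplified pattern: a base code point plus trailing Telugu marks.
const teluguChar = new RegExp(`[^${comboTelugu}][${comboTelugu}]*|[${comboTelugu}]+`, "gu")

// Inclusive range of code points, each as a one-character string.
const range = (from, to) => Array.from({length: to - from + 1}, (_, i) => String.fromCodePoint(from + i))

// Assumed sample data, chosen only for this sketch.
const consonants = range(0x0c15, 0x0c28).concat(range(0x0c2a, 0x0c39))
const vowelSigns = range(0x0c3e, 0x0c44).concat(range(0x0c46, 0x0c48), range(0x0c4a, 0x0c4c))
const anusvara = "\u0c02"

// Every consonant + vowel sign pair, then each pair with an anusvara appended.
const twoCodePoints = consonants.flatMap(consonant => vowelSigns.map(sign => consonant + sign))
const threeCodePoints = twoCodePoints.map(pair => pair + anusvara)

// A cluster passes if the pattern matches it exactly once, as a whole.
const isSingleMatch = cluster => {
	const matches = cluster.match(teluguChar) || []
	return matches.length === 1 && matches[0] === cluster
}

const failures = twoCodePoints.concat(threeCodePoints).filter(cluster => !isSingleMatch(cluster))
console.log(twoCodePoints.length, failures.length) // → 468 0

// A conjunct (consonant + virama + consonant) ends in a base character,
// so this simple pattern splits it into two matches.
const conjunct = "\u0c15\u0c4d\u0c37"
console.log(conjunct.match(teluguChar).length) // → 2
```

In this sketch, three-code-point clusters that end in a combining mark pass, while a conjunct that ends in a second consonant is split after the virama. Whether that is the overlap mentioned above is not stated in the thread.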

otacke commented 2 years ago

Suggestion in https://github.com/Richienb/char-regex/pull/2