camertron / utfstring

UTF-safe string operations in JavaScript.
MIT License

Combining characters not treated as a single character #20

Open erickskrauch opened 4 months ago

erickskrauch commented 4 months ago

Hello.

Users of my project reported that typing characters such as Д̌ and ӓ̄ produces invalid output. I investigated and found that this library doesn't recognize each of them as a single character, neither with UtfString nor with UtfVisualString.

I have a dictionary of Cyrillic characters, most of which cause this problem: https://github.com/erickskrauch/da-pizda-bot/blob/5957cdf1c0cdc83ceaa39a95fd8cbffe5527ff8d/src/unicode.ts#L11-L15.

The original issue: https://github.com/erickskrauch/da-pizda-bot/issues/18.

camertron commented 4 months ago

Thanks for bringing this up 😄 It sounds like you're running into issues with Unicode normalization. Generally speaking, Unicode defines a composed character (i.e. a single code point) for many accented characters, as well as separate code points for the base letter and for each combining mark. Normalization is a complicated process that requires several large lookup tables, so I chose to omit it from this library to keep bundle size down. I would recommend calling .normalize("NFC") on any strings you pass to UtfString functions, so that base characters and their combining marks are composed into single code points where possible.
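
A minimal sketch of what the suggested call does, using only the built-in String.prototype.normalize. The variable names are illustrative. Note the "where possible" caveat: NFC can only compose a sequence when Unicode defines a precomposed code point for it, and as far as I can tell Д followed by a combining caron (U+0414 U+030C) has none, so that pair stays as two code points after normalization.

```javascript
// "e" + COMBINING ACUTE ACCENT (U+0301) has a precomposed form, U+00E9,
// so NFC merges the pair into a single code point.
const decomposedE = "e\u0301";
const composedE = decomposedE.normalize("NFC");
console.log([...decomposedE].length); // 2 code points before
console.log([...composedE].length);   // 1 code point after

// CYRILLIC CAPITAL LETTER DE (U+0414) + COMBINING CARON (U+030C):
// no precomposed code point exists, so NFC leaves the sequence unchanged.
const deCaron = "\u0414\u030C";
const deCaronNfc = deCaron.normalize("NFC");
console.log([...deCaronNfc].length);  // still 2 code points
```

So normalizing first helps for characters that have a composed form, but it may not be enough on its own for the characters reported in this issue.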

I would also welcome a PR that introduces normalization-aware versions of UtfString's functions, provided it doesn't result in a large increase in bundle size.
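
For anyone considering such a PR, here is one possible direction, sketched under assumptions: it relies on the runtime's built-in Intl.Segmenter (availability varies by runtime) to split a string into grapheme clusters, so it ships no lookup tables in the bundle. The helpers graphemes, graphemeLength, and graphemeAt are hypothetical names and are not part of utfstring.

```javascript
// Hypothetical helpers, not part of utfstring.
// Intl.Segmenter with "grapheme" granularity groups a base character
// together with the combining marks that follow it.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split a string into user-perceived characters (grapheme clusters).
// NFC is applied first so composable sequences become single code points.
function graphemes(str) {
  return Array.from(segmenter.segment(str.normalize("NFC")), (s) => s.segment);
}

// Number of grapheme clusters in the string.
function graphemeLength(str) {
  return graphemes(str).length;
}

// Grapheme cluster at the given index, or undefined if out of range.
function graphemeAt(str, index) {
  return graphemes(str)[index];
}

const sample = "\u0414\u030Ca"; // Д + combining caron, followed by "a"
console.log(graphemeLength(sample)); // 2
console.log(graphemeAt(sample, 1));  // "a"
```

This sketch re-segments the whole string on every call, which keeps it short but would be wasteful for long strings; a real implementation would likely cache or iterate lazily.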