Open erickskrauch opened 4 months ago
Thanks for bringing this up 😄 It sounds like you're running into issues with Unicode normalization, which you can read more about here. Generally speaking, Unicode defines a composed character (i.e. a single code point) for many accented characters, but also individual characters for the base glyph and any combining accent marks. Normalization is a complicated process that requires several large lookup tables, so I chose to omit it from this library to keep bundle size down. I would recommend calling .normalize("NFC") (docs) on any strings you pass to UtfString functions to ensure characters and any combining marks are composed together into single characters where possible.
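As a rough illustration of what "where possible" means here, the sketch below counts code points before and after NFC normalization. The helper name codePointCount is hypothetical and not part of UtfString, and the Д example assumes the reported character is U+0414 followed by U+030C (combining caron), a sequence for which Unicode defines no precomposed form.

```typescript
// Hypothetical helper (not part of UtfString): counts Unicode code points.
function codePointCount(s: string): number {
  return Array.from(s).length;
}

// "e" + U+0301 COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
// so NFC collapses it into a single code point.
const decomposedE = "e\u0301";
const composedE = decomposedE.normalize("NFC");

// Assumption: the reported "Д̌" is U+0414 + U+030C COMBINING CARON.
// No precomposed form exists, so NFC leaves both code points in place.
const deCaron = "\u0414\u030C";
const deCaronNfc = deCaron.normalize("NFC");

console.log(codePointCount(decomposedE), codePointCount(composedE)); // 2 1
console.log(codePointCount(deCaron), codePointCount(deCaronNfc));    // 2 2
```

So normalization alone may not be enough for sequences like the second one; they stay multi-code-point even after .normalize("NFC").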
I would also welcome a PR that introduces normalization-aware versions of UtfString's functions, provided it doesn't result in a large increase in bundle size.
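One possible direction for such a PR, sketched under assumptions: group each base character with the combining marks that follow it, using the engine's built-in Unicode property escapes instead of shipped lookup tables. The function name splitVisualCharacters is hypothetical and is not an existing UtfString API; this is only a minimal sketch, not how the library works today.

```typescript
// Hypothetical helper (not part of UtfString).
// Matches one non-mark code point plus any following combining marks,
// or a run of stray marks that has no base character in front of it.
const CLUSTER = new RegExp("\\P{M}\\p{M}*|\\p{M}+", "gu");

function splitVisualCharacters(input: string): string[] {
  // Normalize first so composable sequences become single code points;
  // whatever cannot be composed is still kept together by the regex.
  return input.normalize("NFC").match(CLUSTER) ?? [];
}

// U+0414 U+030C ("Д̌") followed by U+0430 ("а"): two visual characters.
console.log(splitVisualCharacters("\u0414\u030C\u0430").length); // 2
```

This does not cover every grapheme cluster rule (for example, emoji joined with zero-width joiners); where it is available, Intl.Segmenter with granularity "grapheme" is a more complete option.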
Hello.
Users of my project noticed that typing symbols like Д̌ and ӓ̄ produces invalid output. I investigated the problem and found that this library doesn't recognize them as a single character, neither with UtfString nor with UtfVisualString.
I have a dictionary of Cyrillic characters, most of which cause this problem: https://github.com/erickskrauch/da-pizda-bot/blob/5957cdf1c0cdc83ceaa39a95fd8cbffe5527ff8d/src/unicode.ts#L11-L15.
The original issue: https://github.com/erickskrauch/da-pizda-bot/issues/18.