Open mathiasbynens opened 10 years ago
Yep, unicode issues abound, thanks for the report. I'll see about fixing this.
However, this brings up the question: what is considered to be a "character"? Is it what JS considers a character (a UTF-16 code unit, so each half of a surrogate pair counts separately), a Unicode code point, a grapheme cluster, or something else? For example, for strings with multiple combining marks, such as the one in the example below, does truncate attempt to preserve those marks or just slice by code points? It seems to make more sense to include the combining marks, but that makes everything much more complicated, since an implementation of the Unicode Text Segmentation algorithm would then be necessary.
// by grapheme cluster
slang.truncate('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞', 2) // 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢'
// or by code point
slang.truncate('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞', 2) // 'Z͑'
As of now, slang treats a character the way JS does, as a UTF-16 code unit. I think the most correct approach, when slicing or doing other such operations, is to treat a "character" as a range of code points (i.e. a grapheme cluster). There is a good case for a library to normalize some Unicode issues that JavaScript the language doesn't already account for. However, the practicality of this might be dubious, just in terms of the amount of data that would need to be downloaded (the Unicode tables are huge!). I'd be interested to hear your thoughts on this as well!
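To make the two options concrete, here is a minimal sketch of both behaviors. The function names are hypothetical and are not part of slang. The grapheme version assumes a runtime that ships `Intl.Segmenter`, which postdates this discussion; where it is available, the segmentation data comes from the runtime rather than from tables bundled with the library.

```javascript
// Hypothetical helpers for illustration; not slang's actual API.

// Truncate by Unicode code point. Array.from iterates a string by code
// point, so surrogate pairs stay intact, but every combining mark still
// counts as its own "character".
function truncateByCodePoint(str, n) {
  return Array.from(str).slice(0, n).join('');
}

// Truncate by grapheme cluster. Assumes Intl.Segmenter exists in the
// runtime; it applies the Unicode text segmentation rules for us.
function truncateByGrapheme(str, n) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count >= n) break;
    out += segment;
    count += 1;
  }
  return out;
}

// "e" followed by two combining marks, then "x".
const sample = 'e\u0301\u0323x';
const byCodePoint = truncateByCodePoint(sample, 2); // 'e\u0301' (second mark dropped)
const byGrapheme = truncateByGrapheme(sample, 2);   // 'e\u0301\u0323x' (marks preserved)
```

The code point version splits a base letter from some of its marks, which is the behavior shown in the "by code point" example above.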
Since truncate results in ‘data loss’ anyhow, it might be acceptable to normalize the input first and then truncate that. Or, if you’re looking for a simpler / less heavyweight solution, just strip all the combining marks before further processing the string.
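A rough sketch of both suggestions, using only built-in string methods. The helper names are hypothetical, and the `\p{M}` property escape assumes an ES2018-capable engine. Note that stripping is lossy: it also removes ordinary accents, so 'é' becomes 'e'.

```javascript
// Hypothetical helpers for illustration; not slang's actual API.

// Option 1: normalize to NFC so that base + mark sequences with a
// precomposed form collapse into a single code point before truncating.
function normalizeThenTruncate(str, n) {
  return Array.from(str.normalize('NFC')).slice(0, n).join('');
}

// Option 2: decompose (NFD), then drop every code point in the Unicode
// "Mark" general category. What remains are the base characters.
function stripCombiningMarks(str) {
  return str.normalize('NFD').replace(/\p{M}/gu, '');
}

const composed = normalizeThenTruncate('e\u0301x', 1);       // 'é' as one code point
const stripped = stripCombiningMarks('Z\u0351\u036BA\u0334'); // 'ZA'
```

Option 1 only helps where a precomposed form exists; heavily stacked marks like those in the example string do not compose, which is where option 2 applies.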
Either way, it would already be much better to just add support for astral symbols (leaving aside combining marks for now) — and doing so would only take a few lines of code. It would be a major improvement over the current behavior that only deals with UCS-2-like code units IMHO.
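As a sketch of what "a few lines of code" could look like, the regular expression below treats a surrogate pair as a single symbol and falls back to a single code unit otherwise. The function names are hypothetical, and any ellipsis or padding that slang's real functions may apply is ignored here.

```javascript
// Hypothetical helpers for illustration; not slang's actual API.

// Match a high surrogate followed by a low surrogate as one symbol;
// otherwise match any single code unit (including lone surrogates).
var SYMBOL = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;

// Split a string into an array of symbols (code points).
function toSymbols(str) {
  return str.match(SYMBOL) || [];
}

// Count symbols rather than UTF-16 code units.
function countSymbols(str) {
  return toSymbols(str).length;
}

// Keep the first n symbols without splitting a surrogate pair.
function truncateSymbols(str, n) {
  return toSymbols(str).slice(0, n).join('');
}

var tetragramString = 'a\uD834\uDF06b'; // 'a', U+1D306, 'b'
var symbolCount = countSymbols(tetragramString);       // 3
var firstTwo = truncateSymbols(tetragramString, 2);    // 'a\uD834\uDF06'
```

This deliberately leaves combining marks alone: each mark still counts as its own symbol.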
Test case for chop using an example astral symbol U+1D306:
Same issue with truncate: it incorrectly counts astral Unicode symbols as two chars.
Maybe http://mathiasbynens.be/notes/javascript-unicode helps to fully understand the problem.
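A minimal sketch of the miscount, using only built-in string operations rather than slang's own chop or truncate (the variable names are illustrative):

```javascript
// U+1D306 written as its UTF-16 surrogate pair.
const tetragram = '\uD834\uDF06';

// The length property counts code units, so one astral symbol reports 2.
const unitLength = tetragram.length;

// Slicing by code units can cut the pair in half, leaving a lone
// high surrogate at the end of the result.
const naiveSlice = ('a' + tetragram).slice(0, 2);

// Iterating by code point sees a single symbol.
const symbolLength = Array.from(tetragram).length;
```

Here `unitLength` is 2 while `symbolLength` is 1, and `naiveSlice` ends in the unpaired surrogate `\uD834`.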