devongovett / slang

A collection of utility functions for working with strings in JavaScript in the browser or Node
MIT License
170 stars 14 forks

Issues with astral Unicode symbols #6

Open mathiasbynens opened 10 years ago

mathiasbynens commented 10 years ago

Test case for chop using an example astral symbol U+1D306:

>> slang.chop('foo\uD834\uDF06'); // U+1D306
'foo\uD834' // expected 'foo' instead

Same issue with truncate: it incorrectly counts astral Unicode symbols as two chars.

>> var string = slang.repeat('\uD834\uDF06', 10); // U+1D306
>> slang.truncate(string, 10) == string
// false, expected true
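
To make the miscount concrete, here is a minimal sketch that contrasts the number of UTF-16 code units with the number of code points. It assumes an ES2015+ environment (for the string iterator and `String.prototype.repeat`), and `countCodePoints` is a hypothetical helper, not part of slang.

```javascript
// U+1D306 is stored as a surrogate pair, i.e. two UTF-16 code units.
const tetragram = '\uD834\uDF06';

// Hypothetical helper (not part of slang): counts code points, not code units.
function countCodePoints(str) {
  // The string iterator walks code points, so a surrogate pair yields one element.
  return Array.from(str).length;
}

const repeated = tetragram.repeat(10);

console.log(tetragram.length);           // 2
console.log(countCodePoints(tetragram)); // 1
console.log(repeated.length);            // 20
console.log(countCodePoints(repeated));  // 10
```

Any function that relies on `.length` or on plain index-based slicing sees 20 "characters" here, which is why the truncate call above cuts the string.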

Maybe http://mathiasbynens.be/notes/javascript-unicode helps to fully understand the problem.

devongovett commented 10 years ago

Yep, Unicode issues abound, thanks for the report. I'll see about fixing this.

However, this brings up the question: what is considered to be a "character"? Is it what JS considers a character (a UTF-16 code unit, so an astral symbol counts as two surrogates), a Unicode code point, a grapheme cluster, or something else? For example, for strings with multiple combining marks, such as the one in the example below, does truncate attempt to preserve those marks or just slice by code points? It seems to make more sense to include the combining marks, but that makes everything much more complicated, since an implementation of the Unicode Text Segmentation algorithm would then be necessary.

// by grapheme cluster

slang.truncate('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞', 2)  // 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢'

// or by code point

slang.truncate('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞', 2) // 'Z͑'

As of now, slang treats a "character" the same way JS does. I think the most correct approach, when slicing or doing other such operations, is to treat a "character" as a sequence of code points (i.e. a grapheme cluster). There is a good case for a library to smooth over Unicode issues that JavaScript the language doesn't already account for. However, the practicality of this might be dubious, just in terms of the amount of data that would have to be downloaded (the Unicode tables are huge!). I'd be interested to hear your thoughts on this as well!
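
For illustration, truncating by grapheme cluster can be sketched with `Intl.Segmenter`, a built-in that postdates this thread and lets the runtime supply the segmentation data instead of the library. This assumes a runtime that ships it (recent browsers and Node versions); `truncateGraphemes` is a hypothetical name, and it leaves out any suffix handling that slang's truncate may do.

```javascript
// Hypothetical helper (not part of slang): keeps at most `limit` grapheme clusters.
// Assumes the runtime provides Intl.Segmenter.
function truncateGraphemes(str, limit) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  // Each segment is one user-perceived character, combining marks included.
  for (const { segment } of segmenter.segment(str)) {
    if (count === limit) break;
    result += segment;
    count += 1;
  }
  return result;
}

// 'Z' and 'A' each carry combining marks; both marked letters survive intact.
console.log(truncateGraphemes('Z\u0351\u036BA\u0334L', 2) === 'Z\u0351\u036BA\u0334');
```

Where `Intl.Segmenter` is not available, the library would have to bundle its own segmentation tables, which is the download cost mentioned above.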

mathiasbynens commented 10 years ago

Since truncate results in ‘data loss’ anyhow, it might be acceptable to normalize the input first and then truncate that. Or, if you’re looking for a simpler / less heavyweight solution, just strip all the combining marks before further processing the string.
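
The lighter-weight option, stripping combining marks before further processing, might look like the sketch below. It assumes a runtime with `String.prototype.normalize` and Unicode property escapes in regular expressions (ES2018); `stripCombiningMarks` is a hypothetical name, not an existing slang function.

```javascript
// Hypothetical helper (not part of slang): removes all combining marks.
function stripCombiningMarks(str) {
  // NFD splits precomposed characters (e.g. U+00E9) into base letter + mark,
  // then \p{M} matches every code point in the Mark general category.
  return str.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripCombiningMarks('Z\u0351\u036BA\u0334L')); // 'ZAL'
console.log(stripCombiningMarks('\u00E9'));                // 'e'
```

Note that this also removes legitimate accents (U+00E9 becomes a plain 'e'), so it only fits cases where that extra loss is acceptable.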

Either way, it would already be much better to just add support for astral symbols (leaving aside combining marks for now), and doing so would only take a few lines of code. IMHO it would be a major improvement over the current behavior, which only deals with UCS-2-like code units.
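
As a rough sketch of what astral-symbol support could look like, the functions below split the string into code points before slicing. The names, the `limit` parameter, and the `'...'` default suffix are assumptions made for illustration and are not taken from slang's source; the sketch also assumes an ES2015+ environment.

```javascript
// Hypothetical code-point-aware variants; names and signatures are assumptions.

// Removes the last code point (not the last UTF-16 code unit).
function chopCodePoints(str) {
  // Array.from uses the string iterator, which keeps surrogate pairs together.
  return Array.from(str).slice(0, -1).join('');
}

// Keeps at most `limit` code points and appends `suffix` only if something was cut.
function truncateCodePoints(str, limit, suffix = '...') {
  const symbols = Array.from(str);
  if (symbols.length <= limit) return str;
  return symbols.slice(0, limit).join('') + suffix;
}

const tenSymbols = '\uD834\uDF06'.repeat(10); // U+1D306, ten times

console.log(chopCodePoints('foo\uD834\uDF06') === 'foo');            // true
console.log(truncateCodePoints(tenSymbols, 10) === tenSymbols);      // true
```

Combining marks are still treated as separate "characters" here, so this addresses only the astral-symbol cases from the original report.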