UTF-16 surrogate pairs largely unsupported

saurik commented 11 years ago

If you take the character "[ok, so GitHub also does not support this character, but it is the one you can copy/paste from this website: http://www.codetable.net/decimal/119558]" and paste it into an ACE instance, it looks pretty much ok, but it acts as if it were not one, but two characters: you can actually position your cursor halfway through the character, and you are further able to delete either only the first or only the second half. I presume (but might totally be off-base) that this is because this character is too high to be represented as UCS-2 (as in, is a non-BMP character), and thereby uses a UTF-16 surrogate pair; as JavaScript strings are more akin to a C array than a real string, the underlying representation in the string type involves both individual "characters" (really, "UTF-16 code units" in this particular case). Is it possible for characters like this to become better supported?

nightwing commented 11 years ago

Yes, it is possible. Cursor movement seems to be relatively easy to fix, it should only check if consecutive characters are in 0xD800-0xDBFF 0xDC00-0xDFFF, right? and https://github.com/ajaxorg/ace/issues/460 will help with cursor position going wrong.

btw. seems like Sublime and Notepad++ do not support characters like this too. Just curious where are they used?

saurik commented 11 years ago

(Wow, I am having a hard time believing that 8 days have already passed... it feels like yesterday).

http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use

The answer from that page seems to come down to "mathematical symbols". Another very common use case is that the various Emoji character sets that are supported by mobile devices (such as Apple's) are mapped to a private use area that is outside the BMP. (Sufficiently common that people have been complaining about JS's non-UTF-16 strings mostly in the context of Emoji.)

http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane

The various "supplementary planes" (that being a link to the first one, but there are more below) can give an understanding of the kinds of characters that are inaccessible to a pure 16-bit BMP-only text editor ;P. It is largely a ton of rarer-languages (even hieroglyphs) and special symbols; one of the planes has a bunch of characters used for names in Asian languages.

Yes: one only needs check if consecutive characters are in those ranges, AFAIK.

saurik commented 11 years ago

Oh, maybe you are asking what tools support them. Really, quite a lot of things do... ViM and Emacs have absolutely no problem with them, for example. You can also use them just fine in Microsoft Word. That said, it honestly does not surprise me that neither Sublime Text nor Notepad++ support this correctly (these are not what I tend to think of as world-class editors... you might disagree?).

The one case where I have been negatively surprised is that Pages, Apple's Word-alike, doesn't support them correctly, even though most of the rest of Apple's operating system does (random text boxes, etc.)... even TextEdit (the default OS X notepad app) has no problem handling non-BMP characters, yet Pages experiences the same issues with ending up halfway though characters that ACE does.

saurik commented 11 years ago

I went ahead and made a draft of the kinds of changes that would be required to make ACE Unicode-compliant, and even managed to figure out how to use GitHub well enough to cause the commit to appear as if it were attached to this issue ;P. Due to the excessive Unicode-hostile assumptions made by ACE, this patch is large. I have described the situation further in the full commit message, which I have also attached below in this comment for ease of access.

JavaScript strings are not "strings": they are immutable arrays of 16-bit numbers. Much as developers attempting to manipulate "strings" in a language like C need to use the high-level string function mbslen instead of the low-level character array function strlen to handle UTF-8 sequences, developers attempting to manipulate "strings" in JavaScript need to use functions that understand UTF-16 sequences instead of dropping down to low-level JavaScript string functions.

Note: this patch is not "complete", in that only the critical editor-level functionality has been fixed; I have not yet spent the time to fix all of the various plugins or even surrounding features such as search. However, this proof-of-concept handles the core issue I described in the upstream issue #1153, and fixes the problem not just at the level of cursor movements and character updates, but correctly returns multi-unit characters from events and via the ACE API.

nightwing commented 11 years ago

hmm, this doesn't feel quite right. With this any code that interacts with ace and cares about unicode still have to convert its data from array offsets to "string" offset, only to have ace convert it back, since these conversions have to be done in multiple places, many will be missed, and go unnoticed since not many people use those characters. If we keep array column as a js string offset like now, and only modify high level code to take care of unicode everything would be simpler, only cursor movement will be affected.

saurik commented 11 years ago

Another way to look at this is that any surrounding software that is not already taking into account these offsets correctly actually has a bug in it that should be fixed. Here is an example: if I add some functionality somewhere that says "yank the next two code points and move them to the next line", I should be able to just do that with the API of the high-level text editor, and not have to screw around thinking "oh, this text editor is not Unicode compliant, so I need to first get the line, see how may code units my two code points take up, and then maybe pass 3 or 4 to the API instead of 2; also, when I move it to the next line, I need to first get that line so I can pull it apart and find the position I want to insert". All of that logic should be embedded into the text editor, not required to be reimplemented by every single person who wants to use the text editor and build software around it.

I mean, imagine how bad it gets when you have lots of other code involved: let's say you take the text document, send it through another Unicode compliant API, and then send the results to a Unicode compliant server... the server might not even be implemented using UTF-16... it might be using UCS-4, and the document may have been transcoded to UTF-8 for transport... why is my text editor requiring me to understand UTF-16? That should be an implementation detail of this text editor (that it stores all of its data in JavaScript strings that are in turn implemented on top of a 16-bit array with UTF-16 encoding), not something I need to take into account every time I want to use the API. :(

To put it yet another way: why do you feel that "any code that interacts with ace and cares about unicode still have to convert its data from array offsets to 'string' offsets, only to have ace convert it back"? If you write code that interacts with ACE (such as I have) and you care about Unicode you already have "string" (code point) offsets, and the fact that ACE requires you to instead think in terms of UTF-16 code unit offsets means there is an extra conversion. I am now using ACE with this patch on my website, and it has allowed me to get rid of a ton of conversion logic I previously had to have: now I can take the offsets from ACE and send them directly to my server without having to first do manual UTF-16 conversion as the data comes out of ACE broken.

The way I feel you should then think about this is that the server (or whatever other Unicode compliant API you are using) knows the string's length in code points, not UTF-16 code units: if it thinks the string has length 16 and it sends the text to the client (which will involve first encoding it to something for the wire, something which is probably not UTF-16, and then the browser taking that UTF-16 and converting it to UTF-16 for JavaScript; the browser might not even be storing the text for its own use in UTF-16), and later tells it something about code points 5 through 10... why do I suddenly need to add a bunch of logic to my code to deal with "oh, ACE deals with UTF-16 code units, so before I interact with ACE I need to convert those to 7 and 16 by way of some irritating and complex conversion"?

This is actually a critical problem (as in, one where there isn't even a reasonable workaround), as some browsers have an issue where data taken from the DOM and converted into JavaScript doesn't get UTF-16 encoded: instead, out-of-BMP characters are simply converted to the Unicode "this isn't a valid code point" character. This means that if the server thought it had a two code-point string consisting of "A#" (where # is that crazy five-line bar character I linked to in my original description) in some browsers this will turn into a three code-unit JavaScript string object with A and a surrogate pair, and in some browsers it will turn into a two code-unit JavaScript string with A and \uFFFD (the "replacement character" reserved in Unicode for "I can't represent this character, but maybe you don't need it as maybe your font can't render it anyway"). The server not only has no clue that the client needs UTF-16 code units, but now can't even calculate the correct number of UTF-16 code units to supply it even if it wanted to include a UTF-16 encoder because in some browsers it is getting replaced by a single broken code point.

To flesh out that server example further: let's say that there is a document on a webserver, and it is stored in GB 18030, the Chinese version of Unicode, and one of the characters in that file is a name of a person. Someone else, a third party, parses that file, and tells me that on line 17, characters 18-32, there is a URL I need to underline. When that person downloaded and parsed the document, they are going to be operating on a list of Unicode code points, not on a byte array: they probably don't even know that the document was originally GB 18030, and they have no clue that I'm using a text editor that insists the world only uses characters that fit into UCS-2. When they say characters 18-32, they mean the code points at index 18 through 32 in the high-level Unicode representation of the file. When I get those indexes, I'd love to be able to just send them through as Range objects to my text editor, and that's exactly what the patch I have provided allows: without this kind of patch, I have to convert the offsets before passing them into this API.

Now, you might read that example and think "that's crazy: no one implements things like that", but that's actually how Twitter works: when they return a tweet via their API, they return both the text of the message and a set of "entities",which are pre-parsed sections of the string such as URLs, user-mentions, and hash-tags. They parse this ahead of time and return it so that every client can represent the message in the same way (so you don't have some clients thinking a URL inside of a set of parentheses accidentally includes the trailing parenthesis while others do not). To do this, they take the Tweet, which was typed by a user into a box in their web browser and stored by their operating system in memory as their preferred default encoding (maybe GB 18030) and send it to the server via AJAX or POST; on the way to the server, the browser transcodes the information to UTF-8 (possibly also URL-encoded), where they then decode it into a high-level string on the server for parsing.

Of course, they don't know what encoding your client (or other server) uses: it could be written in Java (with an irritating UTF-16 API like JavaScript), PHP (where you are stuck with byte arrays and a UTF-8 library), Python (where you have a fully-fledged "unicode" type that is encoding-agnostic), or even Ruby (where every string can specify its own encoding, and the API abstracts on top), so when they return offsets into the string they certainly return code-point offsets, not UTF-16 code-unit offsets. This is the dream: a world where all our software, no matter how implemented, can understand the same indices; the goal of my patch is to help ACE play along (not just cursor movement, but the entire API from how events are returned to how Range objects are used). ;P

FWIW, I'd totally accept if you said "we think code points is too low-level of a concept, and would rather figure out how to make our API think all the way up at the character-level"; however, the code-unit level is way way too low-level as it is encoding-specific and makes it infuriating to interoperate with other libraries and APIs, especially when those components are implemented on servers in languages other than JavaScript... if you are forced to think about encodings, you are normally forced to think about UTF-8, not UTF-16. (I am also willing to believe I'm just misunderstanding your concern, but if you are just affecting cursor movement it doesn't fix all of the other things, like the aforementioned column offsets returned by event objects and that are used in Range objects).

nightwing commented 11 years ago

I didn't think about communication with non js servers. But still most of the code interacting with ace will be js, and it will have this bug in foreseeable future. My concern is that with this patch implementation detail of String class is spread all over the editor. And there are many more places that need to change. Most of these code does not need to do anything special for unicode, it just uses string indexOf and regular expressions so if document keeps ranges same as underlying string everything will work fine, but if document uses special string class, they all have to use that same string class. And having UnicodeString.foo(str) instead of str.foo doesn't make code more readable.

I think since we can't fix the bug at the lowest level, next best place for putting workaround is the interface between js and parts that understand unicode (user, non js server). Maybe a special api that uses unicode strings (insertW, replaceW, on("changeW") etc) and util for converting between encodings could help?

wcandillon commented 11 years ago

I'm experiencing the same issue with the XQuery mode. Unicode characters can be inserted nicely in the editor but the cursor position becomes faulty.

tdsmith commented 7 years ago

I would love to see this issue revisited! Emoji have become widespread over the last four years and it would be great if Ace had support.

This issue affects RStudio: https://support.rstudio.com/hc/en-us/community/posts/245294067-Emoji-corrupt-editor-state

github-actions[bot] commented 2 years ago

This issue has received a significant amount of attention so we are automatically upgrading its priority. A member of the community will see the re-prioritization and provide an update on the issue.

ajaxorg / ace

UTF-16 surrogate pairs largely unsupported #1153