Pauan / rust-dominator

Zero-cost ultra-high-performance declarative DOM library using FRP signals for Rust!
MIT License

Astral plane characters in user input can get mangled on Windows #10

Closed chris-morgan closed 5 years ago

chris-morgan commented 5 years ago

On Windows, character input was traditionally done via UTF-16 code units. The result of operating this way is that any code point that requires a surrogate pair in UTF-16 will come as two user input events instead of one. There’s a newer technique that lets you take an entire code point at once for applications that support it, but it’s a mixed bunch whether IMEs that are sending the input in the first place will operate in that way. I have two ways of entering emoji, for example: Windows 10’s now-built-in emoji picker (Win+.), and my compose key, using WinCompose. The emoji picker is capable of sending both at once, while WinCompose sends the surrogates individually. (I’m not certain about other platforms and their IME techniques; I think they’re probably all safe from this gotcha.)

The result is this: you cannot trust the value of an <input> to be a legal UTF-16 string at any given point; it should (should) eventually be a legal Unicode string, but in the meantime it is just like any DOMString, merely a sequence of legal UTF-16 code units.

This is the effect of this on the todomvc example: when I type 😕 into the #new-todo input via WinCompose, it is actually input as two input events, one with the high surrogate 0xD83D, and then one with the low surrogate 0xDE15, which together make U+1F615. (With the emoji picker it comes through in one event, so the bug does not occur.)
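The surrogate arithmetic behind those two code units can be checked with a short Rust sketch. The helper name `combine_surrogates` is made up for illustration; only standard-library calls are used.

```rust
/// Combine a UTF-16 high/low surrogate pair into the code point it encodes.
/// Returns None if either unit lies outside its surrogate range.
/// (Hypothetical helper, named here for illustration only.)
fn combine_surrogates(high: u16, low: u16) -> Option<u32> {
    if !(0xD800..=0xDBFF).contains(&high) || !(0xDC00..=0xDFFF).contains(&low) {
        return None;
    }
    // Each surrogate carries 10 bits of the code point's offset above U+10000.
    Some(0x10000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00))
}

fn main() {
    // U+1F615 is outside the Basic Multilingual Plane, so UTF-16 needs two units.
    let units: Vec<u16> = "\u{1F615}".encode_utf16().collect();
    assert_eq!(units, [0xD83D, 0xDE15]);

    // The two units recombine into the original code point.
    assert_eq!(combine_surrogates(0xD83D, 0xDE15), Some(0x1F615));

    println!("{:X?}", units);
}
```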

The first one triggers an input event on the <input> element. The code fetches the value, finds it to be "\ud83d" (in JavaScript terms), then tries to turn it into a UTF-8 string for Rust, and encountering an unmatched surrogate replaces it with the replacement character, � (U+FFFD). Then, because the binding is two-way, it writes back to the text input.

Then the second logical keystroke is processed: the low surrogate is appended to the �, the value goes through Rust again, and so "\ufffd\ude15" becomes "\ufffd\ufffd".

End result: 😕 became �� because of the combination of two-way bindings and the use of UTF-8 instead of WTF-8.
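The sequence described above can be reproduced outside the browser with a minimal simulation. It assumes that the JS-to-Rust conversion behaves like the standard library's `String::from_utf16_lossy` (lone surrogates become U+FFFD); the function `round_trip_lossy` is a hypothetical stand-in for the two-way binding, not dominator's actual code.

```rust
/// Simulates one pass of a two-way binding: decode the input's UTF-16 code
/// units lossily (a lone surrogate becomes U+FFFD), then re-encode the Rust
/// string as UTF-16 and treat that as the value written back to the input.
/// (Hypothetical stand-in, assuming lossy conversion as described above.)
fn round_trip_lossy(value: &[u16]) -> Vec<u16> {
    String::from_utf16_lossy(value).encode_utf16().collect()
}

fn main() {
    let mut input: Vec<u16> = Vec::new();

    // First input event: the high surrogate arrives alone.
    input.push(0xD83D);
    input = round_trip_lossy(&input);
    assert_eq!(input, [0xFFFD]);

    // Second input event: the low surrogate is appended to the replaced value.
    input.push(0xDE15);
    input = round_trip_lossy(&input);
    assert_eq!(input, [0xFFFD, 0xFFFD]);

    // If nothing is written back in between, the complete pair decodes fine.
    assert_eq!(String::from_utf16_lossy(&[0xD83D, 0xDE15]), "\u{1F615}");
}
```

The damage comes from the write-back after the first event: once the high surrogate has been replaced, the low surrogate has nothing left to pair with.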

This particular case can be resolved by killing off the altogether unnecessary and inefficient two-way binding of the value (just read and reset the value at submission time, you have a handle to the DOM node), and leaving the browser to sort it out, but it’s indicative of a broader class of bug that will generally affect few people (not many people use a compose key on Windows), but could be catastrophic for e.g. Chinese users, depending on the IME they’re using.

I think this is the first time I’ve ever come across a thing on the web that didn’t cope with transient unmatched surrogates—it’s not something that’s ever likely to trip you up in JavaScript, but it’s a problem for wasm stuff.

Pauan commented 5 years ago

Thanks for the very detailed bug report!

This seems much deeper than dominator: it has to do with the inherent discrepancy between UTF-16 and UTF-8.

As you say, this is a very broad bug, with a lot of implications. I don't think dominator is the right place to fix this, since it will affect pretty much any wasm app which deals with strings.

Ideally, we would be able to change the string conversion to round-trip correctly. Failing that, perhaps it's best to not use String at all, and instead use raw JS Strings.
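As a rough illustration of one possible mitigation (not necessarily what wasm-bindgen or stdweb would adopt), the Rust side could refuse to convert a value that is not yet well-formed UTF-16 and skip the update instead of writing a lossy copy back. The helper name `try_sync` is hypothetical; `String::from_utf16` is the real standard-library call.

```rust
/// Hypothetical helper: convert the input's UTF-16 code units to a Rust
/// String only when they are well-formed UTF-16. A None result means the
/// value is still transient (e.g. a lone surrogate) and should be left alone.
fn try_sync(value: &[u16]) -> Option<String> {
    String::from_utf16(value).ok()
}

fn main() {
    // Lone high surrogate: skip the update and write nothing back.
    assert_eq!(try_sync(&[0xD83D]), None);

    // Once the low surrogate arrives, the value converts without loss.
    assert_eq!(try_sync(&[0xD83D, 0xDE15]), Some("\u{1F615}".to_string()));
}
```

This keeps `String` as the application-facing type, at the cost of the Rust state briefly lagging behind the DOM while a surrogate pair is incomplete.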

Pauan commented 5 years ago

I also want to address this:

"This particular case can be resolved by killing off the altogether unnecessary and inefficient two-way binding of the value (just read and reset the value at submission time, you have a handle to the DOM node)"

I do not consider it unnecessary. The entire purpose of dominator is to provide a high-level abstraction over the DOM.

For many years I wrote JavaScript apps that use raw DOM nodes, and it is painful. My intention is specifically to avoid that hell.

Being able to work with native Rust types, and knowing that your Rust state is always synchronized with the DOM, is really awesome. Avoiding the DOM as much as possible is really nice. It's the same reason so many VirtualDOM frameworks have become so popular.

Pauan commented 5 years ago

I made some bug reports:

https://github.com/rustwasm/wasm-bindgen/issues/1348
https://github.com/koute/stdweb/issues/331

I'm in the Rust Wasm WG, and I also contribute to wasm-bindgen and stdweb, so I'll be working closely with them to fix this for all Rust programs, not just dominator.

Pauan commented 5 years ago

Thanks for the report on this. Since it's going to be fixed elsewhere (probably in the browsers themselves), I'm going to close this.