aduros / wasm4

Build retro games using WebAssembly for a fantasy console.
https://wasm4.org
ISC License

`textUtf8` doesn't parse as UTF-8 #509

Closed JerwuQu closed 5 months ago

JerwuQu commented 2 years ago

Comparing https://github.com/aduros/wasm4/blob/493eaec96b8c5e75d8031413e1a312690bff5733/runtimes/web/src/runtime.ts#L197-L200 with https://github.com/aduros/wasm4/blob/493eaec96b8c5e75d8031413e1a312690bff5733/runtimes/web/src/runtime.ts#L278-L282, it seems the TextDecoder was forgotten in the implementation of textUtf8.

This is also true for the native runtime, and for the UTF-16 methods.

EDIT:

There's been a lot of discussion, and SLiV has done a lot of good work on this. With the Rust template fixed, the issue isn't as big in practice anymore.

The best solution right now would probably be to clearly document, everywhere textUtf8 and textUtf16 are used, that they are misnomers, and live with that fact. An alternative text8/text8len/text8slice (and the same for 16) could be added, but we would still need to keep the old versions for compatibility anyway, so it would increase the API surface without adding functionality.

SLiV9 commented 2 years ago

As far as I understand, there is an important difference between traceUtf8 and textUtf8. The former outputs text to the console, which expects a UTF8 encoded string. The latter draws the text on the screen using the default font, with a character set that is a custom extension of US-ASCII:

https://github.com/aduros/wasm4/blob/493eaec96b8c5e75d8031413e1a312690bff5733/runtimes/web/src/framebuffer.ts#L289

That said, the current behavior of textUtf8 does seem incorrect. The valid UTF8 encoded string

It costs £20.

is rendered as

It costs Â£20.

because the Unicode code point U+00A3 (£) is encoded in UTF8 as 0xC2 0xA3, which is treated by drawText as two separate ASCII characters, 0xC2 (Â) and 0xA3 (£).
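
This is easy to verify in a browser console with the standard TextEncoder (plain JS/TS, nothing WASM-4 specific):

new TextEncoder().encode("£");    // Uint8Array [0xC2, 0xA3], the UTF-8 bytes
String.fromCharCode(0xC2, 0xA3);  // "Â£", i.e. what drawText currently renders byte by byte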

I think a good solution would be to decode the UTF8, iterate over the code points (with [...str]) and map each code point to an ASCII character, with valid-but-unsupported unicode codepoints being treated as blank characters. I could implement something like that after the jam, if no one else does.
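
Roughly something like this; this is only a sketch of the idea, not the actual framebuffer.ts code (the helper name and the -1 blank marker are made up for illustration):

// Sketch: decode UTF-8 first, then map each code point to a font offset.
function charsetIndices(utf8Bytes: Uint8Array): number[] {
    const str = new TextDecoder().decode(utf8Bytes);  // UTF-8 -> JS string
    const indices: number[] = [];
    for (const ch of str) {                           // iterates code points, like [...str]
        const cp = ch.codePointAt(0)!;
        if (cp >= 0x20 && cp <= 0xff) {
            indices.push(cp - 0x20);                  // supported: U+0020..U+00FF -> offsets 0..223
        } else {
            indices.push(-1);                         // valid but unsupported: blank (or error glyph)
        }
    }
    return indices;
}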

Because the current behavior of textUtf8 for Unicode code points higher than 127 (i.e. outside of US-ASCII) is bugged anyway, I don't think it's necessary to deprecate it and add textUtf8_2. Despite Hyrum's Law, I personally cannot imagine anyone actually relying on the insertion of extra Â characters.

JerwuQu commented 2 years ago

As far as I understand, there is an important difference between traceUtf8 and textUtf8. The former outputs text to the console, which expects a UTF8 encoded string. The latter draws the text on the screen using the default font, with a character set that is a custom extension of US-ASCII:

Yep, they have different glyphs, but both should expect UTF-8 encoded data.

valid-but-unsupported unicode codepoints being treated as blank characters

I think it'd be better to show an error symbol in that case.

Because the current behavior of textUtf8 for Unicode code points higher than 127 (i.e. outside of US-ASCII) is bugged anyway, I don't think it's necessary to deprecate it and add textUtf8_2.

Thing is that a lot of games currently use textUtf8, and give it 0x80 etc for icons. Both the Zig and Rust templates use textUtf8 instead of text, for example. If UTF-8 decoding is fixed in the current API, this would break all those games.

To further demonstrate: new TextDecoder().decode(new Uint8Array([0x80])).codePointAt(0) gives 65533 (aka �) instead of 128.

SLiV9 commented 2 years ago

Oh you are totally right. In fact I myself mentioned using

text(unsafe { std::str::from_utf8_unchecked(&[0x80]) }, 20, 20);

in Rust as a workaround for this bug, which (besides being unsafe/undefined behavior) would then break.

In terms of deprecating, there are three levels of compatibility, I think. Given a game developed before today, i.e. with w4 version at most v2.5.3, that uses a workaround like the above to render the buttons:

  1. should the cartridge (mygame.wasm) still render the buttons when run with the patched runtime v2.x.y?
  2. should updating w4 to v2.x.y, but keeping the header (wasm4.zig, wasm4.rs, etc) the same, result in buttons being rendered?
  3. should updating both w4 and the header result in the buttons being rendered?

I think 1. is a definite yes because game developers don't have a chance to intervene. For 2., I could almost imagine w4 bundle --html mygame.wasm failing with an error message, but it would have to detect a call to textUtf8 whose first argument is a byte array with at least one byte above 127. I don't know enough about (parsing) wasm to know where that lies from "simple regex" to "halting problem". But perhaps that idea goes out the window if we consider automatic deployment.

Would it be ok to break at 3. by assuming the developer would test the game and either see a warning message in the terminal or see that the buttons are now rendered incorrectly? Because then we can keep the public API the same and simply change e.g.

https://github.com/aduros/wasm4/blob/6ec52ca37a1ab85371381801311cedf2bd49786b/cli/assets/templates/rust/src/wasm4.rs#L143-L144

to

#[link_name = "textUtf8_2"]
fn extern_text(text: *const u8, length: usize, x: i32, y: i32);

instead of adding pub fn text_2(...) to wasm4.rs, which would be rather ugly.

I think it'd be better to show an error symbol in that case.

Good point. We would need to add one or two more glyphs to the font. I would suggest a replacement-character glyph (�) for invalid UTF-8 and a tofu block for valid but unsupported code points.

In particular, if someone used the above workaround in v2.5.3, decides to update w4 and the header file, and then recompiles, then � would appear where the buttons used to be. By contrast, 􏿮's would appear if someone called text("привіт", 20, 20), instead of the mojibake that appears with v2.5.3. The difference might help clarify that the button glyphs can be drawn with \u{0080} in Rust instead of 0x80, but that drawing characters outside "WASM-ASCII" is not supported at all.

(As a little pre-emptive bikeshed, maybe a crossed box or a box with a question mark would be better for 􏿮, to avoid confusion with a box that somehow has gameplay relevance, such as the one suggested here: https://github.com/aduros/wasm4/issues/195#issuecomment-1120272829.)

To further demonstrate: new TextDecoder().decode(new Uint8Array([0x80])).codePointAt(0) gives 65533 (aka �) instead of 128.

Yeah exactly. In fact this simplifies things: we could use the � glyph at offset 224 only to render the Unicode character � (code point U+FFFD) when it appears in the output from TextDecoder, which happens if and only if the input contains an invalid Unicode code point or is not valid UTF-8.

Edit: I just realized it would be better to add these glyphs at the bottom of the font PNG, at offsets 224 and 225, instead of using two of the empty slots. That way, it is impossible to generate them with the text function (because bytes 0x00-0x1F are not rendered and 0x20-0xFF map to offsets 0-223), which makes it clearer what their purpose is, and leaves those slots available for characters that game developers actually want to render.
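
In other words, the byte-to-offset math stays exactly as it is, and the two new glyphs stay out of reach of text (a sketch with made-up names, not runtime code):

const REPLACEMENT_GLYPH = 224;  // proposed: drawn for invalid UTF-8 / U+FFFD
const TOFU_GLYPH = 225;         // proposed: drawn for valid but unsupported code points
function glyphOffsetForByte(byte: number): number | null {
    if (byte < 0x20) return null;  // 0x00-0x1F are not rendered
    return byte - 0x20;            // 0x20-0xFF map to offsets 0-223
}
// glyphOffsetForByte never returns 224 or 225, so a cart cannot draw the new glyphs via text.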

SLiV9 commented 2 years ago

I've started working on a fix (https://github.com/aduros/wasm4/pull/528) for the web runtime. It still requires a solution for backwards compatibility and a port to the native runtime, but I wanted to get your feedback on it because perhaps that might lead to a different approach.

I've written some test code (https://github.com/SLiV9/wasm4/tree/utf8/examples/utf8) in Rust, Zig, C and AssemblyScript with WASM-4 v2.5.3, and then ran the cart in v2.5.3 (left screenshot) and in a local runtime with the fix applied (right screenshot).

Each test prints "press X to blink" (the main test), "$10/£10/€10", "Aÿ!!!", "하와!!!", "𐍅𐌰𐍃𐌼4!", U+FFFD, U+FFFE and U+FFFF. "$10/£10/€10" has two supported currencies (US-ASCII dollar and extended ASCII pound sterling) and one unsupported currency (euro). "Aÿ" is the name of a French town (and 0xFF is the highest supported character). "하와!!!" is Korean (unsupported in WASM4). "𐍅𐌰𐍃𐌼4!" is "WASM4" in 4th century Gothic (obviously unsupported, but interesting because its Unicode codepoints are higher than U+FFFF). U+FFFD is the Unicode replacement character (previously unsupported, now rendered as itself). U+FFFE and U+FFFF are explicitly invalid Unicode codepoints.

Rust: (screenshots: wasm4-screenshot_2022-08-23_utf8_rs_before / wasm4-screenshot_2022-08-23_utf8_rs_after)

As intended, this fix allows safe Rust to draw the X button using \u{0080}, but as expected it treats the invalid UTF-8 from the std::str::from_utf8_unchecked hack as invalid UTF-8. Because Rust source code is UTF-8 encoded, drawing the pound sign is now more convenient: just text("£", 20, 20). Unsupported Unicode characters appear as individual tofu blocks instead of mojibake, which is in my eyes an improvement because it is easier to explain in the documentation and the width of the text is correct.

Zig: (screenshots: wasm4-screenshot_2022-08-23_utf8_zig_before / wasm4-screenshot_2022-08-23_utf8_zig_after)

The output in Zig is identical. However, unlike Rust, Zig allows invalid UTF-8 characters in its strings (see https://ziglang.org/documentation/master/#toc-String-Literals-and-Unicode-Code-Point-Literals), which means text("Press \x80", 20, 20) used to work and now doesn't.

An argument could be made that both 0x80 and \u{0080} should be supported in Zig. I'm personally not a fan given that Zig strings are assumed to be UTF-8 encoded, and 0x80 is just not valid UTF-8. It would also raise further questions: if 0x80 is allowed, then why not 0xC2 for Â (U+00C2)? But then the byte sequence 0xC2 0x80 could mean either Â followed by the button glyph, or just the button glyph.

I didn't test Odin or Go, but I expect them to behave the same.

C: (screenshots: wasm4-screenshot_2022-08-23_ascii_c_before / wasm4-screenshot_2022-08-23_ascii_c_after)

C, C++, D, Nelua, Nim, Porth, Roland and WebAssembly Text all call text, not textUtf8, and should not be affected. However text also calls drawText, so I wanted to make sure that nothing changes. Here "Press \x80" still works as normal. Because C assumes strings are ASCII (or rather makes no assumptions, but WASM-4 treats the input to text as ASCII), UTF-8 encoded strings are turned into mojibake at the cart level, so no tofu blocks appear. To draw the supported characters from extended ASCII, \xNN escapes have to be used, as before.

FWIW, I built this example with Zig (zig translate-c) because I could not immediately get WASI-SDK to work. The result should be the same, because it calls the runtime function text, which is what I wanted to test. If someone wants to double-check my work, they could recompile with make.

[...] for the UTF-16 methods (which I'm not sure why we have in the first place).

The textUtf16 function is used by AssemblyScript, presumably because JS strings are UTF-16 encoded. I haven't yet changed textUtf16, but I was surprised to see the following behavior:

AssemblyScript: (screenshots: wasm4-screenshot_2022-08-23_utf16_assemblyscript_before / wasm4-screenshot_2022-08-23_utf16_assemblyscript_after)

Both the \x80 and \u0080 escapes work because in JS/TS/AS both are interpreted as U+0080 and then stored in the UTF-16 encoded string as 0x0080. This string is turned into a Uint16Array and then each u16 is treated the same way as the u8 from text. For Unicode code points U+0020 through U+00FF, which is the subset that corresponds to extended ASCII, this happens to work. Very large Unicode code points are not represented in UTF-16 by a single number, which means double tofu blocks appear. This is a bit inconsistent, but because they are so far outside what WASM-4 supports, I'm not sure it is worth fixing. In fact, maybe we don't need to touch textUtf16 at all.
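
To spell out the current behavior (a simplified sketch, not the actual runtime.ts code):

// Each UTF-16 code unit is treated like the u8 passed to text.
function charsetOffsetsForUtf16(codeUnits: Uint16Array): number[] {
    const offsets: number[] = [];
    for (const unit of codeUnits) {
        // U+0020..U+00FF happen to coincide with extended ASCII, so these work.
        // Higher code units, including the surrogate pairs that encode code points
        // above U+FFFF, fall outside the 224 glyphs (hence the double tofu blocks).
        offsets.push(unit - 0x20);
    }
    return offsets;
}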

I didn't expect the black boxes in the original build, though. At least, nothing in drawText seems to explain their appearance. Is this an (un)intended side effect of blit when srcY exceeds the boundaries of the source image? If it is well-defined behavior, we could also skip adding the two new glyphs and use black boxes for textUtf8 as well. It would still be an improvement over the current mojibake.

(Sorry for the wall of text.)

Let me know what you think and whether I should change anything or keep going.

JerwuQu commented 2 years ago

This is some rigorous testing! I think it's on the right track, and the invalid-codepoint characters are a nice addition.

It's a difficult situation.

I still believe in what I said in the original post, which is deprecating the current textUtf8 and all UTF-16 methods, and instead adding textUtf8_2 (and maybe textAscii too, but it's not as important.) Note that by deprecating, I mean removing it from headers and printing a warning in the console when it is imported, but still keeping the implementations for as long as needed. Headers would (where possible) change so that textUtf8 instead imports textUtf8_2 so that new projects use the new implementation.

As for UTF-16, I would personally opt for deprecating WASM-4 support for it completely. AssemblyScript already has support for re-encoding strings to UTF-8 (which could be done automatically in wasm4.ts), and there's even talk about getting UTF-8 strings in the language, because as it turns out, not many people actually like UTF-16 :p I have no idea if the UTF-8 encoding would add considerable overhead to the cart size. @MaxGraey might be able to give some input on this point.

I'm personally not a fan given that Zig strings are assumed to be UTF-8 encoded, and 0x80 is just not valid UTF-8.

Yeah, this is why there could be a textAscii instead, so that you can just use the u8 arrays directly, but there's also not too much pain in just keeping it as proper UTF-8 in Zig either so I don't think it's entirely necessary. I think textUtf8 was used more because text requires a null-terminator, and textUtf8 takes a length instead which makes it work better with Zig slices.

MaxGraey commented 2 years ago

I have no idea if the UTF-8 encoding would add considerable overhead to the cart size.

Yes! String.UTF8.decode + String.UTF8.encode adds roughly 1 KB, which I think is quite significant.

MaxGraey commented 2 years ago

I don't really understand the point of removing textUtf16 from the API, especially since it has been there from the beginning. Whether you like it or not, languages that use UTF-16 will never remove it. Moreover, the new WebAssembly stringref proposal also includes UTF-16/WTF-16 support as well as UTF-8/WTF-8.

MaxGraey commented 2 years ago

As I understand it, textUtf8/textUtf16 are actually misleading names. Both methods accept only ASCII and use per-byte lookups in the FONT bitmap: textUtf8 just accepts a raw u8 array, while textUtf16 accepts a raw u16 array. So it would be better to rename them to text8 and text16, without mentioning UTF (Unicode), which isn't actually used in wasm4.

SLiV9 commented 2 years ago

Note that by deprecating, I mean removing it from headers and printing a warning in the console when it is imported, but still keeping the implementations for as long as needed. Headers would (where possible) change so that textUtf8 instead imports textUtf8_2 so that new projects use the new implementation.

Ok yeah, that makes sense to me. I was worried that game devs would have to switch to using a text2 function for all new games, which would be quite ugly.

As I understand it, textUtf8/textUtf16 are actually misleading names. Both methods accept only ASCII and use per-byte lookups in the FONT bitmap: textUtf8 just accepts a raw u8 array, while textUtf16 accepts a raw u16 array. So it would be better to rename them to text8 and text16, without mentioning UTF (Unicode), which isn't actually used in wasm4.

This would work, except that strings in some languages (Rust in particular) are explicitly UTF-8 by default, which is incompatible with extended ASCII. (UTF-8 is compatible with US-ASCII because the first 128 Unicode codepoints are all encoded in 1 byte, but not with extensions like the one used by WASM-4.) And creating strings in a different encoding either requires undefined behavior or is much less ergonomic. For AssemblyScript this problem doesn't exist because all 256 extended ASCII characters are numerically equal to their UTF-16 counterparts: 0x80 is 0x0080 in UTF-16 but 0xC2 0x80 in UTF-8.
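
For concreteness, the representations of U+0080 side by side (plain JS/TS, nothing WASM-4 specific):

"\u{0080}".charCodeAt(0);              // 0x0080, a single UTF-16 code unit (the AssemblyScript case)
new TextEncoder().encode("\u{0080}");  // Uint8Array [0xC2, 0x80], the UTF-8 bytes (the Rust case)
// The bare byte 0x80 is not valid UTF-8 on its own, which is why safe Rust cannot produce it.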

Basically if Zig support had been added before Rust, the function could have been called text8 or textAscii, but then Rust support would have required adding a different function that does actually parse UTF-8.

Yeah, this is why there could be a textAscii instead, so that you can just use the u8 arrays directly.

So an alternative solution could be a textAscii-style entry point for languages that pass raw bytes.

Based on a quick search, Odin and Go also allow for arbitrary bytes / do not enforce UTF-8, so they could have the same treatment as Zig. In fact you wouldn't even need to give a deprecation warning for Zig, Odin and Go, if you could somehow distinguish them from Rust during import.

I guess the question is then, given the choice between

// Team C
w4.text("Hello Ren\xE9, press \x80 to gain \xA310 (\xA92022)", 20, 20);

and

// Team Rust
w4.text("Hello René, press \u{0080} to gain £10 (©2022)", 20, 20);

which do Zig, Go and Odin users prefer?

SLiV9 commented 2 years ago

I've implemented backwards compatibility and ported it to the native runtime. I'm not familiar enough with WebAssembly to figure out how to actually make the old functions deprecated. A trace each time the old function is called could work, but I'm not sure it's actually worth it, given that we can't actually ever remove those functions.

I took the liberty of calling the new method textFromUtf8 instead of textUtf8_2, because I think it looks nicer next to the other camel-case methods, and because it conveys that some lossy conversion takes place (from UTF-8 to ASCII), whereas traceUtf8 actually pipes UTF-8 encoded strings to the terminal.

Here is the native runtime running Rust (calling textFromUtf8) and AssemblyScript (still calling textUtf16):

(screenshots: wasm4-screenshot_2022-08-25_utf8_rust_native_after / wasm4-screenshot_2022-08-25_utf16_assemblyscript_native_after)

There are some tiny differences with the web runtime, but only for unsupported characters. I didn't want to add a new dependency to the native runtime, so I implemented a 20 LoC decoder that only works for UTF-8 encoded ASCII characters. For everything else (valid Unicode or not) it draws replacement characters.
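
The gist of that decoder, sketched here in TypeScript for readability (the native runtime is C, and this is not the PR code verbatim):

// Decode UTF-8, but only the one- and two-byte sequences that land in U+0000..U+00FF.
const REPLACEMENT = 0xfffd;
function decodeLatin1Subset(bytes: Uint8Array): number[] {
    const out: number[] = [];
    for (let i = 0; i < bytes.length; i++) {
        const b = bytes[i];
        if (b < 0x80) {
            out.push(b);  // one-byte sequence: US-ASCII
        } else if ((b === 0xc2 || b === 0xc3)
                && i + 1 < bytes.length
                && (bytes[i + 1] & 0xc0) === 0x80) {
            out.push(((b & 0x1f) << 6) | (bytes[i + 1] & 0x3f));  // two bytes: U+0080..U+00FF
            i++;
        } else {
            out.push(REPLACEMENT);  // everything else, valid Unicode or not
        }
    }
    return out;
}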

If this seems too inconsistent, we could scrap the tofu block and only use the replacement character in the web runtime as well. However I think it still has value, seeing as the web runtime is the one used during development.

SLiV9 commented 2 years ago

I finished updating the templates and the documentation. I was a bit confused by the mismatched function declarations in the Go and Odin templates, and I wasn't able to test them. (I assume the annotations automatically convert each string argument to a pointer-length pair?) Nor did I individually test the code snippets from text.md, but I did check the language specifications for each to make sure the character escapes are correct.

This is now done, unless someone has any concerns or suggests a completely different approach, of course.

@aduros I purposefully did not update site/static/img/charset.png, because the two new glyphs do not belong to the 224 that the user should intentionally put on the screen. But if you use a script to generate charset.png from runtimes/web/font.png, then maybe it's better to omit them from font.png as well and rework my additions to drawText a little.

MaxGraey commented 2 years ago

To be honest, I don't really see the point of Unicode support for wasm4. The environment is limited to 64 KB of cartridge memory, 4-color palettes and a 160x160 px screen, so it would be very strange to support Unicode characters between 0 and 0x10FFFF instead of [0, 0x7F] or [0, 0xFF]. We cannot create a bitmap font that covers even a small Unicode range anyway. At most we could cover 0-1024 chars, which would take the full available 64 KB (the current font covers only 28 chars and takes 1792 bytes), and 256 chars would take 16 KB.

SLiV9 commented 2 years ago

I completely agree! WASM-4 should not support Unicode characters beyond U+00FF. Maybe my test images were a bit confusing.

Despite the name, this issue is not about extending/changing the encoding used by WASM-4 from ASCII to UTF-8. However, strings in some modern languages (in particular Rust) are UTF-8 encoded by definition, and UTF-8 and full ASCII are not byte-compatible. So no matter what, a conversion needs to occur, either by the runtime (textFromUtf8) or by the game developer.

And in my opinion it is worth it to keep the public usage of text extremely simple. That is, it should look something like

text("Hello world!", 20, 20);

in all non-assembly programming languages. This pull request does that.

Fundamentally, this issue is a bugfix and a clarification of the docs, not a feature addition.

At most, we can cover 0-1024 chars which takes full available 64kb space (current font cover only 28 chars and takes 1792 bytes). And 256 chars will take 16kb.

The current font indeed takes up 1792 bytes, but supports 224 characters by using 1BPP. My pull request currently adds 16 bytes to help the user identify encoding errors. We could decide to shave this down to 8 bytes by dropping the tofu, or to 0 bytes if we use empty spaces or black boxes. Either way, the number of supported characters would remain 224.
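
For reference, the arithmetic behind those numbers:

const bytesPerGlyph = 8 * 8 / 8;          // 8x8 pixels at 1 bit per pixel = 8 bytes
const currentFont = 224 * bytesPerGlyph;  // 1792 bytes, the current font
const extraGlyphs = 2 * bytesPerGlyph;    // 16 bytes for the two proposed glyphs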

MaxGraey commented 2 years ago

I see. So you just want to remap part of the UTF-8 range, 255-352, to the 127-224 ASCII range. But I think it should be done only for UTF-8 languages, and perhaps without built-in encodings. Bounds checking would also be needed.

SLiV9 commented 2 years ago

Not quite. ASCII treats each byte (0x00 through 0xFF) as a single character, and the 224 characters supported by the WASM-4 font are 0x20 (space) through 0xFF (y with dots), or 32-255.

The Unicode code points matching these 224 printable characters are U+0020 through U+00FF. Numerically, this is still the range 32-255. However, U+0020 through U+007F are encoded in UTF-8 with one byte, whereas U+0080 through U+00FF are encoded with two bytes. For example, U+00C0 (192) is 0xC3 0x80. Hence the need for decoding.
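
Concretely, for U+00C0 (the standard two-byte UTF-8 arithmetic, shown here as TypeScript):

const lead = 0xc0 | (0xc0 >> 6);                       // 0xC3, the lead byte
const cont = 0x80 | (0xc0 & 0x3f);                     // 0x80, the continuation byte
const decoded = ((lead & 0x1f) << 6) | (cont & 0x3f);  // back to 0xC0, i.e. 192, i.e. U+00C0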

SLiV9 commented 1 year ago

Given that #591 solves the main functionality problem (allowing Rust developers to draw "\x80" without UB), and does not suffer the loss of ergonomics that I was afraid of (because text("Hello world!", 20, 20) still compiles), I no longer feel as strongly about parsing UTF-8. I still think it is more correct to re-encode from a programming language's default string encoding to ASCII rather than generate mojibake, but the impact on developers is now much less.

For completeness, here is the output of my test cart for Rust after applying #591 (screenshot: wasm4-screenshot_2022-11-08_ascii_rs). It is the same as before (v2.5.3), except text(b"Press \x80 to blink") now works as a safe, well-behaved alternative to the str::from_utf8_unchecked hack. This does mean that text("£", 20, 20) does not work (that is, it results in mojibake) despite £ being in the character set supported by WASM-4, which feels a bit unintuitive but not inexplicable.

The misnomers textUtf8 and textUtf16 do still annoy me, and in the long term I think renaming them to text8 and text16 (or what have you) would help a lot in clarifying that their behavior is very different from traceUtf8 and traceUtf16. Depending on where the consensus lands on this issue, I could isolate the (backwards compatible) rename I did in #528 from the functional changes.

JerwuQu commented 1 year ago

Updated the original issue with these outtakes

JerwuQu commented 5 months ago

The main issue persists, but to maintain compatibility it cannot change. It has at least been documented now, to avoid future confusion.