Open wingo opened 2 years ago
Suggestion: let's require valid UTF-8 for now; and consider relaxing if a need for that turns up.
Reasoning: For backwards compatibility, it's always easier to be strict at first and relax rules later than to be permissive at first and then try to introduce stricter requirements once an ecosystem has been established that might rely on the earlier flexibility.
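To make the "strict first" option concrete, here is a minimal sketch of what a validator requiring well-formed UTF-8 would accept and reject. The function name `is_valid_utf8` is hypothetical, and this uses Python's built-in strict decoder as a stand-in; it is not how any particular engine validates modules.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if `data` is well-formed UTF-8 (hypothetical helper)."""
    try:
        # Python's "utf-8" codec is strict by default: it rejects
        # encoded surrogates, overlong forms, and truncated sequences.
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Ordinary text passes.
ok = is_valid_utf8("héllo".encode("utf-8"))

# The 3-byte generalized encoding of the lone surrogate U+D800 is
# legal in WTF-8 but is rejected by a strict UTF-8 check.
rejected = is_valid_utf8(b"\xed\xa0\x80")
```

Under such a rule, any literal containing an unpaired surrogate would have to be rejected at validation time or represented some other way.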
This seems brittle if construction and runtime don't match up. For example, I wonder what tools like Wizer and Binaryen would do when they precompute string data and aren't able to write every possible value back to the section. They would either have to detect this case and fail or bail out, or else they would produce modules that behave differently after pre-initialization or optimization, respectively.
In Java I can declare string constants that are invalid UTF-8. If I compile such valid Java code, I can't correct it. To solve this I have two options:
That is why I would prefer only WTF-8 in this literal section.
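As an illustration of why WTF-8 suits this case, the sketch below encodes a sequence of UTF-16 code units (the representation of a Java string constant) into WTF-8. The helper name `encode_wtf8` is an assumption for this example; it relies on Python's `surrogatepass` error handler to emit the 3-byte form for a single unpaired surrogate.

```python
def encode_wtf8(units: list[int]) -> bytes:
    """Encode UTF-16 code units as WTF-8 (hypothetical helper).

    Well-formed surrogate pairs are combined into one supplementary
    code point (4 bytes); unpaired surrogates are kept as 3 bytes.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Combine the pair into a single code point above U+FFFF.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u
            i += 1
        # "surrogatepass" lets a lone surrogate through as 3 bytes.
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


# A constant like "a\uD800b" in Java: a lone lead surrogate between letters.
lone = encode_wtf8([0x61, 0xD800, 0x62])

# A proper pair (U+1F600) comes out as ordinary 4-byte UTF-8.
paired = encode_wtf8([0xD83D, 0xDE00])
```

Every sequence of 16-bit code units has an encoding here, so a compiler never needs to alter a source-level constant to make it fit the section.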
Hello old issue :)
I think that WTF-8 in string literal sections makes sense and we should go for it. I don't know precisely how to square it with the MUST language from the WTF-8 spec, though.
Note that there is perhaps an analogy with the presence of WTF-16 in JVM class files and similar compilation units. For example, in the JVM, strings appear to be encoded using "UTF-8", which is not UTF-8 proper: https://docs.oracle.com/javase/specs/jvms/se17/html/jvms-4.html#jvms-4.4.7. Notably they encode non-BMP codepoints as two encoded surrogates! (There's also a hacky encoding of codepoint 0.)
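For reference, the two quirks of the JVM class-file encoding can be sketched as follows. This is a rough reading of the rules in the linked JVMS section, not code taken from any JVM; the name `encode_modified_utf8` is made up for the example.

```python
def encode_modified_utf8(s: str) -> bytes:
    """Encode `s` the way JVM class files store strings (sketch)."""
    out = bytearray()
    # Work on UTF-16 code units, so non-BMP characters arrive as a
    # surrogate pair and lone surrogates are passed through unchanged.
    units = s.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(units), 2):
        u = (units[i] << 8) | units[i + 1]
        if 0x0001 <= u <= 0x007F:
            out.append(u)  # one byte, same as ASCII
        elif u <= 0x07FF:
            # Two bytes. This branch also catches u == 0, which
            # becomes C0 80 instead of a single zero byte.
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            # Three bytes, including each surrogate half separately.
            out += bytes([
                0xE0 | (u >> 12),
                0x80 | ((u >> 6) & 0x3F),
                0x80 | (u & 0x3F),
            ])
    return bytes(out)


nul = encode_modified_utf8("\x00")            # b"\xc0\x80"
emoji = encode_modified_utf8("\U0001F600")    # six bytes, not four
```

Note the contrast with WTF-8: there a well-formed surrogate pair must be encoded as a single 4-byte sequence, so the six-byte output above would not be valid WTF-8.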
I see these options for string literal encodings:
I think WTF-8 is the best outcome here. Sure, it contradicts the MUST, but the other options are worse:
Therefore I propose to close this ticket after a few days have passed, and leave things as they are.
String literals come from a special data section containing WTF-8-encoded bytes. However, from https://simonsapin.github.io/wtf-8/#intended-audience: "WTF-8 must not be used [...] for transmission over the Internet." (Tx @jakobkummerow for noticing).
So this issue poses an open question: is it OK to embed WTF-8 string literals in WebAssembly modules, or does having WTF-8 byte sequences in WebAssembly modules pose some larger problem for the ecosystem?