WebAssembly / stringref


Require valid UTF-8 string literals? #2

Open wingo opened 2 years ago

wingo commented 2 years ago

String literals come from a special data section containing WTF-8-encoded bytes. However, from https://simonsapin.github.io/wtf-8/#intended-audience: "WTF-8 must not be used [...] for transmission over the Internet." (Tx @jakobkummerow for noticing).

So the open question in this issue is whether it is OK to embed WTF-8 string literals in WebAssembly modules, or whether having WTF-8 byte sequences in WebAssembly modules poses some larger problem for the ecosystem.
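For concreteness: WTF-8 is a strict superset of UTF-8 that additionally permits the surrogate codepoints U+D800 through U+DFFF to appear unpaired, encoded as the usual three-byte sequences, which strict UTF-8 validation rejects. A minimal sketch of that boundary using the JDK's charset machinery (illustration only, not anything from the proposal):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Wtf8VsUtf8 {
    public static void main(String[] args) {
        // WTF-8 encodes the unpaired surrogate U+D800 as ED A0 80;
        // this exact byte sequence is malformed under strict UTF-8.
        byte[] unpairedSurrogate = { (byte) 0xED, (byte) 0xA0, (byte) 0x80 };
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(unpairedSurrogate));
            System.out.println("valid UTF-8");
        } catch (CharacterCodingException e) {
            // A strict UTF-8 decoder reports these bytes as malformed.
            System.out.println("rejected by strict UTF-8: " + e);
        }
    }
}
```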

jakobkummerow commented 2 years ago

Suggestion: let's require valid UTF-8 for now, and consider relaxing that if a need turns up.

Reasoning: for backwards compatibility, it's always easier to be strict at first and relax the rules later than to be permissive at first and try to introduce stricter requirements after an ecosystem has been established that might rely on the earlier flexibility.

dcodeIO commented 2 years ago

This seems brittle if what can be constructed at runtime and what the literal section can hold don't match up. For example, I wonder what tools like Wizer and Binaryen would do when they precompute string data and aren't able to write every possible value back to the section. They could detect the situation and fail or bail out; otherwise they would produce modules that behave differently after pre-initialization or optimization, respectively.
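To sketch the check such a tool would need (a hypothetical helper, not an actual Wizer or Binaryen API): a precomputed string, modeled here as a Java `String` of UTF-16 code units, can be written back to a UTF-8-only literal section only if it contains no unpaired surrogates.

```java
// Sketch only: a hypothetical pre-write check for a post-processing tool.
// A precomputed string (UTF-16 code units) can be stored in a UTF-8-only
// literal section only if every surrogate code unit is properly paired.
static boolean writableAsUtf8(String precomputed) {
    for (int i = 0; i < precomputed.length(); i++) {
        char c = precomputed.charAt(i);
        if (Character.isHighSurrogate(c)) {
            boolean paired = i + 1 < precomputed.length()
                    && Character.isLowSurrogate(precomputed.charAt(i + 1));
            if (!paired) {
                return false; // unpaired high surrogate: must bail out
            }
            i++; // skip the low half of the pair
        } else if (Character.isLowSurrogate(c)) {
            return false; // low surrogate with no preceding high half
        }
    }
    return true; // safe to re-encode as UTF-8 and write back
}
```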

Horcrux7 commented 2 years ago

In Java I can declare string constants that are invalid UTF-8, because they may contain unpaired surrogates. If I compile such valid Java code, there is no way to correct the literals afterwards.

That is why I would prefer WTF-8 in this literal section.
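To make the Java situation concrete (a self-contained illustration, not code from any toolchain): the constant below is perfectly legal Java, yet it cannot survive a strict-UTF-8 round trip, so a UTF-8-only literal section cannot represent it faithfully.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateConstant {
    // Legal Java: a compile-time string constant with an unpaired surrogate.
    static final String BROKEN = "a\uD800b";

    public static void main(String[] args) {
        byte[] utf8 = BROKEN.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(utf8, StandardCharsets.UTF_8);
        // The UTF-8 encoder substitutes a replacement for the unpaired
        // surrogate, so the constant does not round-trip:
        System.out.println(BROKEN.equals(roundTripped)); // false
    }
}
```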

wingo commented 2 years ago

Hello old issue :)

I think that WTF-8 in string literal sections makes sense and we should go for it. I don't know precisely how to square it with the MUST language from the WTF-8 spec, though.

Note that there is perhaps an analogy with the presence of WTF-16-style data in JVM class files and similar compilation units. In the JVM, for example, strings are encoded using "modified UTF-8", which is not UTF-8 proper: https://docs.oracle.com/javase/specs/jvms/se17/html/jvms-4.html#jvms-4.4.7. Notably, it encodes non-BMP codepoints as two encoded surrogates! (There's also a hacky overlong encoding of codepoint 0.)
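This is easy to observe from Java itself, since `DataOutputStream.writeUTF` emits the same modified UTF-8 that class files use (illustration; byte values per the JVM spec linked above):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        // writeUTF produces the JVM's "modified UTF-8": codepoint 0 gets an
        // overlong two-byte form, and non-BMP codepoints are written as a
        // surrogate pair, each half as its own three-byte sequence.
        out.writeUTF("\u0000\uD83D\uDE00"); // NUL, then U+1F600

        byte[] bytes = buf.toByteArray();
        for (int i = 2; i < bytes.length; i++) { // skip 2-byte length prefix
            System.out.printf("%02X ", bytes[i]);
        }
        System.out.println();
        // Prints: C0 80 ED A0 BD ED B8 80
        //   C0 80             = overlong encoding of U+0000
        //   ED A0 BD ED B8 80 = U+1F600 as two encoded surrogates,
        //                       instead of the standard F0 9F 98 80
    }
}
```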

I see these options for string literal encodings:

  1. Using UTF-8, but with escape sequences, like JS string syntax
  2. Using "UTF-8", but with nonstandard extensions, like the JVM
  3. Using UTF-8, and forbidding isolated surrogates in string literals
  4. Using WTF-16
  5. Relaxing the stringref spec to have some kind of passive data section that is not explicitly defined as WTF-8 but which in practice contains WTF-8
  6. Using WTF-8

I think WTF-8 is the best outcome here. Sure, it contradicts the MUST, but the other options are worse:

  1. embedded escapes would make string decoding more expensive for everyone
  2. this is the worst
  3. this is a strange discontinuity with the rest of stringrefs and makes it less useful as a compile target
  4. this is space-inefficient and just as non-standard
  5. this is squarely in "I respect the letter of the law but not the spirit" territory, and makes eager validation harder
  6. this is... not so bad, in the end? (see the validation sketch below)
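On the eager-validation point: checking WTF-8 well-formedness in one pass looks about as cheap as checking UTF-8. Here is a sketch based on my reading of the WTF-8 spec (not any engine's code); the only deltas from strict UTF-8 are the wider continuation range after a 0xED lead byte and the rule that two adjacent encoded surrogates forming a pair are ill-formed.

```java
/**
 * Sketch of an eager WTF-8 well-formedness check: UTF-8 with the 0xED lead
 * byte additionally allowed to carry surrogate continuations, except that
 * an encoded high surrogate directly followed by an encoded low surrogate
 * is ill-formed (a real pair must use the single four-byte sequence).
 */
static boolean isWellFormedWtf8(byte[] b) {
    boolean prevWasHighSurrogate = false;
    int i = 0;
    while (i < b.length) {
        int b0 = b[i] & 0xFF;
        int len;       // total sequence length
        int min, max;  // allowed range for the first continuation byte
        if (b0 <= 0x7F) { len = 1; min = 0; max = 0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; min = 0x80; max = 0xBF; }
        else if (b0 == 0xE0) { len = 3; min = 0xA0; max = 0xBF; }
        // 0xED is folded into the 0xE1..0xEF case: unlike strict UTF-8,
        // WTF-8 allows its continuation to range over 0x80..0xBF, which
        // is exactly what admits the surrogate codepoints.
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; min = 0x80; max = 0xBF; }
        else if (b0 == 0xF0) { len = 4; min = 0x90; max = 0xBF; }
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; min = 0x80; max = 0xBF; }
        else if (b0 == 0xF4) { len = 4; min = 0x80; max = 0x8F; }
        else return false; // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if (i + len > b.length) return false; // truncated sequence
        if (len > 1) {
            int b1 = b[i + 1] & 0xFF;
            if (b1 < min || b1 > max) return false;
            for (int k = 2; k < len; k++) {
                int bk = b[i + k] & 0xFF;
                if (bk < 0x80 || bk > 0xBF) return false;
            }
        }
        // The one WTF-8-specific rule: no adjacent encoded surrogate pair.
        boolean isHigh = b0 == 0xED && len == 3
                && (b[i + 1] & 0xFF) >= 0xA0 && (b[i + 1] & 0xFF) <= 0xAF;
        boolean isLow = b0 == 0xED && len == 3
                && (b[i + 1] & 0xFF) >= 0xB0 && (b[i + 1] & 0xFF) <= 0xBF;
        if (prevWasHighSurrogate && isLow) return false;
        prevWasHighSurrogate = isHigh;
        i += len;
    }
    return true;
}
```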

Therefore I propose to close this ticket after a few days have passed, and leave things as they are.