WebAssembly/design

UTF-8 for all string encodings #989

Closed by jfbastien 7 years ago

jfbastien commented 7 years ago

Currently:

#984 opens a can of worms w.r.t. using UTF-8 for strings. We could either:

I'm not opposed to it—UTF-8 is super simple and doesn't imply Unicode—but I want the discussion to be a stand-alone thing. This issue is that discussion.

Let's discuss arguments for / against UTF-8 for all strings (not Unicode) in this issue, and vote 👍 or 👎 on the issue for general sentiment.

jfbastien commented 7 years ago

Argument for UTF-8: it's very simple: an encoder and decoder can be written in a few lines of JavaScript. Again, UTF-8 is not Unicode.
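For illustration, here is a minimal sketch of such an encoder/decoder (the function names are invented for this example); it treats each code point as a plain integer and needs no Unicode tables:

```ts
// Minimal sketch: UTF-8 as a pure integer encoding (no validation, no i18n).
// A code point here is just an integer in the range 0..0x10FFFF.

function encodeCodePoint(cp: number): number[] {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000)
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

function decodeCodePoint(bytes: number[], i: number): [cp: number, next: number] {
  const b0 = bytes[i];
  if (b0 < 0x80) return [b0, i + 1];                  // 1-byte sequence (ASCII)
  const extra = b0 >= 0xf0 ? 3 : b0 >= 0xe0 ? 2 : 1;  // trailing byte count
  let cp = b0 & (0x3f >> extra);                      // payload bits of the lead byte
  for (let k = 1; k <= extra; k++) cp = (cp << 6) | (bytes[i + k] & 0x3f);
  return [cp, i + extra + 1];
}
```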

jfbastien commented 7 years ago

Argument against UTF-8: it's ever so slightly more complicated than length + bytes, leading to potential implementation divergences.

tabatkins commented 7 years ago

Again, UTF-8 is not Unicode.

What are you even saying? This is a nonsense sentence.

I think you're trying to say that there's no need to pull in an internationalization library. This is true - mandating that strings are encoded in UTF-8 has nothing to do with all the more complicated parts of Unicode, like canonicalization. Those are useful tools when you're doing string work that interfaces with humans, but in the same way that a trig library is useful to people doing math, and not relevant when deciding how to encode integers.

But UTF-8 is literally a Unicode encoding; your statement is meaningless as written. ^_^

jfbastien commented 7 years ago

But UTF-8 is literally a Unicode encoding; your statement is meaningless as written. ^_^

Yes, I'm specifically referring to the codepoint encoding that UTF-8 describes, not the treatment of codepoints proper (for the purpose of this proposal, a codepoint is an opaque integer). Put in wasm-isms, UTF-8 is similar to var[u]int, but more appropriate to characters. Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

A further proposal would look at individual codepoints and do something with them. This is not that proposal.
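To make the var[u]int analogy concrete, here is a sketch (names invented) of wasm's LEB128 varuint decoder; a UTF-8 code point decoder has the same shape, with the lead byte announcing how many 6-bit payload bytes follow instead of a per-byte continuation bit:

```ts
// Sketch: LEB128 ("varuint"), the variable-length integer encoding wasm uses
// elsewhere. 7 payload bits per byte; the high bit marks continuation.
// (Sketch only: JS bitwise ops limit this to values below 2^31.)
function decodeVaruint(bytes: Uint8Array, i: number): [value: number, next: number] {
  let value = 0;
  let shift = 0;
  for (;;) {
    const b = bytes[i++];
    value |= (b & 0x7f) << shift;          // accumulate 7 payload bits
    if ((b & 0x80) === 0) return [value, i];
    shift += 7;
  }
}
```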

tabatkins commented 7 years ago

And there would be no reason to. No Web API has found the need to introspect on the codepoints beyond strict equality comparison and sorting, unless it's literally an i18n API.

RyanLamansky commented 7 years ago

Another option is byte length + UTF-8 for each code point (@jfbastien, unless this is what you meant when you said UTF-8 for each byte, which I admit didn't make sense to me). I don't think this would make things any more difficult for a primitive parser that doesn't really care, while allowing a sophisticated Unicode library to take a byte array, offset, and length as input and return a string.

I agree with the definition as "UTF-8 code points", which are just integers. The binary spec should leave it at that. Individual embedders can define rules around allowed code points, normalization and other nuances. Analysis tools could provide warnings for potential compatibility issues.

I think error handling decisions should also be left to the embedders. A system that accesses WASM functions by index rather than name has no need for the names to be valid (and they'd be easy to skip over with a byte length prefix).

sunfishcode commented 7 years ago

Here's an attempt at summarizing the underlying issues and their reasons. Corrections and additions are most welcome.

Should wasm require module import/export identifiers be valid UTF-8?

My understanding of the reasons against is:

Should wasm recommend UTF-8 in areas where it doesn't require it?

The reason for would be that even if we can't require it, mentioning UTF-8 may discourage needless incompatibilities among the ecosystem.

My understanding of the reason against is that even mentioning UTF-8 would compromise the conceptual encapsulation of string interpretation concerns.

Should wasm specify UTF-8 for name-section names?

The reason for is: The entire purpose of these names is to be converted into strings for display, which is not possible without an encoding, so we should just specify UTF-8 so that tools don't have to guess.

My understanding of the reason against is: If wasm has other string-like things in other areas that don't have a designated encoding (i.e. imports/exports as discussed above), then for consistency's sake it shouldn't designate encodings for any strings.

rossberg commented 7 years ago

@sunfishcode provides a good summary, but I want to add three crucial points.

@jfbastien, it would be the most pointless of all alternatives to restrict binary syntax (an encoding) but not semantics (a character set) for strings. So for all practical purposes, UTF-8 implies Unicode. And again, this is not just about engines. If you define names to be Unicode, then you are forcing that on all Wasm ecosystems in all environments. And that pretty much means that all environments would be required to have some Unicode support.

@tabatkins, I think there is a domain error underlying your argument. None of the strings we are talking about are user-facing. They are dev-facing names. Many/most programming languages do not support Unicode identifiers, nor do tools. Can e.g. gdb handle Unicode source identifiers? I don't think so. So it is quite optimistic (or rather, unrealistic) to assume that all consumers have converged on Unicode in this space.

And finally, the disagreement is not whether Wasm on the Web should assume UTF-8, but where we specify that.

tabatkins commented 7 years ago

I think there is a domain error underlying your argument. None of the strings we are talking about are user-facing. They are dev-facing names. Many/most programming languages do not support Unicode identifiers, nor do tools. Can e.g. gdb handle Unicode source identifiers? I don't think so. So it is quite optimistic (or rather, unrealistic) to assume that all consumers have converged on Unicode in this space.

"dev-facing" means "arbitrary toolchain-facing", which means you need to agree on encoding up-front, or else the tools will have to do encoding "detection" (that is to say, guessing, which is especially bad when applied to short values) or have out-of-band information. Devs are still users. ^_^

If you think a lot of toolchains aren't going to understand Unicode, then I'm unsure why you think they'd understand any other arbitrary binary encoding. If that's your limitation, then just specify and require ASCII, which is 100% supported everywhere. If you're not willing to limit yourself to ASCII, tho, then you need to accept that there's a single accepted non-ASCII encoding scheme - UTF-8.

Saying "eh, most things probably only support ASCII, but we'll let devs put whatever they want in there just in case" is the worst of both worlds.

rossberg commented 7 years ago

Saying "eh, most things probably only support ASCII, but we'll let devs put whatever they want in there just in case" is the worst of both worlds.

@tabatkins, nobody is proposing the above. As I said, the question isn't whether but where to define such platform/environment-specific matters. Wasm is supposed to be embeddable in the broadest and most heterogeneous range of environments, some much richer than others (for example, JS does support Unicode identifiers). Consequently, you want to allow choosing on a per-platform basis. Hence it belongs in platform API specs, not the core spec.

tabatkins commented 7 years ago

There's no choice to make, tho! If your embedding environment doesn't support non-ASCII, you just don't use non-ASCII in your strings. (And if this is the case, you still need encoding assurance - UTF-16 isn't ASCII-compatible, for example!)

If your environment does support non-ASCII, you need to know what encoding to use, and the correct choice in all situations is UTF-8.

What environment are you imagining where it's a benefit to not know the encoding of your strings?

tabatkins commented 7 years ago

it would be the most pointless of all alternatives to restrict binary syntax (an encoding) but not semantics (a character set) for strings. So for all practical purposes, UTF-8 implies Unicode.

No, it absolutely doesn't. For example, it's perfectly reasonable to simultaneously (a) restrict a string to the ASCII characters, and (b) dictate that it's encoded in UTF-8. Using ASCII characters doesn't imply an encoding, or else all encodings would be ASCII-compatible! (For example, UTF-16 is not.) So you still have to specify something; UTF-8, being "ASCII-compatible", is fine for this.

Again, if you are okay with restricting these names to ASCII-only, then it's reasonable to mandate the encoding be US-ASCII. If you want it to be possible to go beyond ASCII, then it's reasonable to mandate the encoding be UTF-8. Mandating anything else, or not mandating anything at all (and forcing all consumers to guess or use out-of-band information), are the only unreasonable possibilities.

And again, this is not just about engines. If you define names to be Unicode, then you are forcing that on all Wasm ecosystems in all environments. And that pretty much means that all environments would be required to have some Unicode support.

Again, this looks like you're talking about internationalization libraries. What we're discussing is solely how to decode byte sequences back into strings; that requires just knowledge of how to decode UTF-8, which is extremely trivial and extremely fast.

Unless you're doing human-friendly string manipulation, all you need is the ability to compare strings by codepoint, and possibly sort strings by codepoint, neither of which require any "Unicode support". This is all that existing Web tech uses, for example, and I don't see any reason Wasm environments would, in general, need to do anything more complicated than this.
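A nice property here: for well-formed UTF-8, plain byte-wise comparison already gives codepoint order, so equality and sorting need nothing Unicode-aware at all. A sketch (helper names invented):

```ts
// Sketch: equality and codepoint-order sorting of UTF-8 names via raw bytes.
// For well-formed UTF-8, lexicographic byte order equals code point order.
function compareUtf8(a: Uint8Array, b: Uint8Array): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return a.length - b.length;        // shorter prefix sorts first
}

const equalUtf8 = (a: Uint8Array, b: Uint8Array) => compareUtf8(a, b) === 0;
```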

lukewagner commented 7 years ago

I'm in favor of mandating utf8 for All The Strings. Pure utf8 decoding/encoding seems like a pretty low impl burden (compared to everything else) for non-Web environments. Also, from what I've seen, time spent validating utf8 for imports/names will be insignificant compared to time spent on everything else, so I don't think there's a performance argument here.

Practically speaking, even if we didn't mandate utf8 in the core wasm spec, you'd have a Bad Time interoperating with anything if your custom toolchain didn't also use utf8 unless you're a total island, and then maybe you just say "screw it" and do your own non-utf8 thing anyway... because then who cares.

What I'd realllly like to do, though, is resolve #984, which seems to block on this...

jfbastien commented 7 years ago

@lukewagner I don't think #984 is blocked on this. 😄

lukewagner commented 7 years ago

I guess you're right.

rossberg commented 7 years ago

What environment are you imagining where it's a benefit to not know the encoding of your strings?

@tabatkins, it seems I've still not been clear enough. I don't imagine such an environment. However, I imagine a wide spectrum of environments with incompatible requirements. Not everything is a subset of UTF-8, e.g. Latin1 is still in fairly widespread use. You might not care, but it is not the job of the core Wasm spec to put needless stones in the way of environment diversity.

you'd have a Bad Time interoperating with anything if your custom toolchain didn't also use utf8 unless you're a total island

@lukewagner, I indeed expect that Wasm will be used across a variety of "continents" that potentially have little overlap. And where they do, you can specify interop (in practice, name encodings are likely gonna be the least problem for sharing modules between different platforms -- it's host libraries). Even total islands are not unrealistic, especially wrt embedded systems (which also tend to have little use for Unicode).

MI3Guy commented 7 years ago

One of the most difficult parts of implementing a non-browser based WebAssembly engine is making things work the way they do in the browser (mainly the JS parts). I expect that if the encoding doesn't get standardized, we will end up with a de facto standard where everyone copies what is done for the web target. This will just result in it being harder to find information on how to decode these strings.

There may be value in allowing some environments to further restrict the allowed content, but not requiring UTF-8 will just result in more difficulty.

rossberg commented 7 years ago

@MI3Guy, the counter proposal is to specify UTF-8 encoding as part of the JS API. So if you are building a JS embedding then it's defined to be UTF-8 either way and makes no difference for you. (However, we also want to allow for other embedder APIs that are neither Web nor JavaScript.)

MI3Guy commented 7 years ago

Right. My point is if you are not doing a JS embedding, you are forced to emulate a lot of what the JS embedder does in order to use the WebAssembly toolchain.

pipcet commented 7 years ago

Do varuint for number of codepoints + UTF-8 for each codepoint.

I'd just like to speak out against this option. It complicates things, doesn't and cannot apply to user-specific sections, and provides no benefit that I can see—in order to know the number of codepoints in a UTF-8 string, in practice you always end up scanning the string for invalid encodings, so you might as well count codepoints while you're at it.

tabatkins commented 7 years ago

Not everything is a subset of UTF-8, e.g. Latin1 is still in fairly widespread use. You might not care, but it is not the job of the core Wasm spec to put needless stones in the way of environment diversity.

Correct; UTF-8 differs from virtually every encoding once you leave the ASCII range. I'm unsure what your point is with this, tho. Actually using the Latin-1 encoding is bad precisely because there are lots of other encodings that look the same but encode different letters. If you tried to use the name "æther" in your Wasm code and encoded it in Latin-1, then someone else who (justifiably) tries to read the name with a UTF-8 toolchain will get a decoding error. Or maybe the other person was making a similar mistake, but used the Windows-1250 encoding instead (intended for Central/Eastern European languages) - they'd get the nonsense word "ćther".

I'm really not sure what kind of "diversity" you're trying to protect here. There is literally no benefit to using any other encoding, and tons of downside. Every character you can encode in another encoding is present in Unicode and can be encoded in UTF-8, but the reverse is almost never true. There are no relevant tools today that can't handle UTF-8; the technology is literally two decades old.

I keep telling you that web standards settled this question years ago, not because Wasm is a web spec that needs to follow web rules, but because text encoding is an ecosystem problem that pretty much everyone has the same problems with, and the web already dealt with the pain of getting this wrong, and has learned how to do it right. There's no virtue in getting it wrong again in Wasm; every environment that has to encode text either goes straight to UTF-8 from the beginning, or makes the same mistakes and suffers the same pain that everyone else does, and then eventually settles on UTF-8. (Or, in rare cases, develops a sufficiently isolated environment that they can standardize on a different encoding, and only rarely pays the price of communicating with the outside environment. But they standardize on an encoding, which is the point of all this.)

tabatkins commented 7 years ago

So if you are building a JS embedding then it's defined to be UTF-8 either way and makes no difference for you. (However, we also want to allow for other embedder APIs that are neither Web nor JavaScript.)

This issue has nothing to do with the Web or JS. Every part of the ecosystem wants a known, consistent text encoding, and there's a single one that is widely agreed upon across programming environments, countries, and languages: UTF-8.

qwertie commented 7 years ago

I vote for 'Do varuint for length (in bytes) + UTF-8 for each byte'. Assuming that's not a controversial choice - pretty much every string implementation stores strings as "number of code units" rather than "number of code points", because it's simpler - then isn't the real question "should validation fail if a string is not valid UTF-8"?

As I pointed out in #970, invalid UTF-8 can be round-tripped to UTF-16, so if invalid UTF-8 is allowed, software that doesn't want to store the original bytes doesn't have to. On the other hand, checking if UTF-8 is valid isn't hard (though we must answer - should overlong sequences be accepted? surrogate characters?)
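For concreteness, a strict validator might look like the following sketch, taking the answer to both questions to be "no" (RFC 3629 UTF-8 rejects overlong sequences and surrogate code points):

```ts
// Sketch: strict UTF-8 validation, rejecting overlong encodings, UTF-16
// surrogates (U+D800..U+DFFF), values above U+10FFFF, and truncated sequences.
function isValidUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i++];
    if (b0 < 0x80) continue;                          // ASCII, always fine
    let extra: number;
    let min: number;
    if ((b0 & 0xe0) === 0xc0) { extra = 1; min = 0x80; }
    else if ((b0 & 0xf0) === 0xe0) { extra = 2; min = 0x800; }
    else if ((b0 & 0xf8) === 0xf0) { extra = 3; min = 0x10000; }
    else return false;                                // stray continuation or bad lead byte
    if (i + extra > bytes.length) return false;       // truncated sequence
    let cp = b0 & (0x3f >> extra);
    for (let k = 0; k < extra; k++) {
      const b = bytes[i++];
      if ((b & 0xc0) !== 0x80) return false;          // not a continuation byte
      cp = (cp << 6) | (b & 0x3f);
    }
    if (cp < min) return false;                       // overlong encoding
    if (cp >= 0xd800 && cp <= 0xdfff) return false;   // surrogate code point
    if (cp > 0x10ffff) return false;                  // beyond Unicode's range
  }
  return true;
}
```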

On the whole I'm inclined to say let's mandate UTF-8. In the weird case that someone has bytes they can't translate to UTF-8 (perhaps because the encoding is unknown), arbitrary bytes can be transliterated to UTF-8.

rossberg commented 7 years ago

I'm really not sure what kind of "diversity" you're trying to protect here.

@tabatkins, yes, that seems to be the core of the misunderstanding.

It is important to realise that WebAssembly, despite its name, is not limited to the web. We are very cautious to define it in suitable layers, such that each layer is as widely usable as possible.

Most notably, its core is not actually a web technology at all. Instead, try to think of it as a _virtual ISA_. Such an abstraction is useful in a broad spectrum of different environments, from very rich (the web) to very rudimentary (embedded systems), that do not necessarily have anything to do with each other, may be largely incompatible, and have conflicting constraints (that Wasm is in no position to change).

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

There will, however, be additional spec layers on top of this core spec that define its embedding and API in concrete environments (such as JavaScript). It makes perfect sense to fix string encodings on that level, and by all means, we should.

rossberg commented 7 years ago

PS: A slogan that defines the scope of Wasm is that it's an abstraction over common hardware, not an abstraction over common programming languages. And hardware is agnostic to software concerns like string encodings. That's what ABIs are for.

jfbastien commented 7 years ago

@rossberg-chromium

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

I agree 100%. This issue isn't about Unicode though, it's purely about UTF-8, an encoding for integers, without mandating that the integers be interpreted as Unicode.

I don't understand if we agree on that. Could you clarify: are you OK with UTF-8, and if not why?

rossberg commented 7 years ago

@jfbastien, would it be any more productive to require UTF-8 conformance for all C string literals?

As I noted earlier, it makes no sense to me to restrict the encoding but not the character set. That's like defining syntax without semantics. Why would you possibly do that? You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

jfbastien commented 7 years ago

@jfbastien, would it be any more productive to require UTF-8 conformance for all C string literals?

I don't understand, can you clarify?

As I noted earlier, it makes no sense to me to restrict the encoding but not the character set. That's like defining syntax without semantics. Why would you possibly do that? You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

I think that's the crux of the discussion.

@tabatkins touched on precedents to exactly this:

Again, this looks like you're talking about internationalization libraries. What we're discussing is solely how to decode byte sequences back into strings; that requires just knowledge of how to decode UTF-8, which is extremely trivial and extremely fast.

Unless you're doing human-friendly string manipulation, all you need is the ability to compare strings by codepoint, and possibly sort strings by codepoint, neither of which require any "Unicode support". This is all that existing Web tech uses, for example, and I don't see any reason Wasm environments would, in general, need to do anything more complicated than this.

So I agree: this proposal is, in your words, "defining syntax without semantics". That's a very common thing to do. In fact, WebAssembly's current length + bytes specification already does this!

I'd like to understand what the hurdle is. I don't really see one.

tabatkins commented 7 years ago

It is important to realise that WebAssembly, despite its name, is not limited to the web.

I just stated in the immediately preceding comment that this has nothing to do with the web. You keep trying to use this argument, and it's really confusing me. What I'm saying has nothing to do with the web; I'm merely pointing to the web's experience as an important example of lessons learned.

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

You're not making the point you think you're making - C does have a built-in encoding, as string literals use the ASCII encoding. (If you want anything else you have to do it by hand by escaping the appropriate byte sequences.) In more current C++ you can have UTF-16 and UTF-8 string literals, and while you can still put arbitrary bytes into the string with \x escapes, the \u escapes at least verify that the value is a valid codepoint.

All of this is required, because there is no inherent mapping from characters to bytes. That's what an encoding does. Again, not having a specified encoding just means that users of the language, when they receive byte sequences from other parties, have to guess at the encoding to turn them back into text.

You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

Can you please point to an environment in existence that uses characters that aren't included in Unicode? You keep trying to defend this position from a theoretical purity / environment diversity standpoint, but literally the entire point of Unicode is to include all of the characters. It's the only character set that can make a remotely credible argument for doing so, and when you're using the Unicode character set, UTF-8 is the preferred universal encoding.

What diversity are you attempting to protect? It would be great to see even a single example. :/

rossberg commented 7 years ago

@tabatkins:

It is important to realise that WebAssembly, despite its name, is not limited to the web.

I just stated in the immediately preceding comment that this has nothing to do with the web. You keep trying to use this argument, and it's really confusing me. What I'm saying has nothing to do with the web; I'm merely pointing to the web's experience as an important example of lessons learned.

What I am trying to emphasise is that Wasm should be applicable to as many platforms as possible, modern or not. You keep arguing from the happy end of the spectrum where everything is Unicode and/or UTF-8, and everything else is just deprecated.

You're not making the point you think you're making - C does have a built-in encoding, as string literals use the ASCII encoding. (If you want anything else you have to do it by hand by escaping the appropriate byte sequences.) In more current C++ you can have UTF-16 and UTF-8 string literals, and while you can still put arbitrary bytes into the string with \x escapes, the \u escapes at least verify that the value is a valid codepoint.

No, that is incorrect. The C spec does not require ASCII. It does not even require compatibility with ASCII. It allows almost arbitrary "source character sets" and string literals can contain any character from the full set. There are no constraints regarding encoding, it is entirely implementation-defined. There have been implementations of C running on EBCDIC platforms, and that is still supported by the current standard. GCC can process sources in any iconv encoding (of which there are about 140 besides UTF-8), e.g. UTF-16 which is popular in Asia. C++ is no different.

(That should also answer @jfbastien's question.)

All of this is required, because there is no inherent mapping from characters to bytes. That's what an encoding does. Again, not having a specified encoding just means that users of the language, when they receive byte sequences from other parties, have to guess at the encoding to turn them back into text.

Again: this will be suitably specified per environment. When somebody receives a Wasm module from somebody else operating in the same ecosystem then there is no problem. No JS dev will ever need to care.

If, however, somebody is receiving a module from another ecosystem then there are plenty of other sources of incompatibility to worry about, e.g. expectations about API, built-in libraries, etc. Both parties will need to be explicit about their interop assumptions anyway. Agreeing on a name encoding is gonna be the least of their problems.

You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

Can you please point to an environment in existence that uses characters that aren't included in Unicode? You keep trying to defend this position from a theoretical purity / environment diversity standpoint, but literally the entire point of Unicode is to include all of the characters. It's the only character set that can make a remotely credible argument for doing so, and when you're using the Unicode character set, UTF-8 is the preferred universal encoding.

What diversity are you attempting to protect? It would be great to see even a single example. :/

For example, here is a list of embedded OSes: https://en.wikipedia.org/wiki/Category:Embedded_operating_systems. Some of them likely use UTF-8, some won't. Some may find a use for Wasm, most probably won't. But there is no benefit for us in making it less convenient for them.

One entry from that list that you're probably still familiar with is DOS. As much as we'd all like it to die, DOS systems are still lively, and they use OEM code pages.

@jfbastien:

So I agree: this proposal is, in your words, "defining syntax without semantics". That's a very common thing to do. In fact, WebAssembly's current length + bytes specification already does this!

The rare occurrences of such a thing that I am aware of all have to do with providing an escape hatch for implementation-specific behaviour. That's also the only reasonable use case. That makes no sense here, though. If you want to provide such an escape hatch for strings, then why bother requiring UTF-8, instead of allowing any byte string "syntax"? That's syntax without semantics as a disabler, not an enabler.

I'd like to understand what the hurdle is. I don't really see one.

That some clients cannot simply use all byte values but have to go through redundant UTF encodings that have no use in their ecosystem. That all tools in their tool chains will have to bother with it as well. That it creates additional error cases (out of range values) that wouldn't otherwise exist for them.

Let me ask the other way round: What is the benefit (in their ecosystems)? I don't really see one.

flagxor commented 7 years ago

@tabatkins Want to make sure I understand where the dividing line lies. To be clear, you're suggesting ONLY utf-8 encoding of code points, regardless of whether they're invalid in combination (that can be done in 10 lines of code). Bold caps could for instance be used in the spec to indicate: You're doing something wrong if you think you need an internationalization library to implement Wasm?

Goals of this would be:

Questions?

@rossberg-chromium

tabatkins commented 7 years ago

To be clear, you're suggesting ONLY utf-8 encoding of code points, regardless of whether they're invalid in combination (that can be done in 10 lines of code).

Yes, tho I don't believe there are any invalid combinations; there are just some individual codepoints (the ones reserved for UTF-16 surrogates) that are technically invalid to encode as UTF-8. That said, if full byte control is desirable, the WTF-8 encoding does exist, but we should be very explicit about "yes, we want to allow these strings to actually contain arbitrary non-string data in them sometimes" as a goal if we go that way. The WTF-8 (and WTF-16) format is only intended to provide a formal spec for environments that have backwards-compat constraints on enforcing UTF-* well-formedness.

Bold caps could for instance be used in the spec to indicate: You're doing something wrong if you think you need an internationalization library to implement Wasm?

Yes, i18n isn't required in any way, shape, or form. CSS defaults to UTF-8, for example, and just does raw codepoint comparison/sorting when it allows things outside the ASCII range. No reason for Wasm to go any further than this, either.

Is there any danger this becomes a creeping requirement for more validation? I think my core concern in this space would be that it will always be an unreasonable burden to swallow, say, ICU as a dependency.

The web platform has never needed to impose additional validation on bare names so far. My experience suggests it will never be necessary.

I assume this implies the goal of actively [dis]couraging encodings like Latin1 that clash with UTF-8? I.e. toolchains that emit it would be non-compliant, implementations that accept it similarly so.

Yes, with the change to "discouraging" in your words. ^_^ The whole point is that producers and consumers can reliably encode and decode strings to/from byte sequences without having to guess at what the other endpoint is doing. This has been a horrible pain for every environment that has ever encountered it, and there's a widely-adopted solution for it now.

I grok the web has historically had trouble unifying this space due to overlapping use of bits from regions that previously were encoding islands. On the other hand, my impression is that UTF-8 sets things up such that the costs of the transition are disproportionately borne by non-ASCII folks, and that some regions have more baked in. I would imagine the unicode transition is a practical inevitability (and nearly complete). Is there some centralized doc / entity we can point to that addresses how some of the political and regional issues around unicode have been resolved on the web?

Yes, it definitely had issues in the transition; HTML is still required to default to Latin-1 due to back-compat, and there are still some small pockets of web content that prefer a language-specific encoding (mostly Shift-JIS, a Japanese-language encoding). But the vast majority of the world switched over the last two decades, and the transition is considered more or less complete now.

The "UTF-8 burdens non-ASCII folks" has been a pernicious, but almost entirely untrue, rumor for a long time. Most European languages include the majority of the ASCII alphabet in the first place, so most of their text is single-byte sequences and ends up smaller than UTF-16. The same applies to writing systems like Pinyin. CJK langs mostly occupy the 3-byte UTF-8 region, but they also include large amounts of ASCII characters, particularly in markup languages or programming languages, so also, in general, see either smaller or similar encoded sizes for UTF-8 as for UTF-16 or their specialized encodings.

It's only for large amounts of raw text in CJK or non-ASCII alphabets such as Cyrillic that we see UTF-8 actually take up more space than a specialized encoding. These were concerns, however, in the early 90s, when hard drive capacity was measured in megabytes and a slight blow-up in text file sizes was actually capable of being significant. This hasn't been a concern for nearly 20 years; the size difference is utterly inconsequential now.

Wrt to "the Unicode transition", that has already happened pretty universally. A text format that doesn't require itself to be encoded with UTF-8 these days is making a terrible, ahistoric mistake.

I'm not sure of any specific document that outlines this stuff, but I'll bet they exist somewhere. ^_^

RyanLamansky commented 7 years ago

If the goal is to keep the binary spec as pure as possible, let's remove names entirely. All its internal references are based on index, anyway.

Instead, add a mandatory custom section to the JavaScript specification that requires UTF-8. Other environments, such as the Soviet-era mainframe that @rossberg-chromium is alluding to, can define their own custom section. A single WASM file could support both platforms by providing both custom sections. It would be relatively straightforward for custom tooling to generate an obscure platform's missing section by converting a more popular one.

jfbastien commented 7 years ago

If the goal is to keep the binary spec as pure as possible, let's remove names entirely. All its internal references are based on index, anyway.

That's a rework of how import / export works. It's not on the table and should be suggested in a different issue than this one.

rossberg commented 7 years ago

@bradnelson, AFAICS, prescribing a specific encoding but no character set combines the worst of both worlds: it imposes costs in terms of restrictions, complexity, and overhead with no actual benefit in terms of interop. I guess I'm still confused what the point would be.

sunfishcode commented 7 years ago

@rossberg-chromium The primary benefit being sought here is to relieve tools and libraries from the burden of guessing.

Since that's the goal, any of the variants being discussed (UTF-8 vs. WTF-8 etc.) would be better than nothing, because even in the worst case, "I'm positive I can't transcode these bytes literally" is better than "these bytes look like they might be windows-1252; maybe I'll try that". Guessing is known to be error prone.

rossberg commented 7 years ago

@sunfishcode, how? I'm still lost.

So here is a concrete scenario. Suppose we are on different platforms and I am trying to pass you a module. Suppose for the sake of argument that my platform uses EBCDIC and yours ASCII. Totally legit under the current proposal. Yet, my module will be completely useless to you and your tool chain.

Both these encodings are 7 bit, so UTF-8 doesn't even enter the picture.

So what would UTF-8 bring to the table? Well, I could "decode" any unknown string I get. But for all I know, the result is just another opaque binary blob of 31 bit values. It doesn't provide any information. I have no idea how to relate it to my own strings.

So, then, why would I even bother to decode an unknown string? Well, I wouldn't! I could just as well work with the original binary blob of 8 bit values and save space and cycles. The spec would still require me to spend cycles to vacuously validate the encoding, though.

Considering all that, what would (core) Wasm or tools gain by adopting this particular proposal?

tabatkins commented 7 years ago

AFAICS, prescribing a specific encoding but no character set combines the worst of both worlds: it imposes costs in terms of restrictions, complexity, and overhead with no actual benefit in terms of interop. I guess I'm still confused what the point would be.

We're definitely imposing a character set - the Unicode character set. JF was phrasing things very confusingly earlier, pay no attention. That doesn't mean we need to add checks to Wasm to actually enforce this; decoders are typically robust enough to deal with invalid characters. (The web, for example, typically just replaces them with U+FFFD REPLACEMENT CHARACTER.)
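The Web's own TextDecoder API shows both policies side by side: the default mode substitutes U+FFFD for invalid sequences, while { fatal: true } turns them into errors. A small example:

```ts
// "hi" followed by 0xFF, which can never appear in well-formed UTF-8.
const bytes = new Uint8Array([0x68, 0x69, 0xff]);

// Default (lenient) decoding: the invalid byte becomes U+FFFD.
console.log(new TextDecoder("utf-8").decode(bytes)); // "hi\uFFFD"

// Strict decoding: the same input is rejected outright.
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bytes);
} catch {
  console.log("fatal mode throws instead of replacing");
}
```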

So here is a concrete scenario. Suppose we are on different platforms and I am trying to pass you a module. Suppose for the sake of argument that my platform uses EBCDIC and yours ASCII. Totally legit under the current proposal. Yet, my module will be completely useless to you and your tool chain.

You need to stop pretending that decades-old systems are not only relevant, but so relevant that they justify making decisions that go against everything we've learned about encoding pain over those same decades. You're helping no one with this insistence that WebAssembly contort itself to maximize convenience when chattering with ancient mainframes, while ignoring the benefit of everyone else in the world being able to communicate textual data reliably. You're just going to hurt the language and make 99.9% (as a very conservative estimate) of users' lives harder.

Many different systems went thru all of this mess. The encoding wars were not fun; they wasted a lot of money and a lot of time and resulted in a lot of corrupted text. We finished those wars, tho. Unicode was created, and promulgated, and became the dominant character set across the entire world, to the point that all other character sets are literally nothing more than historical curiosities at this point. We still have low-level simmering fights over whether to use UTF-16 vs UTF-8, but at least those two are usually easy to tell apart (look at the BOM, or look for a preponderance of null bytes), and overall UTF-8 dominates handily.
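A rough sketch of the UTF-16-vs-UTF-8 heuristic mentioned above (it is only a guess, which is exactly the problem with unspecified encodings):

```ts
// Heuristic sketch: a BOM or a large share of zero bytes suggests UTF-16.
function looksLikeUtf16(bytes: Uint8Array): boolean {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return true; // UTF-16LE BOM
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return true; // UTF-16BE BOM
  }
  let zeros = 0;
  for (const b of bytes) if (b === 0) zeros++;
  return bytes.length > 0 && zeros / bytes.length > 0.25;    // ASCII-ish UTF-16 is ~half NULs
}
```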

Your insistence on encoding freedom ignores all of this history, all the lessons learned in the two decades since Unicode was introduced. It ignores all the experience and expertise that have gone into designing modern systems, which have had the effect of making encoding issues invisible to most users, because systems can count on everything being encoded in a particular way. You are going to create serious, pernicious, expensive problems if you persist in this, one mojibake at a time.

sunfishcode commented 7 years ago

@rossberg-chromium

So here is a concrete scenario. Suppose we are on different platforms and I am trying to pass you a module. Suppose for the sake of argument that my platform uses EBCDIC and yours ASCII. Totally legit under the current proposal. Yet, my module will be completely useless to you and your tool chain.

So what would UTF-8 bring to the table? Well, I could "decode" any unknown string I get. But for all I know, the result is just another opaque binary blob of 31 bit values. It doesn't provide any information. I have no idea how to relate it to my own strings.

UTF-8 would tell you exactly how to relate it to your own strings. That's exactly the problem that it solves. (WTF-8 would too when it can, and it would tell you unambiguously when it can't.)

Do you mean an arbitrary data structure mangled into string form and then encoded as UTF-8? It's true that you wouldn't be able to demangle it, but you could at least unambiguously display the mangled name as a string, which is an improvement over not having anything for some use cases.

Do you mean the discussion above about using UTF-8 as an encoding of opaque integers and not Unicode? I think the discussion has gotten somewhat confused. It's tempting to call encoding "syntax" and internationalization "semantics", but that obscures a useful distinction: UTF-8 can still say that a certain byte sequence means "Ö" without saying what consumers have to do with that information. Used in this way, it is an encoding of Unicode, but it doesn't require the kind of cost that "Unicode Support" has been used to suggest above.

So, then, why would I even bother to decode an unknown string? Well, I wouldn't! I could just as well work with the original binary blob of 8 bit values and save space and cycles. The spec would still require me to spend cycles to vacuously validate the encoding, though.

I've now built a SpiderMonkey with full UTF-8 validation of wasm import/export identifiers, including overlong and surrogates. I was unable to detect a performance difference in WebAssembly.validate, either on AngryBots, or on a small emscripten-compiled testcase that nonetheless has 30 imports.

The spec is a compromise between multiple concerns. I appreciate the concern of startup time, so I've now conducted some experiments and measured it. I encourage others to do their own experiments.

annevk commented 7 years ago

Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

Which integers can UTF-8 encode that are not part of Unicode (i.e., outside the range U+0000 to U+10FFFF)? That statement seems false.

tabatkins commented 7 years ago

If you don't validate your characters, you can encode any 21-bit integer.
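To make that concrete, the 4-byte UTF-8 layout carries 3 + 6 + 6 + 6 = 21 payload bits, so an unvalidated decoder will happily round-trip any integer up to 0x1FFFFF, including surrogates and values above U+10FFFF that Unicode never assigns. A sketch (function name invented):

```ts
// Sketch: pack any 21-bit integer into the 4-byte UTF-8 bit layout.
// No validation: small values come out overlong, large ones non-Unicode.
function encode21(n: number): number[] {
  return [
    0xf0 | (n >> 18),          // lead byte: 3 payload bits
    0x80 | ((n >> 12) & 0x3f), // continuation: 6 bits
    0x80 | ((n >> 6) & 0x3f),  // continuation: 6 bits
    0x80 | (n & 0x3f),         // continuation: 6 bits
  ];
}

// Bytes 0xF7 0xBF 0xBF 0xBF: decodes back to 0x1FFFFF, above U+10FFFF.
console.log(encode21(0x1fffff));
```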

annevk commented 7 years ago

Not quite sure why we wouldn't validate...

@flagxor https://encoding.spec.whatwg.org/ describes the various encodings exposed to the web. Note that none of them go outside the Unicode character set, but they're obviously not all byte-compatible with each other.

tabatkins commented 7 years ago

What would "validation" do? Make your wasm program invalid? I don't think there's any actual consequences that can be reasonably imposed.

Like, using an invalid escape in CSS just puts a U+FFFD into your stylesheet, it doesn't do anything weird.

jfbastien commented 7 years ago

@annevk:

Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

Which integers can UTF-8 encode that are not part of Unicode (i.e., outside the range U+0000 to U+10FFFF)? That statement seems false.

At a minimum: U+FFFE and U+FFFF are noncharacters in Unicode. The codepoints (the integer values) will never be used by Unicode to encode characters, but they can be encoded in UTF-8.

annevk commented 7 years ago

They are still Unicode code points though. I wouldn't focus too much on "characters".

annevk commented 7 years ago

@tabatkins decoding to U+FFFD is reasonable, but that limits the number of integers you can get.

luser commented 7 years ago

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

You might take note that C11 added char16_t and char32_t types as well as a u prefix for UTF-16-encoded string literals, a U prefix for UCS-4-encoded string literals, and a u8 prefix for UTF-8 encoded string literals. I didn't dig quite deep enough to find their rationale for adding them, but I assume "dealing with Unicode in standard C/C++ is a nightmare" is at least part of the motivation.

rossberg commented 7 years ago

@tabatkins, @sunfishcode, okay, so you are not talking about the same thing. But AFAICT @jfbastien has been stating explicitly and repeatedly that his proposal is about specifying UTF-8 without the Unicode character set.

That also is the only interpretation under which the claim of low cost holds up.

Because if we actually do assume that UTF-8 implies Unicode then this requirement certainly is much more expensive than just UTF-8 encoding/decoding for any tool on any system that does not yet happen to talk (a subset of) Unicode -- they'd need to include a full transcoding layer.

@tabatkins, core Wasm will be embedded in pre-existing systems -- sometimes for other reasons than portability -- that it has no power to change or impose anything on. If they face the problems you describe then those exist independent of Wasm. We cannot fix their problems.

The likely outcome of trying to impose Unicode on all of them would be that some potential ones will simply violate that part of the specification, rendering it entirely moot (or worse, they'll disregard Wasm altogether).

If OTOH we specify it at an adequate layer then we don't run that risk -- without losing anything in practice.

rocallahan commented 7 years ago

Because if we actually do assume that UTF-8 implies Unicode then this requirement certainly is much more expensive than just UTF-8 encoding/decoding for any tool on any system that does not yet happen to talk (a subset of) Unicode -- they'd need to include a full transcoding layer.

What platforms exist that use a native character set that's not Unicode, not ASCII, have no facilities for converting those characters to/from Unicode, and would need to use non-ASCII identifiers in Wasm? (I mean really exist, not some hypothetical Russian organization that decides to use Wasm in DOS.)

flagxor commented 7 years ago

@rocallahan I believe @rossberg-chromium is concerned (or at least I would be) with devices like embedded systems, which would not want the added cost of a full ICU library. They would either be forced to accept bloat, not do full validation, or not accept wasm files containing non-ascii characters (which they might not have control over).

Also, strictly speaking, such devices often include hardware that has non-standard character sets, like: https://www.crystalfontz.com/product/cfah1602dyyhet-16x2-character-lcd?kw=&origin=pla#datasheets https://www.crystalfontz.com/products/document/1078/CFAH1602DYYHET_v2.1.pdf (which has a goofy mixed ascii + latin1 + japanese character set). But the concern is what you are obliged to validate, which is relevant regardless.

@tabatkins, though, has (I thought) indicated that the intent is: