Why should strings be lists of Unicode Scalar Values?

WebAssembly / interface-types

Other

642 stars 57 forks source link

Why should strings be lists of Unicode Scalar Values? #135

Closed lukewagner closed 3 years ago

lukewagner commented 3 years ago

This issue lays out the reasoning for why I think strings should be lists of Unicode Scalar Values (as currently written in the explainer). This is a fairly nuanced question with the reasoning currently scattered around a number of issues, repos and specs, so I thought it would be useful to collect it all into one focused issue for discussion. The issue reflects discussions with a bunch of folks recently and over the years (@annevk, @hsivonen, @sunfishcode, @fgmccabe, @tschneidereit, @domenic), so I won’t claim credit for the reasoning. Also, to be clear, this issue only answers half of the overall question about string encoding, but I think it’s the first question we have to answer before we can meaningfully talk about string encodings.

(Note: I intend to update the OP in-place if there are any inaccuracies so that it represents a coherent argument.)

First, a bit of context:

Current proposal

As background, the Unicode standards provides two relevant definitions:

Code Point: an integer referring to the Unicode codespace in the range [0, 0x10FFFF].
Unicode Scalar Value (USV): a Code Point other than a surrogate and thus an integer in one of ranges [0, 0xD7FF] or [0xE000, 0x10FFFF].

Based on these definitions, the current explainer proposes:

The char interface type is a USV
The string interface type is an abbreviation for list char

Thus, string, as currently proposed, contains no surrogates (not just no lone surrogates). For reference: a pair of surrogate Code Units in a valid UTF-16 string is decoded into a single USV and thus valid UTF-16-encoded strings will never decode to strings containing any surrogates.

This is not an encoding or in-memory-representation question

The question of whether strings are lists of Unicode Scalar Values is not a question of encoding or memory representation; rather, it’s a question of: “what are the abstract string values produced by decoding and consumed by encoding?”. Without precisely defining what the set of possible abstract string values is, we can’t even begin to discuss string encoding/decoding since we don’t even know what it is we’re trying to encode or decode. This is especially true in the context of Interface Types, where our goal is to support (via adapter functions) fully programmable encoding/decoding in the future.

Thus, if we’re talking about the abstract strings represented by languages like Java, JS and C#, we’re not talking about “WTF-16” (which is an encoding); we’re talking about “lists of code points not containing surrogate pairs (but potentially containing lone surrogates)”, which for brevity I’ll call Wobbly strings, since Wobbly strings are what a Java/JS/C# string can be faithfully decoded into and encoded from. In particular, a Wobbly string can be encoded by either WTF-8 or WTF-16. Note that the set of Wobbly strings is subtly different and smaller than “lists of Code Points” because surrogate pairs decode into necessarily-non-surrogate code points, so there is no way for a Java/JS/C# string to decode into a surrogate pair. The only major languages I know of whose abstract strings are actually “lists of Code Points” are Python 3 and Haskell.

This is a Component Model question

As of our recent CG-05-25 polls, the Interface Types proposal now has the goals and requirements of the Component Model (as presented and summarized). Concretely, this means we’re explicitly concerned with cross-language/toolchain composition, virtualizability and embeddability, which means we’re very much concerned with whether interfaces using string will be consumable and implementable by a wide variety of languages and hosts with robust, portable behavior. Thus, use cases exclusively focused on particular combinations of languages+hosts may need to be solved by separate proposals targeting those specific languages+hosts if they are in conflict with the explicit goals of broad language/host interoperability.

With all this context in place, I’ll finally get to the reasons for defining string to be a list of USVs:

Reason 1: many languages have no good way to consume surrogates

I think there are a few categories of affected languages (this is based on brief spelunking, so let me know if I got this wrong and I’ll update it):

First, there are languages that simply fix UTF-8 for their built-in string type, in some cases exposing UTF-8 representation details directly in their string operations. The popular languages I found in this category are: Elixir, Julia, Rust and Swift.

Second, there are languages which define strings as “arbitrary arrays of bytes”, leaving the interpretation up to the library functions that operate on them. For the languages in this category that I looked into, the default encoding (for source text and string literals and sometimes built-in syntax like iteration) is increasingly implicitly assumed to be UTF-8 (due to the fact that, as detailed below, most I/O data is UTF-8). While it may seem like these languages have the most flexibility (and thus ability to accommodate surrogates), when porting existing code, the implicit dependency on UTF-8 (in the form of calls to UTF-8-assuming library functions scattered around the codebase) makes targeting anything other than UTF-8 challenging. The popular languages I found in this category are: C/C++, Go, Lua, PHP and Zig.

Third, there are languages that support a variety of encodings and conversion between them, but still disallow surrogates (among other reasons being that they aren’t generally transcodable). The popular languages I found in this category are: R and Ruby.

In all of these categories, the author of the toolchain that is binding the language to the Interface Types string has no great general option for what to do when given a surrogate:

Make incoming surrogates trap. This approach is attractive as it simply makes surrogates “someone else’s fault”, and thus not a corner case that all code in the language’s ecosystem has to worry about. This is an easy answer to pick, however it would make these languages second-class in the component ecosystem because they wouldn't be able to implement the same APIs.
Replace incoming surrogates with the replacement character. This happens by default in many places in many of the above languages that I saw, so it’s also a reasonable default option that avoids putting any burden on the language ecosystem at large. But, as with the previous option, this would make these languages second-class as they wouldn’t be able to faithfully implement the same APIs as other languages.
Produce non-UTF-8 byte strings. This isn’t possible for languages in the first and third categories and risky for languages in the second, due to the increasingly prevalent implicit assumption of UTF-8 noted above. Moreover, unlike the above two options, this is not a “spot fix”: it requires all ported code to use the appropriate non-UTF-8 string operations.
Escape surrogates into valid strings. This could make various simple round-tripping use cases Just Work, without hitting the above snags, but this option implicitly introduces a new micro-format that will need to be supported by any non-trivial string operation that works with the contents of the string (e.g., file system operations), so it’s also not a “spot fix”; it needs ecosystem adoption. Also, escaping can introduce collisions (leading to data corruption) with pre-existing strings since there are no code point sequences reserved for this purpose.
Produce a non-standard/builtin string. This option either requires large-scale changes (converting whole codebases to using the new, non-standard string), which blocks porting use cases, or requires a coercion some time later into the standard string, which means picking one of the above options.

For any particular use case, one of these options may be obvious. However, toolchains have to handle the general case, providing good defaults. In addition to the variable ecosystem cost of the different options, there is also a fixed non-negligible cost in wasted time for the N teams working on the N language toolchains, each of which will have to page in this whole problem space and wade through the space of options. In contrast, with a list of USVs, all the above languages can just do the obvious thing they’re already doing.

Reason 2: strings will often need to be serialized over standardized protocols and media formats, which usually disallow surrogates

A common use of Interface Types will be to describe I/O APIs (e.g., for passing data over networks or reading/writing different media formats). Additionally, several of the component model’s virtualizability use cases involve mocking non-I/O APIs in terms of I/O (e.g., turning a normal import call into an RPC, logging call parameters and results, etc). In both these cases, surrogates run in direct conflict with the binary formats of most existing standard network protocols and standard media formats.

In particular, just considering Web-relevant character sets:

The RFC 2277: IETF Policy on Character Sets and Languages specifies that “When using other charsets than UTF-8, these MUST be registered in the IANA charset registry, if necessary by registering them when the protocol is published.”, and there are no IANA charsets that include surrogates.
The W3C Architectural Specification specifically calls out “Specifications MUST NOT allow the use of surrogate code points.”
The preface of the WHATWG Encoding Standard says “The UTF-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the UTF-8 encoding.”
If a format is to support line-breaking or collation, the Unicode specification says the behavior given surrogates is undefined, possibly resulting in an error.
All XML-based formats reject surrogates.
Popular RPC protocols such as Protobufs and CapnProto mandate UTF-8 for strings/text.
GraphQL strings mandate UTF-8.

On the Web, new APIs and formats created over the last 10 years simply mandate UTF-8, including:

RFC 8259, for all JSON documents that aren’t shared as part of a closed ecosystem.
JS files loaded in newer Worker and ES Module contexts.
The WebSockets text stream APIs.
The json and text getter functions of fetch, XHR and Blob APIs.

There’s also a recent proposal to make this direction more-officially part of the W3C’s design principles.

Thus, insofar as a string needs to transit over any of these protocols, formats or APIs, surrogates will be a problem and the implementer of the mapping will have roughly the same bad options listed above as the language toolchains have.

While it’s tempting to say “that’s just a specific precondition of particular APIs, not the string type’s problem”, the virtualization goals of the component model mean that any interface might be virtualized, so the fact that a string is being used for one of the above is not a detail of the API. In contrast, all these protocols and formats can easily represent lists of USVs.

Reason 3: even the WTF-16 languages will have a bad time if they actually try to pass surrogates across a component boundary

Because of the above two reasons, from the perspective of a WTF-16-language-implemented component, it is a very risky proposition to pass a surrogate across a component boundary (parameter of an import or result of an export). Why? Because there’s no telling whether the other side will trap, convert the surrogate into a replacement character, get mangled or trigger undefined/untested behavior. As an author of a component, there’s also not a fixed set of clients or hosts (that’s the point of components).

Thus, to produce widely-reusable and portable components, even a toolchain for a language that allows lone surrogates would be advised to conservatively scrub these before passing strings to the outside world. In a sense, this is nothing new on the Web: despite JSON being derived from JS, JSON doesn’t allow surrogates while JS does, thus there is an inherent scrubbing process that happens when JS communicates with the outside world via JSON (and similarly with WebSockets, fetch(), etc). Accordingly, the WTF spec specifically advises against ever being used outside of “self-contained systems”.

As an illustrative example: consider instead defining string to be a list of Code Points. As explained above, this would mean string was a superset of the Wobbly strings supported by Java/JS/C#. Why might we do this? For one thing, it would capture the full expressive range of Python 3 and Haskell strings and APIs (which is the same argument for supporting Wobbly strings, just for a smaller set of languages). For another, it would give us a simple definition of char (= Code Point) and string (= list char), which has a number of practical benefits (in contrast to Wobbly strings, which cannot be a “list char” for any definition of “char”). However, now the vast majority of languages and hosts would have to resort to a variant of the abovementioned workarounds which means Python 3 and Haskell would have a Bad Time attempting to actually take advantage of this increased string expressivity. Thus, there would be a distributed cost without a commensurate distributed benefit. I think the situation is the same with Wobbly strings, even if the partitioning of languages is different.

What about binary data in strings?

One potential argument for surrogates is that they may be necessary to capture arbitrary binary data, particularly on the Web. To speak to this concern, it’s important to first clarify something: Web IDL has a ByteString type that is used for APIs (predominantly HTTP header methods like Headers.get()), where a ByteString is intentionally an arbitrary array of bytes. However, ByteString does this not by interpreting a JS string as a raw array of uint16s (which would have a problem representing byte strings of odd length), but by requiring each JS string element (a uint16 value) to be in the range [0, 255], throwing an exception otherwise. Since surrogates are outside the range [0, 255], this means that the one place in the Web Platform where binary data is actually appropriate, surrogates are irrelevant.

Outside ByteString use cases, there’s still a theoretical possibility of wanting to round-trip binary data through DOMString APIs. Talking to folks who have worked for years on the Web IDL and Encoding specs (@annevk, @hsivonen, @domenic), they’re not aware of any valid use cases for such usage of DOMString. Indeed, the TextDecoder API does not provide any way to produce a non-USVString, due to this same lack of use cases. In fact, there is currently no direct way (i.e., not involving String.fromCharCode et al) to decode an array of bytes into a non-USVString on the Web Platform today.

Instead, the natural way for a component to pass binary data is a list u8 or list u16, using JS glue code to convert the byte array into a JS string. If these use cases were found and found to be on performance sensitive paths in real workloads on the Web, then it seems like a Web-specific solution would be appropriate, and I can think of a number of options for how to optimize this path by adding things to the JS API. But ultimately, as an optimization, I don’t think this is something we should preemptively add without compelling data.

aardappel commented 3 years ago

Wrapping my head around why this is necessary, I found it funny that the reason this issue even exists is because we have a range of values that are dedicated to indicating extension in a 16-bit scenario, that do not overlap (have a unique value) regardless of whether we are actually using a 16-bit encoding. Most encodings that provide extension values overlap with the values being encoded, e.g. in UTF-8, values 0x80..0xFF indicate an extension is present, but these are not unique as code points since 0x80..0xFF in code points are actual characters. So this question never comes up. So UTF-16 would have been better off with an encoding that overlaps, but I guess that would have made it less easy for software to ignore that UTF-16 is actually variable size..

dcodeIO commented 3 years ago

Interesting thought :) Makes me wonder in turn if the Unicode standard could reasonable settle this decade long issue by adopting what WTF-8 does, and perhaps, say, assigning the visual representation � to (lone) surrogate code points as well. Then it would overlap, roundtrip both ways, that is if I am not missing something, but most notably render future discussions for or against obsolete. Perhaps not labeling it "potentially ill-formed" but "relaxed", "lenient" or "practical" would have helped the situation as well, heh. Would require, however, that UTF-8 implementations need to merge previously split surrogates upon concatenation, and yeah, that would be a considerably large nudge. Too large, most likely.

dcodeIO commented 3 years ago

Thanks for the thorough writeup, Luke :)

This is a Component Model question

As of our recent CG-05-25 polls, the Interface Types proposal now has the goals and requirements of the Component Model (as presented and summarized).

I have been one of the against votes, so perhaps it makes sense to explain why I voted this way. My impression was that deciding for all the next steps was too early. String semantics/encodings in particular have been hotly discussed in the past without resolution so far, so it felt a little odd to me to tie this question (i.e. concretely propose USVs/UTF-8 as if it was natural) to the component model. I worried that the group has not been sufficiently informed on this particular ingredient before polling, which was my motivation for rushing out my presentation with my concerns the weekend before to provide some background. I recognize that this may not have been your intention, of course, but this was my thought process at the time. I appreciated that you clarified during the meeting that we are not actually polling on strings, but now I must admit that I am a little confused as the presence of the component model is used as an argument.

I also voted neutral on the general direction of the component model, because I do not see how it helps the more Web-focused use cases and anticipations I am seeing. Now I was not against it (if others want components I would be fine with it), but if it turns out that its existence is used to justify harming other straight-forward use cases (say where one's component is basically the combination of Wasm + JavaScript), I would decide differently in the future when similar questions arise.

Concretely, this means we’re explicitly concerned with cross-language/toolchain composition, virtualizability and embeddability, which means we’re very much concerned with whether interfaces using string will be consumable and implementable by a wide variety of languages and hosts with robust, portable behavior.

Here, I believe the "robust" part goes both ways:

If we pick "list of Unicode Scalar Values" (subset, UTF-8/16) we are going to randomly break some languages and their JS interop
if we pick "list of Unicode Code Points except surrogate pairs" (superset, WTF-16/8) we are not hard breaking anyone

I understand of course that

Reason 1: many languages have no good way to consume surrogates

but I think this is a rather weak argument compared to making occasional breakage the default for many languages on the contrary. I, and perhaps others as well, would prefer a component that works 100% of the time with an additional check over a component that works just 99.9% of the time while risking anything from annoyances to hazards otherwise. A fuzzer for instance will find this reliably, and so will millions of unintentionally fuzzing users who are not necessarily aware of all the ins and outs of string encodings.

Also

Reason 2: strings will often need to be serialized over standardized protocols and media formats, which usually disallow surrogates

may be true, but is in my opinion not a very compelling precedent for what should happen in between two function calls. The more modular Wasm becomes, the harder it will become to tell where a function lives respectively if a string argument/return crosses a boundary or not. Plus, what may work today may stop working in the future, and we are certainly on a trajectory towards more breakage, not less.

As such I do not think that basing

Reason 3: even the WTF-16 languages will have a bad time if they actually try to pass surrogates across a component boundary

on the above two reasons is very meaningful. The typical case for these languages will most likely be to interface with modules written in the same language or with JavaScript. Perhaps also WASI here or there, but WASI mostly consumes strings, say as file system paths, which are fine to be "not found", while otherwise returning either raw bytes or well-formed strings anyhow. As such I would question the practical value of this reason.

And of course this does not only apply to

What about binary data in strings?

in that some languages, by string API design, make it overly easy to accidentally split a surrogate pair into half (can be as easy as a substring(0, 1), sometimes yielding something akin to "binary data"), especially when a user is unaware of the underlying string encoding details. And the creators of these languages didn't deliberately decide to make it that easy, but it's merely a side-effect of them valuing backwards-compatibility even more when they upgraded from UCS-2 to UTF-16 with the Unicode Standard not leaving them a better choice. So I think we should do the same, and value backwards-compatibility over Unicode hygiene, which may seem useful in theory but in practice often is the opposite, so we can serve these languages well.

If these use cases were found and found to be on performance sensitive paths in real workloads on the Web, then it seems like a Web-specific solution would be appropriate

I can appreciate the direction here of at least keeping the door open to solve it in the Web embedding, but I would once more want to emphasize that this is not a problem exclusive to the Web embedding. I do not see, for instance, how a toolchain would decide whether to utilize a different type or just use a string, as it may not have the necessary knowledge upfront and cannot statically evaluate the contents of every string. Also, in many cases, a simple module will just be the entire component. This is also especially problematic in code migration, where I totally expect JavaScript modules becoming replaced with WebAssembly modules over time. In the current form, this can break for any language, while otherwise it would only happen for some of them, and likely not in the typical case.

On a more general note, it was obvious to me since the beginning that higher level languages like Java will have a better stab at seamlessly integrating with the Web platform, not only because some of them share a string encoding with JavaScript, but also because they already have matching concepts of, say, strings being references, that in the ideal case can be shared with JavaScript, say with GC where everything lives in a common heap anyway. The component model with its restrictions, on the other hand, seems as if it is on a different trajectory that may in the future even influence other proposals as a kind of precedent, and this makes me really sad because I always hoped we could embrace the sheer potential of a future where JavaScript and WebAssembly become one ("component").

Lastly, in the presence of a proper escape hatch for affected languages, I would be fine with a default string type that leads us into a well-formed feature. But without a mechanism that reliably prevents occasional breakage, I fear we are about to get into real trouble (i.e. in the worst case CVEs on or in combination with the IT MVP for applications that worked perfectly fine including when transpiled to JS but not anymore when compiled to Wasm), and would hence strongly prefer "list of Unicode Code Points except surrogate pairs" (WTF-*) over the much less compelling argument of Unicode hygiene.

lukewagner commented 3 years ago

I appreciated that you clarified during the meeting that we are not actually polling on strings, but now I must admit that I am a little confused as the presence of the component model is used as an argument.

I don't think it's possible to decide an issue like this one in the absence of an agreed-upon set of goals, use cases and requirements, which is what "the component model" refers to. Now that we have strong CG agreement on this scope, it gives us the appropriate context in which to discuss this question. I don't think there's an alternative approach to hard questions like this.

I also voted neutral on the general direction of the component model, because I do not see how it helps the more Web-focused use cases and anticipations I am seeing.

It's totally reasonable not to be particularly motivated by the component model -- it's not expected to address 100% of use cases or be a universal answer to all interoperability questions; that's why we've adopted a layered approach. However, it's clear that there are many other folks who are strongly motivated by these goals, so I don't think we can simply set aside the component model in this discussion. For a different set of goals, a different proposal is appropriate.

If we pick "list of Unicode Scalar Values" (subset, UTF-8/16) we are going to randomly break some languages and their JS interop

It won't be random, as it will happen quite regularly and independent of the context in which the component is embedded. In contrast, allowing surrogates to cross component boundaries would lead to the random (from the perspective of an individual component) failures. This is the crux of the matter when you consider the full goals of the component model. I agree that if you're restricting your set of goals to focus more exclusively on JS and Web this is less of a concern, but that's not the context of this layered proposal.

I want to reemphasize another point which is: like wasm, the component model is not a one-shot standard that has to have everything from day 1. Like wasm, our goal is to start with an MVP and iterate based on real-world experience. Thus, the question to ask isn't: "do there exist any use cases for passing surrogates?" but, rather, "will the initial release of the component model not be viable without the ability to pass surrogates between components?". The data we have here is the years of experience of folks working on Web standards suggests that surrogates are not necessary, which is supported by all the standards evolution described above.

dcodeIO commented 3 years ago

Given Murphy's law, what you are proposing seems unnecessarily risky. The real-world experience we want to obtain is to some extent (rare) breakage, and that doesn't make it a proper foundation to build a house on for me. I would once more like to remind of the claim that post-MVP "is just an optimization", which is what we voted on, but as far as I am concerned remains unproven, plus not everyone in the group may have been properly informed about when placing their vote.

As such I would argue that we are better off when starting with the more inclusive string semantics that are true to the just-an-optimization aspect and gives us the foundation we need to iterate in the future. This adds three more options at the end of the day:

restricting string semantics post-MVP when we can actually be 100% sure
accounting for this case in well-formed languages' standard libraries (merge lead/trail surrogates upon concat)
adding an ideal copy-only WTF16String to well-formed languages' standard libraries or ecosystems (i.e. opt-in when interfacing with JS)

Pre-existing "experience of folks working on Web standards" doesn't convince me at least, especially because we are going to compile a lot of stuff to the Web that hasn't been there before.

Btw, it could be as integrated as adding an option like list.lower_canon sanitize=true/false while even retaining the option of well-formedness. I would be fine with that, as I consider well-formed use cases important as well, but I would not be OK with leaving the more inclusive use case out. Add UTF-16/32/Latin-1 to the mix one day and we are in a position that we may not even need adapter functions (for strings) anymore. Respective MVP requirement could be: Support Unicode-like string encodings observed in the wild without the need for adapter functions. That's inclusive, that's neutral, that's minimum viable imo. And, of course, I would finally shut up. Even better, I would be fully on board :)

lukewagner commented 3 years ago

Today, whenever JS talks to the outside world (through HTTP APIs, JSON, gRPC, etc), surrogates are replaced with replacement characters and noone considers this data loss because that's simply the expectation when talking to the outside world. As shown in this slide, the component model is not meant to take the place of language-specific modules/packages/libraries, but, rather, encapsulate linked collections of these (making components more like lightweight processes). Thus, the component model is explicitly adopting the same "talking to the outside world" model where surrogates are not expected.

If we are really worried about silent breakage, though, the fix would be to have surrogates trap instead of producing replacement characters, so the errors could be caught early and fixed easily. I'm open to discussing that more.

(Also, just to clarify, "post-MVP" doesn't refer to a singular follow-up proposal (adding adapter functions), but rather a long sequence of feature proposals over time, the same as with core wasm. Thus, post-MVP is not restricted to only being for optimization by any means.)

dcodeIO commented 3 years ago

I am still not convinced that these APIs, that require well-formedness under the hood by definition of being HTTP APIs, ~~JSON~~ or gRPC, set a compelling precedent for what shall happen in between function calls of modules potentially written in the same language, or when interfacing with JS. As far as I am concerned we are comparing apples and oranges here, since a bunch of function calls that do not sanitize (as it would risk to corrupt half-way) typically happen prior to hitting say the HTTP API that does sanitize because it has a very good reason to. Anticipating that every Wasm module utilizing IT is akin to these APIs seems like quite a stretch.

The theoretical other extreme would be to consider that every string API call should sanitize, which would break languages straight away. We are somewhere in the middle, and given that there are even more APIs in JS for instance (that can be considered modules), and that these deliberately do not sanitize since there is no reason to risk that, I would say we are much closer to function calls here. Unless we want to encourage shipping monoliths only, perhaps, but I am not sure that's a goal :)

(Btw, I'd absolutely prefer replacement over erroring for separate reasons if my viewpoint cannot find consensus. Appreciate UTF-16/Latin-1 being considered.)

lukewagner commented 3 years ago

I think there is plenty of room in the wasm ecosystem for new ABIs specifically designed for allowing closely-related languages to integrate more tightly than the component model allows. This is already the case with the existing tooling-conventions ABI which is the basis for C/C++/Rust(/FORTRAN?) to link together and pass pointers to memory and functions back and forth. I can imagine another totally different ABI designed specifically around native-JS integration that could be more like what you want. But for the component model, the virtualization goals imply that a component should never assume it knows the language (and whether its wasm or native) of its imports nor of the caller of its exports.

dcodeIO commented 3 years ago

Do you think this could be designed in a way that it becomes composable? Say, either use just Interface Types to achieve something lightweight as I am envisioning, or Component Model over Interface Types for stronger guarantees? Like, so far it really only differs in string well-formedness as far as I am concerned.

One way to achieve this perhaps could be if JS could participate in a Component (achieving Wasm + JS inside) as if it were just one of many modules, but outside of the component we'd enforce stricter guarantees that are useful in more complex scenarios.

dcodeIO commented 3 years ago

Some use cases I have in mind are:

Use APIs that give you unpaired surrogates with WebAssembly. This appears to be true for keyboard events in browsers currently, and there may be many more anomalies like this in the various languages that we want to compile to Wasm.
Compile C#/Java/JS-like string manipulation routines to WebAssembly and use them as a module, either from within JavaScript or from other WebAssembly modules. Would currently sanitize ephemeral isolated surrogates early and make it unfeasible. One more concrete use case here is to provide a JSString for interop purposes that can be used in Non-Web environments.
Compile a StringBuilder or similar string utility to WebAssembly and use it as a module. Similar to string manipulation routines, this would currently prematurely sanitize and be unfeasible.
Compile string encoders or decoders to WebAssembly and use them as a module, either from within JavaScript or other WebAssembly modules. Would currently be unfeasible for Unicode-like encodings because of mandatory lossiness.

On the other hand, there is one fun use case being possible with the USV restriction: wasm-string-sanitize, a practically zero code string sanitizer that works universally across every environment supporting Interface Types. Not saying that someone should build that, as it would lead the whole thing ad absurdum, but perhaps good to have in mind that someone could indeed build this.

kripken commented 3 years ago

I apologize if this is a digression (but maybe it can help?). It seems to me that there are really two use cases here:

For the Web interop use case, it seems natural to me to use an externref to a JS string anyhow. As @lukewagner suggested in the past, pre-importing of JS methods for string operations could make that fast, if a wasm module needs those methods. That is, this path would not use interface types. Looking at it from another angle, interface types works under the assumption of shared-nothing, so the normal thing is to copy a string when crossing the boundary. But for JS interop we want to use actual JS strings without copying - like languages compiled to JS do today. For wasm to really fit into that JS ecosystem of languages, it needs to use externref.
For the wasm components use case, I agree UTF8 is the cleaner solution. Yes, that means some amount of copying and conversions for some languages, but that's expected in the shared-nothing model, so we may as well use the best encoding, UTF8. Perhaps the type could even be renamed from string to utf8 to make that explicit?

(Btw, note that wasm on the web would use option 2 when it wants to use interface types to communicate with wasm components; point 1 is just about interop with JS.)

I think there is no perfect way to have optimal web interop as well as optimal UTF usage at the same time / with the same code. We will have the downside of people compiling differently when JS interop is their preference, but that will be the same with a wasm VM embedded on the JVM or CLR I believe, where, again, using the native string may be important for speed and compatibility.

So in practice we may see these two patterns of usage, externref for "host strings" and IT strings for "generic string type among wasm components". It would have been really nice if we could have just one pattern, of course. But wasm is a low-level assembly, and overlapping solutions can make sense to get optimal behavior on different use cases.

dcodeIO commented 3 years ago

Sadly, option 1 is not feasible yet, and probably won't be for a long time. One pain point for example is that modules typically ship with static string data, and without Interface Types, and without GC, and with neither of them considering WTF-16 a proper encoding, including no ETA on Type Imports (or knowing what these actually can do to init strings), I am not seeing anything of significance happening in the near future. This is all connected at the end of the day, and simply choosing UTF-8 now because it is "the best" encoding, even though I presented reasonable concerns and the solution to it is rather trivial, would be something I would not understand. Given how important strings are, this sentiment has the potential to ruin it for some languages, including AssemblyScript, from double re-encoding to occasional breakage and whatnot, and if that's what the open standard chooses, even though it asked me for my first-hand implementer feedback, then I really don't know what I am doing here anymore.

kripken commented 3 years ago

Sadly, option 1 is not feasible yet, and probably won't be for a long time.

I don't follow your example with static string data. Can you not use a TextDecoder to generate a string from them, then store that in an externref? Can even do it all from wasm with the proper imports and use of Reflect.

Overall I believe option 1 works today (in browsers with reference types), just without inlining it is somewhat slow if you do operations on the string.

If you mean it is not feasible without inlining, then I think a simple inlining proposal - which is just a performance hint, parallel to branch-hinting - could in fact ship before Interface Types. (Not that the order really matters, but just to respond to your concern.) I may work on this myself, in fact, given the multiple use cases that have come up around it.

I'm not ignoring the rest of your comment, but I don't have the unicode expertise to comment on the UTF-8/WTF-16 details here. My point is that, regardless of that debate, if I were compiling a language to wasm GC with the goal of JS interop, then I would use externref for strings. (Perhaps Emscripten will do so too, especially once inlining happens, although wasm GC is in a much better position to benefit from this than linear memory languages...)

dcodeIO commented 3 years ago

I don't follow your example with static string data. Can you not use a TextDecoder to generate a string from them, then store that in an externref?

Sadly no, TextDecoder does not support JavaScript string encoding. Also, we would have to do that during instantiation somehow, which will fail if memory is not imported respectively no _start is used. And even if we would decide to do that, there is a hefty performance hit attached to that, in that we end up with redundant casts from externref on every string operation, including overhead from crossing the boundary each time, plus we cannot even properly import String instance methods without a lot of glue code.

On the other hand, if we could just pass strings over Interface Types boundaries without double re-encoding and potential breakage, we'd be much better off, also because we can then keep specializing our string methods for the static type system we are using. And all it takes to achieve that is to make well-formedness the default, while giving us the option to lower a string without enforcing surrogate replacement. It's that simple :(

lukewagner commented 3 years ago

If timeliness is the concern, my expectation, which I included in the CG pres, is that browser native implementations will trail significantly behind actual use of components and thus, for the first few years, the way components are used on the Web will be via AOT transpilation into core wasm + JS API performed by bundlers (which is the same path ESMs took). So interface types is by no means the "fast path" for solving the JS-specific problems of how to efficiently produce and consume JS strings; proposing additions to the JS API could be both faster and more appropriately scoped.

dcodeIO commented 3 years ago

All I am asking for is that Interface Types considers JavaScript/C#/Java/AssemblyScript string encoding important enough to avoid unnecessary double re-encoding and not to force potential breakage on these languages. If it refuses to account for this concern, we'll enter an era of C, Rust and similar languages being the only ones viable for a long time. I don't know how you think about this, but this has the potential to backfire so very badly that I cannot even describe. And AssemblyScript would probably still find a way around it, say by switching to UTF-8 in very unpleasant ways only because JS-like languages are simply not viable, but still, wow. What's even happening here after I informed you about the problem for 3 1/2 years :(

tlively commented 3 years ago

+1 for IT not being a fast path for anything on the Web. In that same vein, there's plenty of time and appetite to take implementor feedback into account both for the MVP and for follow-on extensions as implementations start to appear. @dcodeIO, I think we all understand that IT will be less appropriate (in terms of correctness, performance, and convenience) for the use cases you are concerned about and that it will be more appropriate for other use cases instead. I don't think it's clear that this will be as catastrophic as you're arguing it will be, though. If it does turn out to cause as much developer pain as you anticipate, we certainly can and should act on that.

In short, I think this conversation would benefit greatly from developer feedback on real prototypes and on an MVP, and I don't anticipate anyone changing their minds without that feedback.

dcodeIO commented 3 years ago

Right, it's neither correct, nor performant, nor convenient, while it is for the first-class club of languages. And JavaScript is not one of them. To me that is catastrophic on many levels :(

At this point, though, I do not know what to say anymore. Thank you for taking the time to discuss this with me :)

kripken commented 3 years ago

@dcodeIO

Sadly no, TextDecoder does not support JavaScript string encoding.

Interesting, and surprising. I see there is a long list of existing encodings (including utf-16) so I don't see why they would object to adding JS encoding. Might be worth starting a discussion, or looking for an existing one, but possibly you already have.

MaxGraey commented 3 years ago

Interesting, and surprising. I see there is a long list of existing encodings (including utf-16)

Only for TextDecoder while TextEncoder supports only UTF8 by default

trusktr commented 3 years ago

AssemblyScript is catching up in popularity (3rd place) among languages targeting WebAssembly for one main reason: the similarity it has with JavaScript, the most widely-used programming language in the world (plus AS is done very well!). This is a huge boon for JavaScript web developers:

(source)

Simple interop will be a great thing to have for the JavaScript-WebAssembly use case, which is definitely growing thanks to AssemblyScript, and if it keeps going at this pace, it is going to pass Rust and C++.

C programmers can write string encoders with their eyes closed; I'd bet it is better to make the string stuff easy for the JavaScript devs right from the get go instead. :blush:

I'm nowhere near an expert on this topic, but that's the feeling I get from reading this thread.

PiotrSikora commented 3 years ago

If we are really worried about silent breakage, though, the fix would be to have surrogates trap instead of producing replacement characters, so the errors could be caught early and fixed easily. I'm open to discussing that more.

If we want to consider string as a list of USVs, then I believe that we have to trap early to avoid silent corruptions and possible breakages when other components are suddenly replaced by a more strict alternatives, which might be hard to troubleshoot with module linking.

hsivonen commented 3 years ago

Use APIs that give you unpaired surrogates with WebAssembly. This appears to be true for keyboard events in browsers currently, and there may be many more anomalies like this in the various languages that we want to compile to Wasm.

This is a fixable bug in how Chromium and Gecko interface with Windows. (They should switch from pre-XP native events to XP-or-later native events.) It's not fundamental to the Web Platform given that Trident/EdgeHTML didn't have this problem on Windows and engines on other operating systems don't need to replicate this oddity.

It would be really bad to design for this Web engine Windows integration bug.

Notably, there is no use case for holding onto the ephemeral DOMString that ends with an unpaired surrogate high surrogate: The DOMString that has a proper USV interpretation will be available in the next moment anyway.

Compile C#/Java/JS-like string manipulation routines to WebAssembly and use them as a module, either from within JavaScript or from other WebAssembly modules. Would currently sanitize ephemeral isolated surrogates early and make it unfeasible. One more concrete use case here is to provide a JSString for interop purposes that can be used in Non-Web environments.

Compile a StringBuilder or similar string utility to WebAssembly and use it as a module. Similar to string manipulation routines, this would currently prematurely sanitize and be unfeasible.

Regardless of the supported value space, as long as the representiation in Wasm memory is either UTF-8 or WTF-8, this seems really inefficient. Realistically, if there were C#/Java-like string manipulation routines operating on this level of granularity and worthwhile exposing to JS, chances are that it would make sense to expose them in a manner that explicitly exposed the 16-bit-code-unit representation.

Also, this use case is for a family of languages that have the same string value space among themselves and isn't an appropriate motivation for interface types for general cross-language interop.

Compile string encoders or decoders to WebAssembly and use them as a module, either from within JavaScript or other WebAssembly modules. Would currently be unfeasible for Unicode-like encodings because of mandatory lossiness.

The UTFs themselves make the loss of unpaired surrogates mandatory, so "Unicode-like" here would have to mean "wobbly" for this argument to make sense, which would make this use case tautological: requiring wobbliness in order to support wobbly encodings, but that's not a use case that would explain why a wobbly encoding would be needed.

dcodeIO commented 3 years ago

For those who haven't seen it yet, we are about to "Poll for maintaining single list-of-USV string type" on August 3rd (i.e. void my concerns / no support for DOMStrings). Also fyi, I have been informed by the chair that the follow-up IT-specific meeting suggested at the end of the previous discussion slot has since been cancelled due to "reluctance to spend more time discussing this than already has been done" by the relevant folks.

I would appreciate if we could at least talk about my suggested solutions first and establish a definitive commitment for UTF-16 support in the canonical ABI before polling, as I think that would lead to a more constructive outcome than what is being proposed currently. IMO it is too early to decide the USV question without.

dcodeIO commented 3 years ago

In particular I'd like to talk about "Integrated W/UTF-any" as an alternative to single list-of-USVs.

Lift "list of Unicode Code Points amending surrogate pairs" but lower "list of Unicode Scalar Values".

Add an optional passthrough option when lowering.

It largely preserves what is proposed here as its default (except conceptionally lifting "List of Unicode Code Points amending surrogate pairs"), but simplifies matters for users and toolchains in that all of the following questions do not have to be asked and answered (by WTF-16 languages in particular) due to not having to statically determine what kind of mechanism to use:

Is the API we are calling a Web API so we can use the Web embedding mechanism?
Is the API we are calling a WASI API so we need to re-encode / resort to string?
Is the API we are calling a compatible API, say same language, C# or Java, so we can utilize list u16?
Is the API we are calling a well-formed API so we need to re-encode / resort to string?
Are we the single caller of the API (in the final module graph) so we can use a single mechanism?
If used in multiple ways, can we duplicate the API gracefully, say to an internal and external variant? Can we update all callers?
What do we do if we have multiple dynamic internal representations that map to either mechanism, but not statically to one?
How do we reliably know all of the above? Annotate in the source language? Ship separate linking meta data?

On the contrary, with "Integrated W/UTF-any", a consumer can categorically set the passthrough option accordingly and use the same ecosystem-wide string type everywhere, from the Web embedding to separate compilation to linking with known or unknown modules. A language like Rust would omit passsthrough (rely on USVString-to-DOMString conversion which is unavoidable), while a language like AssemblyScript would set passthrough (preserve DOMString where possible), and both can interface with each other with zero knowledge of the other, while reliably mitigating potential breakage where it is possible.

Wouldn't that be generally preferable in the current and future landscape of languages we want to support?

lukewagner commented 3 years ago

I believe that, when you ask what is the net effect of such a design, where you have an ecosystem of components with some setting passthrough, some not, some producing/consuming surrogates, some not, you end up with all the same problems outlined in the original comment. I think it's essentially equivalent to adding surrogates to string.

dcodeIO commented 3 years ago

It is unavoidable I think that there will be challenges in between languages utilizing different semantics, that's unfortunate but also expected in a cross-language standard. But if we can avoid a bunch of (potentially statically undecidable) problems, as outlined above, for or in between languages utilizing the same semantics as JavaScript, why wouldn't we when list-of-USVs languages remain unaffected?

lukewagner commented 3 years ago

I don't think it's unavoidable or expected; rather it's the norm when communicating between unknown parties using standard formats/protocols (JSON, Protobufs, etc), which is the assumption to which the component model is scoped.

trusktr commented 3 years ago

I believe that, when you ask what is the net effect of such a design, where you have an ecosystem of components with some setting passthrough, some not, some producing/consuming surrogates, some not, you end up with all the same problems outlined in the original comment. I think it's essentially equivalent to adding surrogates to string.

Although the problem moves around, it seems this passthrough idea at least provides benefits for languages using the same data structure (f.e. AS -> JS -> Java). It offers performance benefits to all groups of languages (each group using a particular string format), not just one particular group.

dcodeIO commented 3 years ago

This is not a file format or networking question, Luke, I thought we'd been over that since you brought up HTTP APIs, or Dan brought up that "nearly 100% of Web content is UTF-8". It's really the same not particularly convincing argument. Say, if you were designing a Component Model for something like C#, what you are proposing here makes little sense. People would shake their heads if you'd bring up ~~JSON~~ or Protocol Buffers as an argument there. In fact, right now your argument only works if one side of the equation is always, say, Rust, which I think we all can agree is a very narrow perspective on this, since it leads to anything from annoyances to hazards once you have, say, C# and C#. We really have two semantics in the wild here, wobbly or not, and only supporting the most restrictive of them is not viable in a cross-language standard, because it potentially breaks half of the ecosystem even if it only interfaces with itself, which is likely the common case.

But to take a step back, what do you think about my technical respectively design concerns above btw, I think you haven't addressed these yet.

conrad-watt commented 3 years ago

In fact, right now your argument only works if one side of the equation is always, say, Rust, which I think we all can agree is a very narrow perspective on this, since it leads to anything from annoyances to hazards once you have, say, C# and C#.

Given @lukewagner has expressed interest in supporting UTF-16, I don't think it's fair to say that the USV model is purely Rust-driven. That said, I hope the planned CG vote is a first step towards focussing the discussion on UTF-16 support as a solution to our current debate.

We're focussing on the argument about isolated surrogates a lot here, but as you said in the previous meeting, a significant part of your discomfort with the current IT MVP comes from the current double-encoding at the component boundary, independent of whether isolated surrogates are sanitised.

We really have two semantics in the wild here, wobbly or not, and only supporting the most restrictive of them is not viable in a cross-language standard, because it potentially breaks half of the ecosystem even if it only interfaces with itself, which is likely the common case.

I don't think this is the common case - I'd argue that the component model is primarily useful when a language is not interfacing with itself (or at least, does not know whether it is interfacing with itself). Otherwise, one has to ask what the value is in using components as opposed to just linking/composing the underlying code using regular mechanisms.

It makes sense as a general principle to choose a semantics/encoding that does not rely on assumptions about the internal implementation details of the component. In general, even with something like passthrough, a component outputting a string has no control over whether the component at the other end will sanitise isolated surrogates.

dcodeIO commented 3 years ago

We can talk about this more broadly if you prefer, but I would still be very interested in your and Luke's perspective on the potentially unresolvable challenges for WTF-16 languages in particular, that I think I reasonably outlined above. These challenges do not apply to UTF-8 languages, but are very relevant for those who cannot just use the string type. Or, say you have multiple internal encodings, which mechanism are you going to pick (Web mechanism, string, list u8, list u16, list i32 as an escape hatch?), and how are you going to decide what to duplicate, what to amend? Can you even update your callers? This is a non-issue for those who can just use string, but that doesn't make it less important, so I think there is a strong motivation to aid others as well? I mean, even post-MVP adapter functions won't help there if the char type is already broken beyond repair by means of mandating the most restrictive semantics of them all in what is supposed to serve many languages. Right now this really only serves a select few while not only imposing potential hazards on others, but also imposing potentially undecidable challenges upon them.

If we are not talking about that but instead continue to distract from it, then what is this all good for? I mean, sure, spec work can be political (perhaps it shouldn't be so much), but I think actual reasonable technical concerns should at least be talked about? Just look at the answers I have received once again on that, is this fair?

Speaking of fairness, perhaps you have forgotten, but a year ago this and this was already acknowledged and we broadly agreed upon that, everyone was happy, the AS community relieved, just so that the Interface Types Explainer was then changed a couple months later (even UTF-16 is gone) towards the complete opposite and still states for example this nonsense:

While the canonical representation of all the numeric types are obvious, due to their fixed-power-of-2 sizes, char requires the proposal to choose an arbitrary character encoding [instead of the less arbitrary one]. To match the core wasm spec's [file format] choice of UTF-8, and the more general [but completely irrelevant] trend of "UTF-8 Everywhere" [whose authors openly state that they hate JavaScript but Luke thought originated in Web standards], this proposal also chooses UTF-8 [just so]. By choosing a canonical string encoding happy path [for some] while providing a [totally not] graceful fallback to [totally not] efficient transcoding, Interface Types [that once was envisioned as a WebIDL-like mechanism] provides a gentle pressure [which is totally not its business] to eventually converge [to Rust semantics] without performance cliffs in the meantime [which is untrue, but let's pretend].

And now we are a step further than that and looking at a scope change towards the Component Model, that carefully removes my concerns from scope. So please excuse if I am very careful to see UTF-16 as a given just yet, without a definitive commitment, also because the justification Luke provided there that is totally not an acknowledgment of my concerns has already been attacked. If anything, experience has shown that the amount of backstabbing in the Wasm spec, especially when WG members are involved, is just out of this world, so it would be much more helpful to do as I suggested during the discussion and have a vote on UTF-16 support as a basis we can agree on, or to start an actual Interface Types subgroup so non-WASI voices can be heard (I haven't received an ongoing invite as per my request to their meeting, which surprised me given how the process is supposed to work and that IT seems to be largely motivated by WASI) instead of scheduling just another potentially fatal vote on top of the faint hope being induced that this will just remain a potential hazard justifiable with Unicode hygiene and other totally questionable arguments, but at least not inefficient or inconvenient for the other half of the ecosystem.

Do you like that more, or are we going to talk about my actual technical and design concerns and their implications on non-UTF-8 languages now and refrain from all the nonsense?

conrad-watt commented 3 years ago

We can talk about this more broadly if you prefer, but I would still be very interested in your and Luke's perspective on the potentially unresolvable challenges for WTF-16 languages in particular, that I think I reasonably outlined above

So please excuse if I am very careful to see UTF-16 as a given just yet, without a definitive commitment, also because the justification Luke provided there that is totally not an acknowledgment of my concerns has already been attacked. If anything, experience has shown that the amount of backstabbing in the Wasm spec, especially when WG members are involved, is just out of this world, so it would be much more helpful to do as I suggested during the discussion and have a vote on UTF-16 support as a basis we can agree on

Are you saying that if UTF-16 support were to be committed to, you would no longer consider the issues you've raised about isolated surrogates to be "unresolvable"?

dcodeIO commented 3 years ago

I am saying that I would at least be relieved that we have agreed on a mechanism that accounts for an important part of my concerns, in turn lessening my disagreement. I still believe that we should talk about this issue as well, though, but I think we can do so less strongly then.

conrad-watt commented 3 years ago

Well here is my current perspective. The main thing you want is some UTF-16-style lifting/lowering option, to avoid double encoding at component boundaries.

Because you are concerned that the current interface types MVP will support only UTF-8, you are using the issue of isolated surrogates as an argument against only UTF-8. However, arguing for specifically WTF-16 has the additional effect of interfering with the planned IT-level model of strings as list-of-USV, which as @lukewagner outlined in the OP is not purely tied to a specific choice of string encoding.

If we supported UTF-16, would you be more receptive to the argument that expecting isolated surrogates to be preserved across the component boundary is a hazard (since some components may sanitise, and we want to stay language-agnostic at the component level)? We are spending a lot of time on this argument, but it now seems to me that a significant reason you are holding onto it is because you are worried that conceding will lead to UTF-8 only.

To be explicit about how my personal opinion has evolved, we previously had private conversations where I agreed with you that WTF-16 IT support seemed to be a good idea. However, I do buy Luke's argument in the OP that list-of-USV is the right abstract model for strings at the component boundary. So my current hope is for UTF-16 lifting/lowering support.

dcodeIO commented 3 years ago

Those reading my posts over the last couple of years may be able to confirm that my comments are very rarely political and I am generally stating my honest technical opinion, something not everyone is allowed depending on their company's interests. This remains true here as well, and I think makes me a very useful resource to get this right for all of us.

So, what do you think, shall we vote on UTF-16 first and delay the decision on USVs until an informed decision can be made?

conrad-watt commented 3 years ago

You've suggested above that your concerns about isolated surrogates would be stated less "strongly" if we supported UTF-16 lifting/lowering, even though (well-formed) UTF-16 support is orthogonal to the current issues raised regarding isolated surrogates. This isn't a purely technical argument.

That being said, I would be very happy if we had a vote on commitment to UTF-16 support soon.

EDIT: I appreciate that this conflation may not be deliberate, but some wires are definitely getting crossed here

linclark commented 3 years ago

This is a bit of an aside, but I think it's important to clear up.

or to start an actual Interface Types subgroup so non-WASI voices can be heard (I am not allowed to listen in to their meeting even, which is totally a foul given how the process is supposed to work and that IT is essentially been specced there)

I'm unclear on why you believe you aren't allowed to join the WASI meetings. When you requested an invite before the May 20 meeting about the Canonical ABI’s impact on WASI, I did give you a warning (below), but then invited you to that meeting, and you attended.

Hi Daniel,

Just so you know, Dan won’t be touching on any issues related to string encoding. I know that is an important issue to you, but it is uninteresting in the context of the WASI subgroup’s work—WASI will simply use whatever string encoding the Wasm CG settles on. This meeting’s discussion will focus on resources, handles, and push/pull-buffers, which actually do have interesting ramifications for our work.

I am open to inviting you to this meeting, but I’m going to be frank—I have concerns given your previous behavior. In particular, I’m concerned with how you've responded to some of the requests from community members (including your response to one of the CG chairs themselves) that you discuss technical issues in more productive and less personally antagonistic ways.

It is important to me as co-chair of the WASI subgroup that we act in accordance with the CoC and keep things productive, and I will take an active hand in making sure that is the case.

Before I invite you, I need to know that if I ask you to modify your behavior during the course of the meeting (e.g. if I send a private message saying “make room for others in the conversation” or “don’t use hyperbole“) that you will indeed modify your behavior accordingly.

If you feel you can commit to that, then I can add you to the event when I’m back at work on Monday, at which point it will show up on your calendar.

Sincerely, Lin

Additionally, as I made extremely clear in that message, as Dan reiterated in the meeting, and as Dan has repeatedly pointed out in the issue queue, the WASI subgroup is not discussing string encodings (which you can verify by looking at the meeting notes). I'm not sure how we can make it more clear that string encoding (and more generally, any specification of IT) is outside the scope of the WASI subgroup.

dcodeIO commented 3 years ago

Quoting from my message to you:

Subject: Attending WebAssembly WASI subgroup meetings

Hello Lin,

I would like to listen in to the WASI subgroup meetings, ...

I have been fouled so many times by now that I am very careful, and after inviting me to just one meeting instead, I noticed that you blocked me on Twitter (I unconditionally respect that), that if I would have contacted you again I would have risked to be called out for circumventing that. So I didn't. Is this all a misunderstanding maybe?

For completeness, here is my response to your message above:

Hello Lin,

I can commit to that, and want to contribute to the best of my ability in accordance with the CG's processes and values. I do not plan to bring up any points during the discussion, unless asked to do so of course, but after having a call with Luke I felt that in the past I was missing some of the context that would have helped me to understand certain aspects better. The discussion on "Impact of the canonical ABI proposal on WASI" may also be relevant for a presentation I am currently preparing, so I figured now may be a good time to ask for an invite, i.e. just in case I have a question.

Respectfully, Daniel

Other than that, I and others are of the impression that WASI is motivating Interface Types a lot, perhaps more than is healthy for a Web standard, especially now with the scope change towards the Component Model, which is why I am interested in listening in to the topics being discussed at WASI, be it only to reduce future friction. Establishing an Interface Types subgroup would probably be preferable, and I in fact suggested that in the last discussion slot on the topic, but it wasn't acted upon and instead a potentially ending vote was scheduled. Perhaps that's also a misunderstanding?

linclark commented 3 years ago

I blocked you on my personal Twitter account. That does not reflect the intentions of the WASI subgroup, nor does it impact how I’ve treated you in the context of the CG’s work.

If your concern was that you’d made me uncomfortable, it’s not clear to me why you think calling my actions “totally a foul” in a heated thread would be a better response than emailing me to request an ongoing invite.

ttraenkler commented 3 years ago

I agree a written public exchange is unlikely to clear up the underlying misunderstandings and trust issue. Since I can relate to the motivations and feelings on either side and believe in the common vision, I feel motivated to offer to help mediating and translating those if both sides put trust in me. I assume good intentions and a communication and trust issue at heart so if welcome I would offer to mediate in a call and getting back to being comfortable for everyone.

lukewagner commented 3 years ago

I agree with @conrad-watt that it's important to decouple the question of string semantics, which is the subject of this thread, from the question of string encodings, which is the subject of #136 and purely concerns performance/optimization and not the problems raised in this thread. On the latter, I still very much agree with the proposal in this comment but I didn't want to include UTF-16 in the vote because it's an orthogonal issue and I wanted to focus on what I think is the much higher-order bit. But if these issues are entangled in folks' minds, I could add it to the CG poll if it actually lets us agree on the semantics issue and reach a greater degree of consensus in the short term. But if we're still going to continue arguing about including surrogates in the semantics, then I think there's little benefit in pulling in all the extra technical context about UTF-16.

dcodeIO commented 3 years ago

@linclark I am sorry if I made you feel uncomfortable with my public Twitter posts or otherwise. It's just that I also mentioned the discrepancy with the invite to the CG chair on June 22nd and a high-level contact at your company on June 20th, which left me guessing why it has not been resolved and reminded me of past issues. My original comment was not targeted at you personally, rather at the situation at large that I experience as suboptimal. I see that my phrasing was not helpful and edited my comment.

PiotrSikora commented 3 years ago

@dcodeIO as others have already mentioned, I think you're unnecessarily mixing two separate issues (Unicode encoding and allowing opaque byte strings), and using cross-language interoperability and performance arguments (IMHO, valid arguments for allowing both UTF-8 and UTF-16, as discussed in #136) to push agenda for allowing opaque byte strings, which adds confusion and focuses to the wrong issue.

As for allowing opaque byte strings, why exactly do we need a string type to pass non-string values? If a function wants to pass random bytes, then it should use list u8/u16/u32 (bytes?) instead, otherwise we're polluting the whole ecosystem with ill-formed UTF-{8,16} strings, becaues of a few legacy offenders.

Your argument thus far has been that we're splitting ecosystem in two (well-formed UTF-8 vs ill-formed UTF-16), but if we ignore Unicode encoding for a moment, and focus on opaque byte strings alone, then the split is really only between well-formed Unicode and JavaScript libraries using DOMString instead of ArrayBuffer for passing opaque byte strings.

Is the JavaScript ecosystem really not using ArrayBuffer for that purpose at all? Is there a limited number of libraries that could be fixed so that we don't need to allow ill-formed UTF-{8,16} strings? Basically, could this be fixed at the library-level instead of in Wasm?

Perhaps we could start with string being well-formed Unicode, try to fix offending libraries, and only plan on allowing opaque byte strings iff the ecosystem is unfixable? Relaxing restrictions afterwards is much easier than adding them.

dcodeIO commented 3 years ago

Thank you for your detailed response, appreciate it :)

I think what you describe is exactly the culprit, in that some strings are not valid according to the Unicode standard, but are strings according to the respective language standards. I am not sure if the distinction between valid and not-so-valid, or byte string and text, non-string and string, or arbitrarily restricted and idiomatic when looking at it from the opposite perspective, is making much of a difference. That's mostly language as far as I am concerned but ultimately dilutes the technical challenge at hand. For instance, what JS, C#, Java and others use could also be named "relaxed", "lenient" or "practical" not "ill-formed", but that's just words.

My argument is exactly to not split the ecosystem in two by making a choice that only serves half of the ecosystem while putting the other half at risk, but to reasonably support both perspectives. And I am as sympathic as everyone else to making the Unicode standard's perspective the default, and to steadily educate the world about the concept of Unicode Text respectively list of Unicode Scalar Values. Wasm, however, is not the place for the sledgehammer in my opinion, as it risks tossing a bunch of very popular languages, that cannot be "fixed" without throwing exceptions where there previously were none or replacing their string APIs with new ones ("your bug is someone else's feature"), onto the scrap heap of programming language history. As you mentioned earlier, security for example goes a long way when silent data corruption is in play, as goes fear, uncertainty and doubt in our minds, say when documenting the discrepancy in some languages, but not others.

For AssemblyScript in particular, this may ultimately play out in us switching from a dependency on Interface Types to a dependency on Reference Types, in turn rendering it useless in places where the JS standard library is absent respectively where there is no external GC. And while Reference Types seems like a reasonable choice for a languages like AS at first, it also risks killing the AssemblyScript project as a whole as the tools it has developed in good faith according to Wasm's original vision (and I think proper support for DOMString and excellent interop with the Web always was and still is a very reasonable expectation) would become useless to our most valuable sponsors, like Fastly. As such this is an existential threat to us that I can sadly only break consensus on in the hope that we can ultimately establish a compromise that serves all of us.

ttraenkler commented 3 years ago

The concept of a component is a fundamental language agnostic building block for a polyglot module ecosystem so it of course has to be rock solid and portable and it is only natural that this is a cornerstone of the discussion.

Thinking forward strings crossing the component boundary being modeled as lists of USVs make sense - as a lowest common denominator to communicate between languages without losing information on the side of languages not being able to represent implementation details of a more specific encoding in their native string type.

Apologies if I missed sth, but It was not so clear for me from the presentation given when the component model was introduced so I wonder if we could clarify relations between components and other concepts like adapter modules, modules, interface types and linking in the relevant context for deciding this issue? My hope is this opens a space for a solution that neither breaks the valuable strong guarantees of components nor the semantics of existing APIs when exposing them as module interfaces inside of components only.

To recap, in practice directly exposing existing string APIs that rely on implementation details like lone surrogates at the component boundary is not possible without a subtle breaking change of semantics as lone surrogates cannot be expressed as USV. Fixing this diligently would force a careful review or rewrite of every line of code affected. This will not be possible in the general case. Reference types are also not a portable solution to share string types with lone surrogates between modules since they are host specific, so losing compatibility with hosts not having these types. Lone surrogates might be a rare edge case, but they seem to be a practical issue across system boundaries: https://github.com/WebAssembly/wasi-filesystem/issues/17 https://github.com/rustwasm/wasm-bindgen/issues/1348

While on the component boundary it is indeed important to be language agnostic, inside of components as mentioned in context of shared everything linking language specific encodings will be unavoidable. The rules for string encoding at a component boundary are strict for a good reason.

I wonder though if interface types could be defined not only on the component boundary, but also inside of a component between modules where you do not expect to cross a system boundary and thus system specific string encodings would have their natural place. Would this not solve the lone surrogate case to place modules needing to interface with those inside of the same component? If strings with lone surrogates could be expressed as interface types between modules but inside of components, they would not need to cross a component boundary.

This would address backwards compatibility while containing lone surrogates to be an implementation detail of a component. In comparison to making this a compiler toolchain problem, modules could become much smaller and portable and you could leave compiler specific toolchains behind for linking which could be quite attractive even for language specific code or code with shared ABIs. This would keep the component model intact going forward while solving backwards compatibility and still provide a nice linking story for a wasm module ecosystem that does not require a language specific toolchain.

To me this seems like a natural layer between components and modules. Maybe this is what the adapter modules are about?

fitzgen commented 3 years ago

Lone surrogates might be a rare edge case, but they seem to be a practical issue across system boundaries: ... rustwasm/wasm-bindgen#1348

As @hsivonen mentions, that wasm-bindgen issue was really a bug in browsers' IME implementations (Firefox bug, Chrome bug). I agree with what Henri is saying in that issue thread: treating all strings as potentially containing lone surrogates to work around this particular bug that only occurs in some browsers is overkill. That's why we closed that issue without switching the whole wasm-bindgen ecosystem away from UTF-8.

dcodeIO commented 3 years ago

Perhaps the important aspect to learn is that there is clear evidence that things will break, and that it is impossible to know (or to nudge) all the places since we are going to compile many languages and their existing ecosystems to Wasm, or even run pre-existing bytecode on compiled VMs. An ecosystem-wide list-of-USV string would even go a step further and restrict existing languages where these previously were not, creating never seen before surface area for all sorts of hazards that didn't exist prior.

lukewagner commented 3 years ago

@ttraenkler If we're thinking about intra-component interactions, I think the situation is much simpler because we're focused on a particular set of languages following a common ABI and thus the ABI can use shared linear memory or GC memory or even do shared-nothing using Interface Types' (list u8)s or (list u16)s as the mechanism for passing ABI-defined encodings of language-specific data-types. This was the point I hoped to capture in this diagram. A first example of this is the C-style ABI in tool-conventions which allows C, C++ and Rust to share, e.g., function pointers. A new ABI could focus on a different set of languages and thus reasonably specialize to them (e.g., allowing surrogates). But in any case, I don't think this requires adding anything new to Interface Types, since the intra-component ABI just needs raw primitives for sharing or copying memory.

PiotrSikora commented 3 years ago

For instance, what JS, C#, Java and others use could also be named "relaxed", "lenient" or "practical" not "ill-formed", but that's just words.

You keep mentioning JavaScript, Java and C#, but I don't think the latter two are affected as much.

Notably, both Java's String and .NET's string are UTF-16 encoded (I don't know whether this is actually enforced at runtime), both languages are strongly typed, and use byte[] arrays for storing binary data, so I seriously doubt that their ecosystems are using string types for passing binary data in significant numbers.

My argument is exactly to not split the ecosystem in two by making a choice that only serves half of the ecosystem while putting the other half at risk, but to reasonably support both perspectives.

But you're suggesting exactly that, since if we allow Wasm string type to pass opaque byte strings, then all languages that require well-formed UTF-{8,16} and enforce it at runtime will either trap or corrupt the data, and that's a language-wide issue.

On the other hand, if we require Wasm string type to be a well-formed UTF-{8,16}, then only JavaScript libraries that use DOMString to pass opaque byte strings are going to be negatively affected by it.

Since you're most familiar with this issue, could you help us quantify the impact of such decision? What's the number of popular libraries that would be affected by this at the interface layer? If it's a limited number, then surely we could work with authors of those libraries to add support for APIs using ArrayBuffer. Not ideal, but alternatively, we could add a list of well-known functions that should use list u16 instead of string in AssemblyScript and make the conversion there?