WebAssembly / interface-types

STRING i32 pairs for UTF-16LE #13

Closed · dcodeIO closed this issue 3 years ago

dcodeIO commented 6 years ago

Regarding

JavaScript hosts might additionally provide:

STRING | Converts the next two arguments from a pair of i32s to a utf8 string. It treats the first as an address in linear memory of the string bytes, and the second as a length.

Any chance that there'll be support for JS-style strings (UTF-16LE) as well? I know this doesn't really fit into the C/C++ world, but languages approaching things the other way around will most likely benefit when not having to convert back and forth on every host binding call.
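
For reference, a minimal sketch of what the quoted STRING coercion corresponds to when done by hand in JS glue code today (assuming an exported linear memory; the names are placeholders):

```ts
// Hypothetical glue code on a JS host: lift an (i32 ptr, i32 len) pair
// from linear memory into a JS string by decoding it as UTF-8.
declare const memory: WebAssembly.Memory; // assumed: the module's exported memory

const utf8Decoder = new TextDecoder("utf-8");

function liftUtf8String(ptr: number, len: number): string {
  const bytes = new Uint8Array(memory.buffer, ptr, len);
  return utf8Decoder.decode(bytes);
}
```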

dcodeIO commented 5 years ago

Since this hasn't received any comments yet, allow me to bump this: I am still curious if UTF-16 strings can be supported. In AssemblyScript's case all strings are UTF-16LE already, so only having the option to re-encode (potentially twice if the bound API wants UTF-16) does seem like it should be taken into account.

fgmccabe commented 5 years ago

This specific operator has not been discussed. In part that is because we are still at a more 'skeletal' level in the effort.

However, although definitely possible, the bar for coercion operators is going to look something like this:

  a. Is the operator consistent with the requirements of the majority of hosts (i.e., mostly browsers at this point, but with a definite leaning towards non-browser implementations)?
  b. Is the operator consistent with the majority use case for representing (in this case) string values?

If most hosts would require copying UTF-16 into UTF-8 anyway, you may have trouble with (a).

But, essentially, it's a bit early to consider committing to any operators at the moment.


dcodeIO commented 5 years ago

If most hosts would require copying UTF-16 into UTF-8 anyway, you may have trouble with (a).

To me it looks like having such an operator can lead to significantly less work where UTF-16 is already present on both sides of the equation, while any case where either side is UTF-8 can easily be handled by re-encoding conditionally. Hence, the module would choose the operator that fits its internal string layout best, and the host would do whatever is necessary to make it fit into its own. This leaves us with these cases:

  1. Both UTF-8: Essentially memcpy
  2. Both UTF-16: Essentially memcpy
  3. One UTF-8, the other UTF-16: Reencoding once

while avoiding the very unfortunate case of

  • Module UTF-16, host UTF-16: Reencode twice because UTF-8 is all the bindings understand

But, essentially, it's a bit early to consider committing to any operators at the moment.

I see, yet I thought it might make sense to raise this early so that once operators are committed to, this case is well thought through :)
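
A sketch of the host-side dispatch described in the cases above, assuming a hypothetical host that knows both its own preferred encoding and the one declared by the module:

```ts
type Encoding = "utf-8" | "utf-16le"; // hypothetical set of well-known encodings

// If module and host agree, the payload is just copied; otherwise it is
// re-encoded exactly once via an intermediate JS string.
function bridgeString(bytes: Uint8Array, moduleEnc: Encoding, hostEnc: Encoding): Uint8Array {
  if (moduleEnc === hostEnc) return bytes.slice();                // cases 1 and 2: essentially memcpy
  const text = new TextDecoder(moduleEnc).decode(bytes);          // case 3: decode once...
  if (hostEnc === "utf-8") return new TextEncoder().encode(text); // ...and encode once
  const out = new Uint8Array(text.length * 2);                    // manual UTF-16LE encode
  const view = new DataView(out.buffer);
  for (let i = 0; i < text.length; i++) view.setUint16(i * 2, text.charCodeAt(i), true);
  return out;
}
```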

jgravelle-google commented 5 years ago

while avoiding the very unfortunate case of

  • Module UTF-16, host UTF-16: Reencode twice because UTF-8 is all the bindings understand

Yeah that would be unfortunate.

Any chance that there'll be support for JS-style strings (UTF-16LE)

The real question is, can we skip JS entirely? At which point, what does the host API use internally? For example, a similarly bad outcome would arise if the host API's internal representation differs from whatever single encoding the bindings mandate.

So I think the sanest way to handle that is with a declarative API on the bindings layer. Which is what you said earlier:

Hence, the module would choose the operator that fits its internal string layout best, and the host would do whatever is necessary to make it fit into its own.

So the higher-level point is that we should be able to adequately describe the most common/reasonable ways to encode strings, so that we can minimize the number of encodings in the best case.


But, essentially, it's a bit early to consider committing to any operators at the moment.

Agree + disagree. On the one hand, it's all in the sketch stage at the moment, where we're feeling out the rough edges. So from a managing-expectations point of view, this makes sense to say.

On the other hand, it's kind of incongruous to say "everything's up in the air, so don't raise any design issues." I don't think that was the intent, but that's kind of how it sounded. A more accurate translation of how I heard it is "don't worry about this now, we'll figure it out later." To which I would say, as a general principle, that yes, we'll figure it out later, but we should raise it now to figure out whether we should worry about it now. Especially because multiple people can think about different bits of the spec asynchronously.

jgravelle-google commented 5 years ago

Also something I should mention explicitly:

I find it incredibly likely that we will default to 1 binding expression per :snowman:-type per wasm representation (e.g. 1 for linear memory and 1 for gc), which is to say the MVP of :snowman:-bindings will have one binding expression per type, because gc will probably not be shipped yet. On that basis, we will probably start with only UTF-8-encoding (I imagine we will drop the utf8-cstr binding too, for similar reasons).

My general mental model here is that we can always add bindings in the future as we find a need for them. And it may be the case that in practice, the re-encoding from UTF-16 isn't enough of a bottleneck to be worth it. Unless it is, at which point we can add that binding, and it will be more obviously useful because we'll have much more real-world data.


Also for AssemblyScript specifically, would it be reasonable to change the internal string representation from UTF-16 to UTF-8 in the presence of :snowman:-bindings? It is, after all, "Definitely not a TypeScript to WebAssembly compiler" :smile:

dcodeIO commented 5 years ago

And it may be the case that in practice, the re-encoding from UTF-16 isn't enough of a bottleneck to be worth it. Unless it is, at which point we can add that binding, and it will be more obviously useful because we'll have much more real-world data.

At the end of the day we are just building tools here, and one can't know everyone's use case. Any use case extensively calling bound functions with string arguments would hit this, and my expectation is that this will happen anyway (in certain use cases). If we wait, this will surface sooner or later, so it might as well be addressed from the start, instead of having to tell everyone running into it that their use case is currently not well-supported even though we saw it coming. Especially since specifying and implementing new operators can take a long time again.

Also for AssemblyScript specifically, would it be reasonable to change the internal string representation from UTF-16 to UTF-8 in the presence of ⛄️-bindings? It is, after all, "Definitely not a TypeScript to WebAssembly compiler" πŸ˜„

I'm sorry, the "⛄️-bindings" term is new to me. Would you point me in the right direction where I can learn about it? :)

Regarding UTF-8: In fact we have been thinking about this, but it doesn't seem feasible, because we are re-implementing String after the JS API (with other stdlib components relying on it), and going with something other than a UCS-2 representation seems suboptimal since the API is so deeply rooted in the language that mimicking UCS-2 on top of another encoding would cost too much perf-wise. After all we are trying to stay as close to TS as reasonable to make picking up AssemblyScript a smooth experience. I'd also like to note that this isn't exclusively an AssemblyScript thing, as other languages use UTF-16LE as well, like everything in the .NET/Mono space.

jgravelle-google commented 5 years ago

If we wait, this will surface sooner or later, so it might as well be addressed from the start, instead of having to tell everyone running into it that their use case is currently not well-supported even though we saw it coming. Especially since specifying and implementing new operators can take a long time again.

It's ultimately a tradeoff. My thoughts here are that it will be strictly easier to spec and implement a bindings proposal that defines 8 operators, as opposed to one that defines 40. So we could just add UTF-16, but we could also just add C-strings and we could just add Scheme cons-list strings and we could just add Haskell lazy cons thunks, and so on. So for MVP I think we need to be really strict as to what exactly is "minimal", and in this context Minimal means "we can reason about strings at all".

We also need to balance the "viable" portion. Originally I was thinking we should avoid reasoning about strings and allocators at all, due to the complexity they add. Further discussion on this (see: https://github.com/WebAssembly/webidl-bindings/issues/25) made me realize that not having an answer for allocators would compromise the viability of the proposal entirely. On that basis, not having UTF-16 support from day 1 is unlikely to leave the bindings proposal dead in the water.

By means of analogy, I would rather we ship anyref without waiting for the full gc proposal, because anyref on its own is a very enabling feature. It is in many ways suboptimal, but it is more useful than what we had before. On that basis, I want to be very cautious about adding scope to the bindings MVP, especially when that scope is separable into a v2 that describes an expanded set of binding expressions.

I'm sorry, the "snowman-bindings" term is new to me. Would you point me in the right direction where I can learn about it? :)

Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit

tl;dr does this wasm binding layer we're describing need to reason about WebIDL at its core, or is WebIDL another target with a produce/consume pair? If the latter, and we suspect that is the case, then we're free to design an IDL that better matches what we're trying to do, rather than try to retrofit that on top of WebIDL.

Full notes of the accompanying discussion here: https://github.com/WebAssembly/meetings/blob/master/2019/CG-06.md#webidl-bindings-1-2-hrs

Also would like to note that this isn't exclusively an AssemblyScript thing

Didn't mean to sound like I was saying it was :x, sorry. I was thinking that if AssemblyScript was using UTF-16 for easier FFI with JS, then in the presence of something-bindings it would be possible to decouple that ABI. And also that AssemblyScript would probably have an easier time making that ABI switch than a more-ossified target like .NET, on account of it being a younger platform.

dcodeIO commented 5 years ago

My thoughts here are that it will be strictly easier to spec and implement a bindings proposal that defines 8 operators, as opposed to one that defines 40

Makes sense, yeah. Though, to me it seems not overly complex to have a (potentially extensible) immediate operand on str (/ alloc-str) that indicates a well-known encoding. I'd consider UTF-8, UTF-16LE and maybe ASCII here (not sure), with length always provided by the caller (even if null-terminated), but I'm certainly not an expert in this regard.

By means of analogy, I would rather we ship anyref without waiting for the full gc proposal, because anyref on its own is a very enabling feature. It is in many ways suboptimal, but it it is more useful than what we had before.

I totally agree with the anyref mention, but don't entirely agree with the comparison to encodings. anyref is a useful feature on its own with everything else building upon it, while not addressing encoding challenges when introducing the very feature that has to deal with them leaves us with half a feature that unnecessarily limits what certain ecosystems with (imo) perfectly legitimate use cases like UTF-16 can do efficiently.

Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit

Thanks! :)

So, looking at the slides, they mention utf8 exclusively, similar to what we have with WebIDL. I'm not quite sure how that would solve the underlying issue, that is, making a compatible string from raw bytes, if it merely moves the problem from "directly allocating a string compatible with WebIDL bindings" to "creating a DOMString/anyref compatible with ⛄️-bindings" (if I understood this correctly?). For instance, TextEncoder doesn't support UTF-16LE (anymore), but TextDecoder does.
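
To illustrate that asymmetry, a minimal sketch assuming a browser host (the exported memory is a placeholder):

```ts
declare const memory: WebAssembly.Memory; // assumed: the module's exported memory

// Reading UTF-16LE out of linear memory works with the standard decoder:
function liftUtf16(ptr: number, byteLength: number): string {
  return new TextDecoder("utf-16le").decode(new Uint8Array(memory.buffer, ptr, byteLength));
}

// Writing a JS string back as UTF-16LE has no TextEncoder counterpart,
// so it ends up being done code unit by code unit:
function lowerUtf16(text: string, ptr: number): void {
  const view = new DataView(memory.buffer);
  for (let i = 0; i < text.length; i++) view.setUint16(ptr + i * 2, text.charCodeAt(i), true);
}
```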

I'd expect that at some point in either implementation "making a compatible string from raw bytes" will be necessary anyway if the primary string implementation is provided by the module, which is likely. Please correct me if I'm missing something here. Ultimately, the issue doesn't have to be solved in the WebIDL spec; any other spec solving it would be perfectly fine as well, as long as it is solved.

Didn't mean to sound like I was saying it was :x, sorry. I was thinking that if AssemblyScript was using UTF-16 for easier FFI with JS, then in the presence of something-bindings it would be possible to decouple that ABI. And also that AssemblyScript would probably have an easier time making that ABI switch than a more-ossified target like .NET, on account of it being a younger platform.

All good, your point makes perfect sense. Just wanted to emphasize that, even if AssemblyScript made this change, this is a broader problem than it might look like from this issue alone :)

MaxGraey commented 5 years ago

I thought that if WebAssembly implements WebIDL bindings it should follow the WebIDL spec, which supports three types of strings: DOMString, ByteString and USVString. Most of the WebIDL related to Web APIs uses DOMString, which is commonly interpreted as UTF-16 encoded strings [RFC2781]. ByteString is effectively ASCII, and finally there is USVString, which doesn't require a concrete encoding format. An additional note about USVString from the WebIDL spec:

Specifications should only use USVString for APIs that perform text processing and need a string of Unicode scalar values to operate on. Most APIs that use strings should instead be using DOMString, which does not make any interpretations of the code units in the string. When in doubt, use DOMString.

Pauan commented 5 years ago

@dcodeIO I'd consider UTF-8, UTF-16LE and maybe ASCII here (not sure)

UTF-8 was intentionally designed as a strict super-set of ASCII, therefore UTF-8 can be used to efficiently transfer ASCII text.
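
For illustration, encoding an ASCII-only string as UTF-8 yields exactly the ASCII byte values:

```ts
// Every ASCII string is already valid UTF-8, byte for byte.
const bytes = new TextEncoder().encode("hello");
// -> Uint8Array [0x68, 0x65, 0x6c, 0x6c, 0x6f], the same values ASCII would use
```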

dcodeIO commented 5 years ago

UTF-8 was intentionally designed as a strict super-set of ASCII, therefore UTF-8 can be used to efficiently transfer ASCII text.

Yeah, I tried to be careful there (in regard to C-strings), but the more I think about it the less I believe that this distinction is necessary, especially since any API being bound will very likely be reasonably modern anyway. So that'd leave us with UTF-8 and UTF-16LE. Is there anything else you could imagine fitting there in terms of "well-known encodings" (in the context of modern programming languages)?

MaxGraey commented 5 years ago

@Pauan WebIDL (except ByteString) and JavaScript don't use ASCII at all. Strings in JavaScript are represented as UTF-16LE by default, but V8, for example, can represent strings internally in different ways and encodings. For example, during concatenation strings can be represented as a rope structure, which is flattened to a "normal" string before serialization / further conversion or before passing to a Web API. But that doesn't mean we should use a rope structure as the default string structure, for example. The same goes for UTF-8.

dcodeIO commented 5 years ago

Side note: USVString looks like it can be described in terms of UTF-32 (not sure if that makes sense as I don't know anything using it for its internal representation). But maybe the least common denominator is UTF here?

MaxGraey commented 5 years ago

About ByteString in WebIDL

Specifications should only use ByteString for interfacing with protocols that use bytes and strings interchangeably, such as HTTP. In general, strings should be represented with DOMString values, even if it is expected that values of the string will always be in ASCII or some 8 bit character encoding. Sequences or frozen arrays with octet or byte elements, Uint8Array, or Int8Array should be used for holding 8 bit data rather than ByteString.

Pauan commented 5 years ago

@MaxGraey I am aware. The purpose of WebIDL bindings is to allow many different languages to use WebIDL APIs without using JavaScript.

Since each language does things differently, that means there needs to be a way to convert from one type to another type.

That's why there's a UTF-8 -> WebIDL string conversion, to allow for languages like Rust to use WebIDL bindings (since Rust uses UTF-8).

MaxGraey commented 5 years ago

That's why there's a UTF-8 -> WebIDL string conversion, to allow for languages like Rust to use WebIDL bindings (since Rust uses UTF-8).

So every browser which has already implemented WebIDL bindings for JavaScript, and the rest of the languages like C#/Mono, Java, Python and others which are still popular today, should change their internal string representation? I guess all these languages in total are much more popular than Rust, no matter how awesome it is)

MaxGraey commented 5 years ago

I don't mind utf8-str, but I think the proposal should care about utf16le-str as well =)

MaxGraey commented 5 years ago

The webidl-bindings proposal already cares about the pretty special null-terminated strings (utf8‑cstr) that are only common in C/C++. So it already cares about backward compatibility for legacy approaches)

Pauan commented 5 years ago

So every browser which has already implemented WebIDL bindings for JavaScript, and the rest of the languages like C#/Mono, Java, Python and others which are still popular today, should change their internal string representation?

I'm not sure where you got that idea... you seem to be misunderstanding how all of this works. I suggest you read the recent slides, especially slide 29.

The way that it works is that the browser implements WebIDL strings (using whatever representation it wants, just like how it does right now). And then there are various "binding operators" which convert from other string types to/from the WebIDL strings.

So you can have a binding operator which converts from UTF-8 to WebIDL strings, or a binding operator which converts from UTF-16 to WebIDL strings. The browser doesn't need to change its internal string representation, it just needs to implement a simple conversion function.

I'm also not sure why you're bringing up languages like C#/Mono, Java, or Python... they are also implemented in WebAssembly linear memory, and so they need binding operators. The binding operators are not a "Rust-only" thing, they benefit all languages. That's why it's a UTF-8 conversion, so it can be used by all languages which use UTF-8 strings.

dcodeIO commented 5 years ago

I'm also not sure why you're bringing up languages like C#/Mono, Java, or Python... they are also implemented in WebAssembly linear memory, and so they need binding operators. The binding operators are not a "Rust-only" thing, they benefit all languages. That's why it's a UTF-8 conversion, so it can be used by all languages which use UTF-8 strings.

I believe the point he wanted to make is that all those languages use UTF-16LE internally so all of them would face the potential performance penalty this issue is about.

Pauan commented 5 years ago

I believe the point he wanted to make is that all those languages use UTF-16LE internally so all of them would face the potential performance penalty this issue is about.

Okay, but I never spoke about UTF-16 (which I am in favor of).

I only said that languages which use ASCII do not need a special "ASCII binding operator", since they can use UTF-8 instead.

MaxGraey commented 5 years ago

I only said that languages which use ASCII do not need a special "ASCII binding operator", since they can use UTF-8 instead.

Yes, just one note: it's C (probably C++ as well) and it should use utf8-cstr, the null-terminated version of utf8-str: https://github.com/WebAssembly/webidl-bindings/blob/master/proposals/webidl-bindings/Explainer.md#binding-operators-and-expressions

dcodeIO commented 5 years ago

So, to recap my perspective a little here: maybe one way to avoid re-encoding on every host-binding call, and to avoid discriminating against languages following another UTF standard, could be to make the encoding kind an immediate operand of utf-str and alloc-utf-str (dropping the 8), with valid encodings being UTF-8 (& UTF-8-zero-terminated?), UTF-16LE and potentially UTF-32 (USVString <-> USVString fallback?). Based on the pair of (source encoding, target encoding), the host would either preserve the representation if both are equal, or convert to either one depending on what it deems appropriate.

Since those encodings are relatively similar, I'd say that the implementation isn't a significant burden, while solving the issue for most modern programming languages for good.
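
Purely as an illustration of the shape of that idea (not actual binding-expression syntax; all names here are hypothetical):

```ts
// A model of "one operator, parameterized by a well-known encoding immediate"
// as opposed to one operator per encoding.
type WellKnownEncoding = "utf-8" | "utf-8-zero-terminated" | "utf-16le" | "utf-32";

interface UtfStrBinding {
  op: "utf-str";                 // would replace utf8-str
  encoding: WellKnownEncoding;   // immediate operand chosen by the module
}

interface AllocUtfStrBinding {
  op: "alloc-utf-str";           // would replace alloc-utf8-str
  encoding: WellKnownEncoding;
  allocatorExport: string;       // exported allocator the host calls into
}
```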

If it is decided that WebIDL bindings should not provide string operations, that'd be fine, but in that case whatever is decided upon as the alternative should take it into account (note that anything based upon TextEncoder currently doesn't).

Hope that makes sense :)

annevk commented 5 years ago

Note that JavaScript strings are not UTF-16, they're 16-bit buffers. UTF-16 has constraints that JavaScript does not impose.

MaxGraey commented 5 years ago

Yes, in JavaScript most operations are not "unicode safe" and interpret those 16 bits as UCS-2, except String#fromCodePoint, String#codePointAt, String#toUpperCase/String#toLowerCase and several others. But UTF-16LE and UCS-2 have the same 16-bit storage, so for simplicity most people call it UTF-16 encoding.
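
A quick illustration of that split (standard JS/TS, no assumptions beyond a stock engine):

```ts
const clef = "\u{1D11E}";     // U+1D11E, outside the BMP, stored as a surrogate pair

clef.length;                  // 2       -> length counts 16-bit code units (the UCS-2 view)
clef.charCodeAt(0);           // 0xD834  -> a high surrogate, not a full code point
clef.codePointAt(0);          // 0x1D11E -> the surrogate-aware (UTF-16) view

const lone = "\uD834";        // a lone surrogate is a perfectly legal JS string value
lone.length;                  // 1, even though this is not well-formed UTF-16
```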

annevk commented 5 years ago

The distinction is nonetheless important because you could imagine a language having support for UTF-16 the way Rust has support for UTF-8 (8-bit buffer with constraints) and that's not a good fit for what OP is asking for.

MaxGraey commented 5 years ago

UCS-2 is a strict subset of UTF-16. That means if we use UCS-2 we can always reinterpret it as UTF-16 without any caveats, provided both have the same endianness. UTF-16 just understands surrogate pairs while UCS-2 doesn't.

UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided

So I don't think it's a big deal for the current topic

annevk commented 5 years ago

Surrogate pairs are not the issue, lone surrogates are.

MaxGraey commented 5 years ago

Yes, sure, lone surrogates are a problem for UTF-16 and UTF-8 as well: https://speakerdeck.com/mathiasbynens/hacking-with-unicode-in-2016?slide=106

And I guess that shouldn't be a problem for modern encoders/decoders?

dcodeIO commented 5 years ago

My understanding of UTF-16LE here is based on this piece of information:

Most engines that I know of use UTF-16

The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16

Ultimately this issue isn't solely about the specialties of JS strings of course, hence not so much about UCS-2 as an outdated standard.

jgravelle-google commented 5 years ago

I thought that if WebAssembly implements WebIDL bindings it should follow the WebIDL spec

Maybe. As I had just said:

Sure, @lukewagner presented at the June CG meeting, and here's the slide deck: https://docs.google.com/presentation/d/1wtAknL-UJWDoIgSbyF5paTBSpVVj-fKU4tiHMxJbSzE/edit

tl;dr does this wasm binding layer we're describing need to reason about WebIDL at its core, or is WebIDL another target with a produce/consume pair? If the latter, and we suspect that is the case, then we're free to design an IDL that better matches what we're trying to do, rather than try to retrofit that on top of WebIDL.

Full notes of the accompanying discussion here: https://github.com/WebAssembly/meetings/blob/master/2019/CG-06.md#webidl-bindings-1-2-hrs

Also, WebIDL does not specify a wire format. The second half of the sentence on DOMString:

Such sequences are commonly interpreted as UTF-16 encoded strings [RFC2781] although this is not required

Yes that's splitting hairs.

The webidl-bindings proposal already cares about the pretty special null-terminated strings (utf8‑cstr) that are only common in C/C++.

Yeah I'd been thinking that was a mistake for a while. https://github.com/WebAssembly/webidl-bindings/pull/43

In particular it was proposed in order to show the types of bindings that could be modeled. And indeed, the discussions it spawned have shown that we would need one binding per possible encoding, which maybe 10 years from now is fine. For MVP, no.

jgravelle-google commented 5 years ago

Since those encodings are relatively similar, I'd say that the implementation isn't a significant burden, while solving the issue for most modern programming languages for good.

My issue here isn't so much that the implementation would be a burden, but one of defending against scope creep.

And I'd prefer a scheme where we do spec an entirely new binding for every new encoding, rather than parameterizing over the encoding. Parameterizing doesn't save us implementation effort, but adds some complexity to the binding spec. On that basis, I don't think we're missing any elegance by not specing UTF-16 now and adding it later.

fgmccabe commented 5 years ago

+1 on this.


dcodeIO commented 5 years ago

And I'd prefer a scheme where we do spec an entirely new binding for every new encoding

+1 on this.

So, with two well-regarded voices positioned against my compromise, that'd essentially mean (as of today) there'd need to be a separate pair of instructions for each supported encoding,

likely leading to the conclusion that having more than one pair of instructions initially is not in the scope of the MVP, which is a much easier point to defend.

by not specing UTF-16 now and adding it later.

That's why I suggested the compromise in the first place, since I think that not addressing, for non-technical reasons, an issue that multiple languages will run into immediately, while seeing it coming, is wrong.

To me personally this feels like a proper implementation of the feature is being prevented through the backdoor for the wrong reasons.

jgravelle-google commented 5 years ago

for non-technical reasons

Whereas I view wasm-bindings in general as a technical mechanism to help resolve the already-non-technical problem of language interop anyway. For me a major guiding principle is that of facilitating coordination between mutually non-cooperating implementors. In particular this bit mentioned in https://github.com/WebAssembly/design/issues/1274:

Provide a Schelling point for inter-language interaction. This is easier said than done, but I think wasm should send a signal to all compiler writers, that the standard way to interoperate between languages is X.

Wherein the existence of some standardized mechanism for interop provides a natural target. (Schelling points are fascinating in general.) The risk as I perceive it is that people are going to write code whether we provide a mechanism or not. For application developers, "we have a feature coming in ~12 months, maybe" is not something they are going to wait for. So they're going to ship something. I want it to be this, and on that basis it matters hugely whether we can ship in browsers in 2020 vs 2021.

UTF-16, alone, is not going to push us back that far. I'm worried about "but we have UTF-16, so what about..." creeping in. We see that in this thread, with the incredibly-dubious utf8-cstr used as justification for just one more specific binding.

The cruxes of the issue, for me, are:

  1. From the perspective of UTF-16-using languages, I do not see the difference between shipping wasm-bindings v1 in 2020, and v2-with-UTF-16 in 2021, vs adding UTF-16 to v1, but delaying v1 to 2021.
  2. From the perspective of the broader ecosystem, shipping v1 in 2020 vs 2021 can be a massive difference.

The bit that's much more of an open question is, if/when we inevitably add UTF-16, what is the proper mechanism?

On the technical side, I don't see the difference between A) utf8-str + utf16-str and B) str(utf8) + str(utf16).

likely leading to the conclusion that having more than one pair of instructions initially is not in the scope of the MVP, which is a much easier point to defend.

because regardless of pairs of instructions, we still have pairs of encodings. The difficulties I see are 1) formalizing that in the spec, and 2) wiring up the host's existing decoders. Neither of those is made simpler by parameterizing over the encoding in the format. So assuming I'm right about limiting scope, we can choose for v1 whether we ship just utf8-str or just str(utf8). Having a parameterized encoding doesn't change the v1-ability of utf16 bindings.

I see two ways I can be wrong about that:

  1. it does reduce implementation complexity
  2. I'm making the wrong tradeoff of M vs V in MVP

Whether to parameterize the encoding is more so a matter of taste at that point, though I suspect it's slightly more work to spec, so I favor the separate-instr-per-encoding design on that basis. This is the weakest-held of my opinions though.

I mention all of this because if I am fatally wrong I would much rather know about it now than in 2022.

I have many more thoughts on the meta-side of this but I'm going to make this message a 2-parter for latency purposes.

dcodeIO commented 5 years ago

The first sentence of the explainer also reads

The proposal describes adding a new mechanism to WebAssembly for reliably avoiding unnecessary overhead when calling, or being called, through a Web IDL interface

My expectation would be that "reliably avoiding unnecessary overhead" is a priority, even for an MVP, since it's literally the first sentence, whereas

I view wasm-bindings in general as a technical mechanism to help resolve the already-non-technical problem of language interop anyway

and

I think wasm should send a signal to all compiler writers, that the standard way to interoperate between languages is X.

seem like an overarching goal that does not play well with what the proposal is trying to solve in the first place, especially since we are not even talking about exotic encodings here but UTF.

Regarding

UTF-16, alone, is not going to push us back that far. I'm worried about "but we have UTF-16, so what about..." creeping in

it looks like adding the set of well-known UTF encodings is sufficient for an MVP because it covers like 90% of languages, while just UTF-8 is not even close when looking at the list of languages above. Could as well name this proposal "WebIDL-bindings-for-C-and-Rust" then, as my expectation would be that the MVP of the spec remains irrelevant for something like AssemblyScript for who-knows-how-long since alternatives will still be faster.

  1. From the perspective of UTF-16-using languages, I do not see the difference between shipping wasm-bindings v1 in 2020, and v2-with-UTF-16 in 2021, vs adding UTF-16 to v1, but delaying v1 to 2021.
  2. From the perspective of the broader ecosystem, shipping v1 in 2020 vs 2021 can be a massive difference.

That's an assessment I do not share. While shipping it in the MVP does indeed involve additional work, I can't see how "slightly more work to spec" or "wiring up the host's existing decoders" would lead to delays of such magnitude.

Having a parameterized encoding doesn't change the v1-ability of utf16 bindings.

I agree on that, just wanted to point out the potential misconception that might arise here in that holding back on multiple instructions can magically seem more appropriate, even though the underlying concern remains unchanged (which I think is just what happened).

Edit: Maybe another point: If it was an (extensible) operand, implementers could opt to support UTF-16 early, but it's not that easy if it requires an entirely new instruction, with the only alternative being to wait.

Edit: Maybe one more point: I thought that the specs benefit from being used by multiple ecosystems (that worked for the WASM MVP at least), but what's proposed here is doing the exact opposite, essentially making the MVP only viable for the first-class club. That's sad, because we'd like to be involved.

Apart from that, I feel that I should mention that I appreciate your thorough comments, even though I don't agree with certain aspects. :)

jgravelle-google commented 5 years ago

To me personally this feels like a proper implementation of the feature is being prevented through the backdoor for the wrong reasons.

I for one welcome spirited debate, and hope it doesn't feel like we backdoor any of this. To that end, I've been wanting to put together an informal video/voice chat so interested parties can have more high-bandwidth discussions than github issues allow. I'll draft some sort of pre-work for that, probably tomorrow, and post it as an issue.

There will also be more opportunities to say "you're doing it wrong" when we have a prototype implementation in a browser, and you'll be able to target that, and show us with data how suboptimal it is exactly. My prediction is that AS->wasmBindings->Host will be more efficient than AS->JS->jsBindings->Host, even with an extra reencoding. There's a couple ways that measurement could turn out, and the right thing to do will depend on the in-practice data.

While shipping it in the MVP does indeed involve additional work, I can't see how "slightly more work to spec" or "wiring up the host's existing decoders" would lead to delays of such magnitude.

Not from UTF-16 alone, but on the assumption that a less hard-nosed stance on what is in scope for the MVP could lead to 2x the binding operations, and that would add months of time. Bit of a slippery slope.

A non-small part of that is I personally want All The Bindings for All The Languages, but if we don't have clear launch criteria we could spend years specifying this. My strategy to get everything is to make sure we have enough room to extend the design for v2+, so we can do something now and everything later. Where that line is is negotiable, and it is likely that I am overcorrecting because I have this argument with myself on a regular basis :)

Could as well name this proposal "WebIDL-bindings-for-C-and-Rust" then,

Could probably name the MVP of WebAssembly as C-and-RustAssembly on a similar basis. By analogy, 1) we see that in practice people build things on top of wasm anyway, and 2) post-MVP wasm is becoming increasingly awkward for C and Rust to support (we don't have a good story for anyref in C for example).

I agree on that, just wanted to point out the potential misconception that might arise here in that holding back on multiple instructions can magically seem more appropriate, even though the underlying concern remains unchanged

Agree. Also by "slightly more work to spec" I mean one instr w/ two params vs two instrs, because that's three parts + a composition instead of just two parts.

Maybe another point: If it was an (extensible) operand, implementers could opt to support UTF-16 early, but it's not that easy if it requires an entirely new instruction, with the only alternative being to wait.

Disagree, any wasm compiled to that target isn't interoperable either way, and the implementation effort of having both still isn't wildly different (for either producer or consumer).

Apart from that, I feel that I should mention that I appreciate your thorough comments, even though I don't agree with certain aspects. :)

Thanks, that's good to hear :)

dcodeIO commented 5 years ago

There will also be more opportunities to say "you're doing it wrong" when we have a prototype implementation in a browser

Having to re-encode twice in any UTF16->UTF8->UTF16 scenario (like AS calling JS APIs) does seem like sufficiently unnecessary overhead already that we should avoid it regardless of any eventual findings with a prototype. If the prototype shows that this is still faster, my conclusion wouldn't be that it's fast enough, but that the alternatives are too slow.

My prediction is that AS->wasmBindings->Host will be more efficient than AS->JS->jsBindings->Host, even with an extra reencoding

Make that two extra re-encodings. One use case I'm thinking of in this regard, btw, is AS code extensively calling Canvas2D APIs, which take colors, fill styles and whatnot as strings, and to me it looks like something custom, for example a lookup array mapping generated ids to string refs when calling out to the host, would be a serious contender to re-encoding hell if function imports are sufficiently optimized.
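
A sketch of that kind of workaround, with hypothetical import names and an assumed exported memory / canvas context:

```ts
declare const memory: WebAssembly.Memory;    // assumed: the module's exported memory
declare const ctx: CanvasRenderingContext2D; // assumed: the bound canvas context

// Strings are interned once and referenced by a small integer id afterwards,
// so hot calls like setting a fill style never re-decode or re-encode.
const interned: string[] = [];

const imports = {
  env: {
    internString(ptr: number, byteLength: number): number {
      const text = new TextDecoder("utf-16le")
        .decode(new Uint8Array(memory.buffer, ptr, byteLength));
      return interned.push(text) - 1;        // id handed back to the module
    },
    setFillStyle(id: number): void {
      ctx.fillStyle = interned[id];          // no string crosses the boundary here
    },
  },
};
```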

One could even go as far as to conclude that picking UTF-8 as the initial default is at least as arbitrary as picking UTF-16 as the initial default, depending on whether one is looking at this from a producer or a consumer standpoint. Like, the spearheading runtimes for the spec will be the major browsers, and the majority of bound APIs will be JS APIs. So one could argue that picking UTF-16 would make a more reasonable initial default. Not saying that, but excluding UTF-16 from the MVP feels even more arbitrary on this background to me.

To that end, I've been wanting to put together an informal video/voice chat so interested parties can have more high-bandwidth discussions than github issues allow. I'll draft some sort of pre-work for that, probably tomorrow, and post it as an issue.

πŸ‘

fgmccabe commented 5 years ago

I am not sure I agree that UTF16 is a better default than UTF8. UTF16 suffers from the same issues (unpaired surrogates) as UTF8 (so it's not simpler); whereas for many strings UTF8 is more compact. Furthermore, the host bindings story is not intended to be specific to browsers, although that is a major initial use case. It also supports (will support) inter-module bindings and bindings to non-browser host APIs.

Having said that, I am not completely averse to having a UTF16 variant of any string binding operator. The delta is probably minimal (and we may spend more time arguing about it than it would take to implement).

There is a slippery slope argument to be had; and that is why Jacob has been rightfully resisting. On the other hand, strings are arguably THE most important data structure. On the other other hand, where do we stop with strings? E.g., it is entirely possible that more applications use code pages than UTF. ...


annevk commented 5 years ago

I think the nuance of my comments got missed above, but to reiterate neither UTF-8 nor UTF-16 technically allow unpaired surrogates. (They have an identical value space.) A DOMString being a 16-bit buffer does allow for representing unpaired surrogates. USVString does not. ByteString is best represented by an 8-bit buffer.

A question with either UTF-8 or UTF-16 representation is how you deal with unpaired surrogates. Map them to U+FFFD, trap, something else?
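
For reference, the Encoding Standard's existing UTF-8 machinery already picks the "map to U+FFFD" answer (assuming a stock browser/Node environment):

```ts
// TextEncoder converts to a USVString first: a lone surrogate becomes U+FFFD.
new TextEncoder().encode("\uD800");
// -> Uint8Array [0xEF, 0xBF, 0xBD], i.e. the UTF-8 encoding of U+FFFD

// Decoding the WTF-8-style bytes of a lone surrogate doesn't round-trip either:
new TextDecoder("utf-8").decode(new Uint8Array([0xed, 0xa0, 0x80]));
// -> "\uFFFD\uFFFD\uFFFD"
```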

dcodeIO commented 5 years ago

I am not sure I agree that UTF16 is a better default than UTF8.

Yeah, I mostly made that argument to underline that UTF-16 is not less important, even though I know about the importance of UTF-8 when looking at it the other way around. Ideally, both would be supported in the MVP so both perspectives are equally covered.

There is a slippery slope argument to be had; and that is why Jacob has been rightfully resisting.

I understand that keeping the MVP MV is an important aspect, and that being careful with what to include makes perfect sense. Though I think that the case of UTF-16 does not fulfill the knock-out criterion.

On the other other hand, where do we stop with strings? E.g., it is entirely possible that more applications use code pages than UTF

My suggestion would be to stop at UTF-8 and UTF-16 for the MVP, which sufficiently cover the most common encoding on the producer plus the most common encoding on the consumer side, and think about everything else in a v2.

A question with either UTF-8 or UTF-16 representation is how you deal with unpaired surrogates. Map them to U+FFFD, trap, something else?

That's a good question, yeah. My immediate thought on this is that the binding shouldn't impose any restrictions that'd invalidate likely scenarios or lead to significant overhead, and leave producing errors to the implementations that actually require it. In the best case, that's no scan at all, while in the worst it's one scan to assert additional restrictions.

That'd essentially mean that the binding wouldn't care about the interpretation of the byte data, which it most likely won't do at runtime anyway in order to be fast, with the encoding being more of an indicator. Now I'm not exactly sure about the UTF-8<->UTF-16 case, but it looks to me like the problem is similar in both, so piping through unpaired surrogates as-is, delegating any checks further down the pipeline, seems to be the most sensible thing to do.

jgravelle-google commented 5 years ago

One could even go as far as to conclude that picking UTF-8 as the initial default is at least as arbitrary as picking UTF-16 as the initial default, depending on whether one is looking at this from a producer or a consumer standpoint

That is an extremely valid point. Honestly my general point of "let's only ship one encoding in MVP" would be content with only having UTF-16. I suspect that would be controversial :D

and we may spend more time arguing about it than it would take to implement

I'm weirdly kind of ok with that being the calculus, though I wonder if that sets a dangerous precedent...

There is a slippery slope argument to be had

Thinking about this more, cstr and utf-16 are different enough in character that I think the slipperiness of the slope is less dangerous. cstr is for one language and has immediate obviously-better options available, utf-16 is used in more places, up to and including the browser itself.

That "browser itself" part is probably the most compelling, because even in a C program you may want maximum performance, and because Blink's wtfStrings are UTF-16, you will necessarily pay a single ASCII->UTF-16 encoding cost at every boundary. For that reason a C program that wanted to avoid that could use a JSString that is UTF-16 encoded, and reuse that for multiple calls without needing to re-encode each time.

For that reason I think we should probably ship with both. It's not strictly M, but it should be useful enough to warrant it. Minimality is not itself an axiomatic condition.

My immediate thought on this is that the binding shouldn't impose any restrictions that'd invalidate likely scenarios or lead to significant overhead, and leave producing errors to the implementations that actually require it.

πŸ‘

annevk commented 5 years ago

For a 16-bit buffer no validation would be required (please don't call it UTF-16 in that case), but for an 8-bit buffer you would need to define some conversion process: even if you permit WTF-8 (which allows unpaired surrogates), there'd still be invalid sequences that you'd need to handle somehow, as they cannot be mapped to a 16-bit buffer (i.e., DOMString). And for USVString there are tighter requirements, which, if you don't handle them via a type, will instead incur a cost at the binding layer, which isn't exactly great. So not imposing any restrictions at all does not seem like the kind of thing you'd want here.

rossberg commented 5 years ago

Thanks @jgravelle-google for trying to hold the line.

A fundamental property of this proposal is that the implementation complexity is gonna be O(N^2) in the number of binding operators for each given type. So adding "just one more" is not the no-brainer it may seem.

There is a choice to be made. Either keep this mechanism Web-specific. Then it can be overloaded with all sorts of Web/JS goodies like UTF-16; respective engines are hyper-complicated already. But I doubt any non-Web engine would want to implement it.

Or make this mechanism more universal, i.e., the snowman idea. Then it is crucial for wider adoption in engines to keep it as small as possible. Putting Unicode transcoders into every core Wasm engine is not the route to go down, and misses the point of Wasm.

All The Bindings for All The Languages

That is completely unrealistic and cannot be the goal. Rule of thumb: there are (at least) as many data representations as there are languages. There is no canonical set. It would literally mean hundreds of language-specific binding operators (remember: N^2 complexity) -- all baked into a code format that supposedly was low-level and language-independent.

dcodeIO commented 5 years ago

So, from those two comments, I take away that asserting the well-formedness of either or both of UTF-8 and UTF-16 would require at least a validation scan on every boundary (since the producer might be doing something wrong), leading to the binding having to do significantly more work, in turn requiring Wasm engines to ship significantly more code, which contradicts the purpose of this proposal.

Hence I suggest adding to the spec that the binding does not ensure well-formedness of the encoding for those reasons, and that either

or

I'm not sure which of the provided alternatives is best. The first doesn't restrict producers, the second doesn't restrict consumers. From a solely "avoid unnecessary overhead" perspective, it looks like the second might be more straightforward because it requires fewer general defenses.

To be more concrete, if a potentially ill-formed UTF-16->UTF-8 conversion is taking place, the unpaired surrogate should become an ill-formed single code point representing its value (as three bytes), essentially piping through ill-formedness. Likewise, if a potentially ill-formed UTF-8->UTF-16 conversion is taking place, the respective code point should become an unpaired surrogate again, essentially piping through ill-formedness.

In general, the WTF-8 encoding seems to fit this well, since it has been created in the presence of this relatively common scenario (if I'm not missing something it does differently from what I've written above).

WTF-8 (Wobbly Transformation Format βˆ’ 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired. https://simonsapin.github.io/wtf-8/

WTF-16 is sometimes used as a shorter name for potentially ill-formed UTF-16 https://simonsapin.github.io/wtf-8/#ill-formed-utf-16

WTF-8 (Wobbly Transformation Format – 8-bit) is an extension of UTF-8 where the encodings of unpaired surrogate halves (U+D800 through U+DFFF) are allowed. This is necessary to store possibly-invalid UTF-16, such as Windows filenames. Many systems that deal with UTF-8 work this way without considering it a different encoding, as it is simpler. https://en.wikipedia.org/wiki/UTF-8#WTF-8

It appears that this is the least common denominator here. I have no strong opinion on additional encodings, but do somewhat agree with rossberg. Yet, the two encodings mentioned here appear necessary in JS/Browser<->JS/Browser (here: not only browsers but anything that does it the JS way) and C/Native<->C/Native scenarios, which are by far the most likely ones, while also allowing both to talk to each other and giving any other use case the option to pick the one that fits it best.

If I'm missing something, please point out what it is :)
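
A sketch of such a "pipe ill-formedness through" encoder, following the WTF-8 definition quoted above (a hypothetical helper, not spec text):

```ts
// Encodes a JS string (potentially ill-formed UTF-16) into WTF-8-style bytes:
// unpaired surrogates are encoded as the usual 3-byte sequence for their code
// point value instead of being replaced or rejected.
function encodeWtf8(text: string): Uint8Array {
  const out: number[] = [];
  for (const cp of codePointsAllowingLoneSurrogates(text)) {
    if (cp < 0x80) out.push(cp);
    else if (cp < 0x800) out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    else if (cp < 0x10000) out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    else out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
  }
  return Uint8Array.from(out);
}

function* codePointsAllowingLoneSurrogates(text: string): Generator<number> {
  for (let i = 0; i < text.length; i++) {
    const c = text.charCodeAt(i);
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    if (c >= 0xd800 && c <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      yield 0x10000 + ((c - 0xd800) << 10) + (next - 0xdc00); // paired surrogate
      i++;
    } else {
      yield c; // BMP code unit or lone surrogate, passed through as-is
    }
  }
}
```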

jgravelle-google commented 5 years ago

A fundamental property of this proposal is that the implementation complexity is gonna be O(N^2) in the number of binding operators for each given type.

Strong disagree. We only need N^2 work if each pair of bindings needs a unique implementation. If we don't specify the runtime characteristics, then a reasonable implementation might look like: Translate from A.wasm to some engine-specific IR (O(N) implementation effort) + engine IR to B.wasm (O(N) implementation effort). This then leaves room to optimize a subset of end-to-end bindings, e.g. we could then specially optimize UTF-16 -> UTF-16 to be a memcpy, but not UTF-8 -> UTF-8 (or vice-versa), which implies O(1) additional work that's separable from a minimal implementation of the standard.
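
A toy model of that argument; the engine-internal representation and the fast path shown here are both hypothetical:

```ts
// O(N + M): each source encoding lifts into one engine-internal string value,
// each target encoding lowers from it; pairs then compose for free.
const lift: Record<string, (bytes: Uint8Array) => string> = {
  "utf-8":    (b) => new TextDecoder("utf-8").decode(b),
  "utf-16le": (b) => new TextDecoder("utf-16le").decode(b),
};

const lower: Record<string, (text: string) => Uint8Array> = {
  "utf-8":    (t) => new TextEncoder().encode(t),
  "utf-16le": (t) => {
    const out = new Uint8Array(t.length * 2);
    for (let i = 0; i < t.length; i++) {
      out[i * 2] = t.charCodeAt(i) & 0xff;
      out[i * 2 + 1] = t.charCodeAt(i) >> 8;
    }
    return out;
  },
};

// Individual hot pairs can later be special-cased without touching the generic
// paths, e.g. matching encodings collapse to a copy:
function bindString(bytes: Uint8Array, from: string, to: string): Uint8Array {
  if (from === to) return bytes.slice();   // optimized end-to-end pair
  return lower[to](lift[from](bytes));     // generic O(N + M) route
}
```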

Or make this mechanism more universal, i.e., the snowman idea.

I believe that's what I described just here, if there's more missing there let me know. The key is whether the central snowman types are conceptual or reified, and how much wiggle room we leave in the spec.

That is completely unrealistic and cannot be the goal. Rule of thumb: there are (at least) as many data representations as there are languages. There is no canonical set. It would literally mean hundreds of language-specific binding operators (remember: N^2 complexity) -- all baked into a code format that supposedly was low-level and language-independent.

Just because I want something doesn't mean I think it's realistic.

Though the broader point is that I don't believe we need completely seamless bindings to support all the languages well-enough to be worth doing. The mapping doesn't need to be total. I think of it as being similar to a Voronoi diagram, in that we just need a set of points that are reasonably close to existing languages.

For example, if we only had UTF-16, C would need to re-encode from a char* into a wchar*. But if we only had cons-list-of-gc-chars, C would need special support in the compiler to be able to handle that. I don't expect every language to map perfectly, but not all imperfect mappings are equal.

rossberg commented 5 years ago

We only need N^2 work if each pair of bindings needs a unique implementation.

To actually have any benefit from additional binding types the engine will need a specialised implementation for it, or am I missing something? What's the point of adding them when they do not optimise at least a few paths? And if they do, that puts you at O(N^2) (note the O, though).

jgravelle-google commented 5 years ago

Preface: this got rather long and general. About half of the detail/pedantry here isn't meant to be directly confrontational, but more as a response to getting these questions a bunch, and actually putting my thoughts in a public forum.


To actually have any benefit from additional binding types the engine will need a specialised implementation for it, or am I missing something?

For any performance benefit, and even then only kinda. There's three tiers of speed:

  1. needing to go from Wasm->JS, through some amount of JS conversion glue code (creating JS strings from memory, table management for references), then from JS->Host
  2. being on-par with JS, being able to go Wasm->Host roughly as quickly as JS->Host
  3. being able to eliminate almost all conversions, making Wasm->Host calls on par with Host->Host calls

(Note that "Host" here could also, and almost surely does, mean "another Wasm module". If Host was always the embedder, this would be O(N) effort in all cases)

Today we have 1). I believe that even with an O(N+M) non-specialized solution, we should get within a sub-2x factor of 2); we may have an additional conversion, but can avoid the dynamic checks of JS, which also needs to convert once. I'm not actually sure we can reach 3), and surely not for all N^2 combinations.

My performance goal is to get us from 1) to 2), and for that I believe an O(N+M) strategy is sufficient.

Further, my non-performance goal is to have an inter-module communication channel that does better than C FFI. To that end, more binding types offer more flexibility for producers, and offer more value on the "better than C FFI" front. On one extreme, we could model bindings as being isomorphic to a C ABI, but at that point we have done nothing to improve the state of the world in that dimension. And I believe that facilitating an ecosystem of intercommunicating, distributed, mutually-untrusting Wasm modules is more important than ensuring a performance characteristic of 3) at the binding layer.

at least a few paths? And if they do, that puts you at O(N^2)

Depends on how you define "a few". If a few means "a constant x% of all possible pairs", then yes. If a few means "these three specific pairs we care about", then no. My understanding is that engines optimize based on usage, and that usage patterns tend to follow Zipf's Law, making the effort O(N+M) for simple bindings + O(log(N*M)) for optimized.

But, more crucially, the degree of implementation effort becomes a choice for the implementors. If we mandate that all bindings must be equally fast, (aside from being impossible) then we require O(N^2) work. If we simply state that all bindings must function, then engines have more leeway in how much surface area requires optimization, and can gradually increase (or decrease!) the amount over time.

rossberg commented 5 years ago

@jgravelle-google, I'm having trouble making out whether you are talking about webidl bindings or the generalised snowman bindings idea. Because going through JS seems to implicitly assume the former.

I'm fine with putting all sorts of ad-hoc complexity into webidl bindings, since they target JS and the Web, which are concrete and hyper-complicated ad-hoc beasts already. Go wild, I don't mind!

I'm only arguing about snowman. If we want to abstract away from webidl then we'd neither want to specialise for particular languages nor for particular host environments. Your points (1) and (2) are not even applicable in that setting. Moreover, (3) is fundamentally impossible when module and host don't share the same representations. So from that perspective, we should avoid picking arbitrary points of comparison and focus on simplicity and generality.

But, more crucially, the degree of implementation effort becomes a choice for the implementors.

The problem I see with making that a selling point is that it doesn't mesh well with one of Wasm's basic goals: predictable performance.

dcodeIO commented 5 years ago

Exploring the "simplicity" and "generality" roads a bit further:

As such, I can't see how snowman bindings aren't affected as well, e.g., C (or Rust) talking to .NET (or AssemblyScript), independently of whether one of these is the host. To me it seems that this requires a reasonable compromise anyway, with the foremost thing to generalize being to make no distinction between a client and a host (but that'd mean WebIDL == snowman).