WebAssembly / interface-types

Refresh, rescope and rename #21

Closed lukewagner closed 5 years ago

lukewagner commented 5 years ago

This PR updates the Host Bindings proposal as discussed at TPAC and after a bunch of follow-up discussions.

The PR has an overview that I won't restate but one additional proposed change that I'd like to make (if we accept this PR) is to rename this repo from "host-bindings" to "webidl-bindings", to reflect the new scope of the proposal (for reasons given in the FAQ of the PR).

To make sure we have proper visibility and discussion, I'll add an agenda item to the April 2 CG meeting for discussion.

lukewagner commented 5 years ago

(Accidentally posted before filling out the PR, somehow. Updated with title/description.)

lukewagner commented 5 years ago

Great to hear, thanks for the feedback. I'll hold off merging until a chance to discuss in the next CG meeting.

annevk commented 5 years ago

Very exciting!

> (given how, sometimes, specifications are refactored over time to have different IDL).

This is probably worth calling out explicitly, as this is indeed somewhat common (including changing `long` to `double` and such). I take it this works for both input and return values? Really cool tactic, though I guess that means there will be a little bit more overhead.

Am I correct that the intention here is to always handle DOMString via a UTF-8 buffer? What happens if the DOMString contains lone surrogates?

It might be worth considering to map ByteString to a buffer (which I'm assuming is a byte sequence). The type is primarily used for byte sequences that happen to look a lot like strings and can be manipulated as such to some extent (e.g., HTTP header values). I guess the question is to what extent that distinction is meaningful for Wasm. Concretely, if a header value is 0xFF, do you want to see that as 0xFF or U+00FF expressed as two UTF-8 bytes?
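For concreteness, here is a sketch of the two interpretations (using `TextEncoder` in JS purely as an illustration of what a UTF-8 string binding would produce; the variable names are made up):

```javascript
// A ByteString holding the single byte 0xFF surfaces in JS as the
// one-code-unit string "\u00FF" (e.g. an HTTP header value).
const headerValue = "\u00FF";

// Mapped through a UTF-8 string binding, that code point becomes the
// two-byte UTF-8 sequence 0xC3 0xBF.
const utf8 = new TextEncoder().encode(headerValue);
console.log(Array.from(utf8, (b) => b.toString(16))); // ["c3", "bf"]

// Mapping ByteString to a buffer instead would preserve the raw byte.
const raw = Uint8Array.from(headerValue, (c) => c.charCodeAt(0));
console.log(Array.from(raw, (b) => b.toString(16))); // ["ff"]
```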

lukewagner commented 5 years ago

@annevk Thanks for the comments; PTAL to see if they were addressed correctly.

> This is probably worth calling out explicitly as this is indeed somewhat common, including changing `long` to `double` and such.

I added a third bullet to "Question 2" in the "JS API Integration" section.

> I take it this works for both input and return values? Really cool tactic, though I guess that means there will be a little bit more overhead.

Yes, input and return values (both being part of the "Web IDL Signature", which can mismatch). This is an instantiation-time check that only takes the slower, roundabout path on mismatch, so hopefully no overhead in the common case.

> Am I correct that the intention here is to always handle DOMString via a UTF-8 buffer? What happens if the DOMString contains lone surrogates?

Ah, good question; I glossed over this. I think this only comes up for the `alloc-utf8-str` incoming binding operator. My default assumption is that this case should trap (in general there are a number of cases where binding operators can trap). I'll add a note to `alloc-utf8-str`.

> It might be worth considering to map ByteString to a buffer (which I'm assuming is a byte sequence). The type is primarily used for byte sequences that happen to look a lot like strings and can be manipulated as such to some extent (e.g., HTTP header values). I guess the question is to what extent that distinction is meaningful for Wasm. Concretely, if a header value is 0xFF, do you want to see that as 0xFF or U+00FF expressed as two UTF-8 bytes?

Good point! Thinking about it some more, ByteString is practically a BufferSource; its only essential difference is that the ECMAScript Binding layer spits out a JS string instead of an ArrayBuffer (for obvious historical reasons). Thus, I'll remove ByteString from the string ops and add it to the buffer ops.

annevk commented 5 years ago

Trapping is some kind of error? So if you store a lone surrogate in a Text node's data in JavaScript you cannot get it out through Wasm? Mapping the lone surrogates to U+FFFD seems preferable if so.

cc @hsivonen

alexcrichton commented 5 years ago

We've been discussing the problem of lone surrogates recently with wasm-bindgen, and our default usage of TextEncoder to convert a JS string to a Rust UTF-8 string does use the replacement character today. One consequence is that we initially didn't realize the conversion from a JS string to a Rust string was lossy, although we do know now!
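For reference, a quick illustration of that lossiness: `TextEncoder.encode` takes a USVString, so any lone surrogate is replaced with U+FFFD before encoding, and the original string cannot be recovered.

```javascript
// A lone high surrogate is silently replaced with U+FFFD (the
// replacement character), which encodes as the bytes EF BF BD.
const lossy = new TextEncoder().encode("\uD800");
console.log(Array.from(lossy, (b) => b.toString(16))); // ["ef", "bf", "bd"]

// Round-tripping therefore does not give back the original string.
const roundTripped = new TextDecoder().decode(lossy);
console.log(roundTripped === "\uFFFD"); // true
```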

Our current thinking is that we will add a method to detect if a JS string is invalid utf-16 (aka has a lone surrogate), and then code that wants to be robust will do the moral equivalent of taking anyref and doing runtime checking.
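A minimal sketch of what such a detection helper could look like (the name `hasLoneSurrogate` is hypothetical, not wasm-bindgen's actual API; it walks the UTF-16 code units looking for an unpaired surrogate half):

```javascript
// Returns true if `s` contains an unpaired (lone) surrogate code unit,
// i.e. the string is not well-formed UTF-16.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // High surrogate: must be immediately followed by a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair; skip the low half
      } else {
        return true;
      }
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

console.log(hasLoneSurrogate("abc"));          // false
console.log(hasLoneSurrogate("\uD83D\uDE00")); // false (valid pair)
console.log(hasLoneSurrogate("\uD800"));       // true  (lone high)
console.log(hasLoneSurrogate("a\uDC00b"));     // true  (lone low)
```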

In that sense I think that I might agree with @annevk that using replacement characters on lone surrogates might make more sense since it matches TextEncoder's behavior. It would be important to document though!

Pauan commented 5 years ago

To clarify a bit, when the user types a single (non-BMP) character in an <input> on Windows, it will send two input events (one for each half of the surrogate pair).

So when the first input event happens, it sends the string from JS to Rust (which uses TextEncoder to convert from UTF-16 to UTF-8), and TextEncoder replaces the unpaired surrogate with the replacement character.

That's pretty terrible since now the string doesn't match what the user typed (and this then leads to other bugs).

So personally I would prefer a hard error, but I agree with @alexcrichton that matching existing TextEncoder behavior is probably correct (albeit unfortunate).

annevk commented 5 years ago

@Pauan is there an issue tracking that particular way of dealing with user input? That seems like something that ought to be fixed in implementations (and specification, if it's unclear).

(And yeah, TextEncoder expects the caller to not pass in lone surrogates (because it uses USVString, which replaces lone surrogates with U+FFFD, not DOMString, which preserves them). We could potentially add a static or some such that tells you whether your string contains them. Please file something at https://github.com/whatwg/encoding if that would be useful.)

Pauan commented 5 years ago

> @annevk is there an issue tracking that particular way of dealing with user input?

On our side we have https://github.com/rustwasm/wasm-bindgen/issues/1348 (linked earlier by @alexcrichton).

The solution we decided upon was to just completely ignore input events which contain unpaired surrogates. This of course requires detecting (in JS) whether a string contains unpaired surrogates or not.

As for a tracking issue in browsers or the spec, not that I'm aware of. But then again, I haven't really looked.

> That seems like something that ought to be fixed in implementations (and specification, if it's unclear).

If it could be fixed in browsers, then that would be fantastic! Right now we have to ignore input events which contain unpaired surrogates.

> We could potentially add a static or some such that tells you whether your string contains them.

Yeah, that would be useful! Right now we have to use a function like this, but it can probably be implemented faster in browsers.

lukewagner commented 5 years ago

Oops, right. Ultimately, I assumed the spec here would delegate to TextEncoder, so I'll just say that explicitly instead of trying to incorrectly guess what it would do.

lukewagner commented 5 years ago

Following the unanimous poll at the last CG meeting and no objections since, I'll merge, rename the repo, and we can continue to iterate on the design going forward.

fgmccabe commented 5 years ago

In this case there is a clear conflict between the prior precedents of Web IDL and those of WebAssembly.

IMO Web IDL should prevail, because otherwise there will be a lot of unnecessary reformatting needed. Furthermore, we would end up having to restate the Web IDL specification and also having to own that restatement.

On the other hand, there is no binary representation of Web IDL, so we have to invent it. In that case the binary format is consistent with WebAssembly.

On Thu, Apr 4, 2019 at 8:57 PM Andreas Rossberg notifications@github.com wrote:

@rossberg commented on this pull request.

In proposals/webidl-bindings/Explainer.md https://github.com/WebAssembly/webidl-bindings/pull/21#discussion_r272437868 :

````diff
+used as Web IDL Types.
+
+As an example, the Web IDL Dictionary:
+
+```WebIDL
+dictionary Contact {
+  DOMString name;
+  long age;
+};
+```
+
+could be defined in a WebAssembly module with this (strawman text-format)
+statement:
+
+```wasm
+(webidl-type $Contact (dict (field "name" DOMString) (field "age" long)))
````
Can we try to uniformly render custom sections in the text format by means of the proposed custom annotation syntax https://github.com/WebAssembly/annotations/blob/master/proposals/annotations/Overview.md? That would prevent all sorts of compatibility and tooling issues.


-- Francis McCabe SWE

rossberg commented 5 years ago

FWIW, the custom annotation syntax should probably be able to embed existing WebIDL syntax:

```wasm
(@webidl
  dictionary Contact {
    DOMString name;
    long age;
  };
)
```

would be syntactically well-formed. Whether it's desirable stylistically I'll leave for others to decide.

lukewagner commented 5 years ago

Good to know; probably best to open a new issue to discuss Web IDL vs s-expression syntax.

littledan commented 5 years ago

I'd suggest not using the WebIDL surface syntax, as this has changed over time and is continuing to change.