WebAssembly / stringref

Is there a better name for `stringview_wtf16` #12

Open wingo opened 2 years ago

wingo commented 2 years ago

I hesitate to open this issue, but here goes!

When implementing stringref in V8 I have to butt up against the GC proposal and how stringref fits in. Something like this:

stringref = (ref string null)
stringview_wtf8 = (ref stringview_wtf8 null)
stringview_wtf16 = (ref stringview_wtf16 null)
stringview_iter = (ref stringview_iter null)

But I notice that it's a bit silly to have the shorthand of everything else (anyref, arrayref, stringref, etc) have a "ref" suffix but not string views.

Also, I think we can all agree that stringview_wtf8 is not a nice name for a type :)

Three possibilities:

  1. Leave it as it is. It's fine.
  2. Change the views to have shorter names and add suffixes to shorthands. Like:
    stringref = (ref string null)
    stringv8ref = (ref stringv8 null)
    stringv16ref = (ref stringv16 null)
    stringviterref = (ref stringviter null)
  3. Galaxy brain: remove the shorthands. Instead always use (ref string null), (ref stringview_wtf8 null), etc, using the two-byte formulation.

Bikeshed painting time!

kripken commented 2 years ago

I noticed this now as well when starting to prototype this proposal in Binaryen. Specifically, the natural "heap type" of stringref is string (just like dataref => data), but that breaks down for the views and iter as you said.

I'd vote for option 2 myself. Though I'd prefer option 3 overall, if we changed the GC proposal that way (I find the shorthands more confusing than useful... though maybe if the default nullability were consistent I'd change my mind).
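
To make the naming mismatch concrete, here is a minimal Python sketch. The type names are taken from this thread; the helper names (`follows_ref_suffix_rule`, `irregular`) and the two tables are hypothetical and exist only for illustration. It checks which shorthands obey the rule "shorthand = heap type + ref".

```python
# Shorthand -> heap type, as written in the opening comment of this thread.
CURRENT_SHORTHANDS = {
    "stringref": "string",
    "stringview_wtf8": "stringview_wtf8",
    "stringview_wtf16": "stringview_wtf16",
    "stringview_iter": "stringview_iter",
}

# Shorthand -> heap type under option 2 from the opening comment.
OPTION_2_SHORTHANDS = {
    "stringref": "string",
    "stringv8ref": "stringv8",
    "stringv16ref": "stringv16",
    "stringviterref": "stringviter",
}


def follows_ref_suffix_rule(shorthand: str, heap_type: str) -> bool:
    """True if the shorthand is exactly the heap type plus a 'ref' suffix."""
    return shorthand == heap_type + "ref"


def irregular(table: dict) -> list:
    """List the shorthands in `table` that break the suffix rule."""
    return [name for name, heap in table.items()
            if not follows_ref_suffix_rule(name, heap)]


print(irregular(CURRENT_SHORTHANDS))   # the three view/iter names
print(irregular(OPTION_2_SHORTHANDS))  # empty list
```

Under the current names only stringref follows the rule; under option 2 every shorthand does.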

dcodeIO commented 2 years ago

Perhaps a fourth possibility that might have other benefits as well: Remove the views and move them as operations upon stringref? For instance, the two strategies for producers I see are:

  1. Make views from refs at the boundary on their way in, and refs from views at the boundary on their way out. This might not be viable if eq is only present on the refs.
  2. Make views on demand whenever needed, possibly a lot of times depending on how the string is used (say in loops or in the worst case each time .charCodeAt is invoked). Now I would expect engines to cache/avoid this redundant work somehow, but then the operations could as well be on top of stringref directly I guess, matching about what other instructions like i32.load_xy do, just with encodings?

Thought I'd mention it because doing so would solve this issue as well (just stringref and stringiterref) :)

conrad-watt commented 2 years ago

Remove the views and move them as operations upon stringref?

FWIW this is also my inclination. This would also address current discussions in this repo in the vein of "what's the story for views?" (e.g. subtyping, UTF8 policy, JS API, this naming issue), which I think might be smells that the distinction between views and stringref could be collapsed.

EDIT: also, there are currently redundancies like the need for length, encode etc instructions for both stringref and associated views.

jakobkummerow commented 2 years ago

As a reminder, the purpose of the views is to make certain expensive operations explicit. If the "Java-like language running in a browser" scenario is all you care about, then getting rid of the views is perfectly fine (and since that's the scenario I personally care most about, I wouldn't mind). But I expect that the larger Wasm community cares a lot about non-Java-like languages and non-browser Wasm engines, and to make the various possible combinations of source language's string encoding choice and engine's internal string encoding choice all as efficient (and controllable/predictable) as possible, the views are a crucial tool for which no equivalently-capable alternative has been suggested so far.

Regarding the thread-starting options: I generally don't care much about the text format (because "nobody writes it by hand" aside from test cases; and for debugging, what we have / option (1) seems good enough), so I don't feel strongly about it either way. I'm not a huge fan of the existing approach to shorthands (dataref and friends), so if it was up to me I'd probably pick option (3). We can always add shorthands later if/when we have data indicating that they would provide tangible benefits.

dcodeIO commented 2 years ago

But I expect that the larger Wasm community cares a lot about non-Java-like languages and non-browser Wasm engines

This seems similar to how the larger Wasm community cares a lot about Java-like languages and browser Wasm engines, yet string turned out to be fundamentally incompatible with both, while what is suggested here is that stringref would sacrifice itself for non-Java-like languages and non-browser environments. I'd suggest we at least be consistent, that is, either

  1. string and stringref indeed target different languages and environments, then each is optimal in its native habitat - or -
  2. string and stringref should work well for different languages and environments, but then string needs to be revised.

jakobkummerow commented 2 years ago

@dcodeIO I don't know what "string" you're talking about, or what "sacrifice". This proposal, stringref, is specifically designed to work well for all source languages and all engines (definitely including Java/Kotlin/Dart in the browser, while also scaling to utf8-based source languages and/or utf8-based engines). From past discussions, I believe you are specifically interested in zero-overhead interop with JavaScript, and I can assure you that that's being provided by the stringref design.

dcodeIO commented 2 years ago

Indeed, having a functioning string type for interop with JavaScript is what I need specifically, but I am equally interested in designing Wasm in a consistent way for many languages and many environments. Hence my observation above: we'll eventually have two string types under the same name (the other is Interface Types', now Component Model (CM), string) that represent fundamentally different concepts. The string type is the precedent here: the CG has decided that it is fine to have a string type that is fundamentally incompatible with Java-like languages and browser Wasm engines. Given that similar considerations apply to stringref as well, that leads to the two options above if overall consistency is a goal, which in turn relates to the aspect you mentioned once the prior line of thought (the "sacrifice") is taken into account.

rossberg commented 2 years ago

I believe you are specifically interested in zero-overhead interop with JavaScript, and I can assure you that that's being provided by the stringref design.

I think this statement needs qualification. In practice, it is only going to be true for languages from the mid/late-90s when UCS-2 still was a thing, or some later ones designed specifically to target/interop with those.

rossberg commented 2 years ago

FWIW, option (2) seems preferable to me, though I would suggest renaming stringviter to stringv32. I would also suggest making the view/iter functionality more regular (i.e., give them all a simple get function, drop the redundant next for iter).

jakobkummerow commented 2 years ago

@conrad-watt Putting operations onto stringref sounds nicely simple in principle, but how would you specify operations that need indices of any kind? E.g. "substring", or just "get-nth-thing": how would you, even informally, describe the performance expectations for get_nth_utf8_byte, when the stringref may well be wtf16-encoded under the hood? Or vice versa, how would you make sure that translating a Java-style get_nth_wtf16_codeunit-based loop to Wasm doesn't become quadratic on an engine that prefers utf8-based string storage internally? And would the end result of having "instruction families" like substring_by_utf8_byte, substring_by_wtf16_codeunit, substring_by_codepoint really be preferable over the stringview based approach?
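
To illustrate the quadratic concern, here is a minimal Python sketch, not engine code. It assumes the engine stores strings as well-formed UTF-8 (real WTF-8 with lone surrogates is not handled), and the names `utf16_unit_at_scan` and `Utf16View` are hypothetical. A view-less "get nth UTF-16 code unit" has to scan from the start on every call, while a view pays the transcoding cost once and then indexes in constant time.

```python
import struct


def utf16_unit_at_scan(utf8: bytes, index: int) -> int:
    """Return the UTF-16 code unit at `index` by scanning UTF-8 from the start.

    Each call is O(n), so a loop over all indices is O(n^2).
    """
    if index < 0:
        raise IndexError(index)
    pos = 0   # byte offset into the UTF-8 data
    unit = 0  # index of the first UTF-16 code unit of the current code point
    while pos < len(utf8):
        lead = utf8[pos]
        # Sequence length from the lead byte (assumes well-formed UTF-8).
        size = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        cp = ord(utf8[pos:pos + size].decode("utf-8"))
        if cp < 0x10000:
            if index == unit:
                return cp
            unit += 1
        else:
            if index in (unit, unit + 1):
                v = cp - 0x10000
                high, low = 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
                return high if index == unit else low
            unit += 2
        pos += size
    raise IndexError(index)


class Utf16View:
    """Transcode once (O(n)), then answer each index in O(1)."""

    def __init__(self, utf8: bytes) -> None:
        data = utf8.decode("utf-8").encode("utf-16-le")
        self.units = struct.unpack("<%dH" % (len(data) // 2), data)

    def get(self, index: int) -> int:
        return self.units[index]


text = "a\u00e9\U0001F926".encode("utf-8")
view = Utf16View(text)
same = all(view.get(i) == utf16_unit_at_scan(text, i)
           for i in range(len(view.units)))
print(same)
```

Both paths return the same code units; the difference is only where the O(n) cost is paid, which is what the explicit view makes visible to the producer.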

there are currently redundancies like the need for length, encode etc instructions for both stringref and associated views

I agree that these look redundant at first glance; but even something as simple-sounding as "string length" turns out to be surprisingly complex in practice. I've found this article insightful (long read, but recommended: I for one have learned a lot). Certainly, a Wasm string system could just choose any one of the definitions for "🤦🏼‍♂️".length and hope that anyone translating their language to Wasm can somehow work with that choice, but that seems... rather over-optimistic to me. The stringviews are a way to offer different use cases what they need, while also minimizing overhead (as far as possible) and maximizing implementation freedom.
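
As a small illustration of how the definitions diverge, here is a Python sketch. It assumes the emoji above is the five-code-point sequence U+1F926 U+1F3FC U+200D U+2642 U+FE0F, written with escapes so the source encoding does not matter; the variable names are only for illustration.

```python
# Face palm + skin tone modifier + zero-width joiner + male sign + variation selector.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(facepalm)                            # 5 Unicode code points
utf16_units = len(facepalm.encode("utf-16-le")) // 2   # 7 UTF-16 code units
utf8_bytes = len(facepalm.encode("utf-8"))             # 17 UTF-8 bytes

print(code_points, utf16_units, utf8_bytes)  # prints "5 7 17"
```

A code-point-based language, a UTF-16-based language and a UTF-8-based language would each report a different "length" for the same string, which is why a single length instruction on stringref cannot serve all of them.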

conrad-watt commented 2 years ago

@jakobkummerow FWIW your initial comment in this issue prompted me to do some thinking and helped me to be more comfortable with the current design. My only remaining concern would be ensuring that the cost of creating a view at each boundary doesn't get too high, but I agree that analogous and arguably worse concerns can arise in the view-less design (e.g. your get_nth_wtf16_codeunit concern above).