Open wingo opened 2 years ago
I noticed this now as well when starting to prototype this proposal in Binaryen. Specifically, the natural "heap type" of stringref is string (just like dataref => data), but that breaks down for the views and iter as you said.
I'd vote for option 2 myself. Though I'd prefer option 3 overall, if we changed the GC proposal that way (I find the shorthands more confusing than useful... though maybe if the default nullability were consistent I'd change my mind).
Perhaps a fourth possibility that might have other benefits as well: remove the views and move them as operations upon stringref? For instance, the two strategies for producers I see are:
eq is only present on the refs..charCodeAt is invoked).
Now I would expect engines to cache or avoid this redundant work somehow, but then the operations could as well sit on top of stringref directly, I guess, roughly matching what other instructions like i32.load_xy do, just with encodings?
I thought I'd mention it because doing so would solve this issue as well (just stringref and stringiterref) :)
Remove the views and move them as operations upon stringref?
FWIW this is also my inclination. This would also address current discussions in this repo in the vein of "what's the story for views?" (e.g. subtyping, UTF8 policy, JS API, this naming issue), which I think might be smells that the distinction between views and stringref could be collapsed.
EDIT: also, there are currently redundancies like the need for length, encode etc. instructions for both stringref and associated views.
As a reminder, the purpose of the views is to make certain expensive operations explicit. If the "Java-like language running in a browser" scenario is all you care about, then getting rid of the views is perfectly fine (and since that's the scenario I personally care most about, I wouldn't mind). But I expect that the larger Wasm community cares a lot about non-Java-like languages and non-browser Wasm engines, and to make the various possible combinations of source language's string encoding choice and engine's internal string encoding choice all as efficient (and controllable/predictable) as possible, the views are a crucial tool for which no equivalently-capable alternative has been suggested so far.
Regarding the thread-starting options: I generally don't care much about the text format (because "nobody writes it by hand" aside from test cases; and for debugging, what we have / option (1) seems good enough), so I don't feel strongly about it either way. I'm not a huge fan of the existing approach to shorthands (dataref and friends), so if it was up to me I'd probably pick option (3). We can always add shorthands later if/when we have data indicating that they would provide tangible benefits.
But I expect that the larger Wasm community cares a lot about non-Java-like languages and non-browser Wasm engines
It seems this aspect is somewhat similar to the larger Wasm community caring a lot about Java-like languages and browser Wasm engines, yet string turned out to be fundamentally incompatible with Java-like languages and browser Wasm engines, while what is suggested here is that stringref would sacrifice itself for non-Java-like languages and non-browser environments. I'd like to suggest being at least consistent, that is, either
string and stringref indeed target different languages and environments, and then each is optimal in its native habitat
- or -
string and stringref should work well for different languages and environments, but then string needs to be revised.
@dcodeIO I don't know what "string" you're talking about, or what "sacrifice". This proposal, stringref, is specifically designed to work well for all source languages and all engines (definitely including Java/Kotlin/Dart in the browser, while also scaling to utf8-based source languages and/or utf8-based engines).
From past discussions, I believe you are specifically interested in zero-overhead interop with JavaScript, and I can assure you that that's being provided by the stringref design.
Indeed, having a functioning string type for interop with JavaScript is what I need specifically, but I am equally interested in designing Wasm in a consistent way for many languages and many environments. Hence my observation above that we'll eventually have two string types under the same name (the other being Interface Types', now CM, string) that represent fundamentally different concepts. With string being the precedent here (that is, the CG has decided that it is fine to have a string type that is fundamentally incompatible with Java-like languages and browser Wasm engines), and given that similar considerations apply to stringref as well, that leads to the two options above if overall consistency is a goal. This in turn relates to the aspect you mentioned, when taking the prior line of thought (the "sacrifice") into account.
I believe you are specifically interested in zero-overhead interop with JavaScript, and I can assure you that that's being provided by the stringref design.
I think this statement needs qualification. In practice, it is only going to be true for languages from the mid/late-90s when UCS-2 still was a thing, or some later ones designed specifically to target/interop with those.
FWIW, option (2) seems preferable to me, though I would suggest renaming stringviter to stringv32. I would also suggest making the view/iter functionality more regular (i.e., give them all a simple get function, drop the redundant next for iter).
@conrad-watt Putting operations onto stringref sounds nicely simple in principle, but how would you specify operations that need indices of any kind? E.g. "substring", or just "get-nth-thing": how would you, even informally, describe the performance expectations for get_nth_utf8_byte, when the stringref may well be wtf16-encoded under the hood? Or vice versa, how would you make sure that translating a Java-style get_nth_wtf16_codeunit-based loop to Wasm doesn't become quadratic on an engine that prefers utf8-based string storage internally? And would the end result of having "instruction families" like substring_by_utf8_byte, substring_by_wtf16_codeunit, substring_by_codepoint really be preferable over the stringview based approach?
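To illustrate the quadratic-loop concern, here is a rough Python sketch (not anything from the proposal; the helper names utf16_unit_at and make_wtf16_view are made up for this example, and the input is assumed to be valid UTF-8 without lone surrogates). Indexing the n-th UTF-16 code unit directly on UTF-8 storage needs a scan from the start on every call, whereas a view pays the transcoding cost once and then indexes in constant time.

```python
def utf16_unit_at(utf8: bytes, n: int) -> int:
    """Return the n-th UTF-16 code unit of a string stored as valid UTF-8.

    Without an auxiliary index this scans from the start: O(n) per call,
    so a loop over all indices is quadratic.
    """
    unit = 0  # UTF-16 code units passed so far
    i = 0     # byte offset into the UTF-8 storage
    while i < len(utf8):
        lead = utf8[i]
        if lead < 0x80:
            size, cp = 1, lead
        elif lead < 0xE0:
            size, cp = 2, lead & 0x1F
        elif lead < 0xF0:
            size, cp = 3, lead & 0x0F
        else:
            size, cp = 4, lead & 0x07
        for b in utf8[i + 1 : i + size]:
            cp = (cp << 6) | (b & 0x3F)  # fold in continuation bytes
        if cp < 0x10000:
            if unit == n:
                return cp
            unit += 1
        else:
            v = cp - 0x10000  # split into a surrogate pair
            if unit == n:
                return 0xD800 | (v >> 10)
            if unit + 1 == n:
                return 0xDC00 | (v & 0x3FF)
            unit += 2
        i += size
    raise IndexError(n)


def make_wtf16_view(utf8: bytes) -> list:
    """Hypothetical 'view': transcode once up front, then index in O(1)."""
    raw = utf8.decode("utf-8").encode("utf-16-le")
    return [int.from_bytes(raw[j : j + 2], "little") for j in range(0, len(raw), 2)]


data = "a\u00e9\U0001F926".encode("utf-8")
view = make_wtf16_view(data)
scanned = [utf16_unit_at(data, k) for k in range(len(view))]  # quadratic overall
print(scanned == view)
```

Both paths yield the same code units; the difference is only where the cost is paid, which is the thing the views are meant to make explicit.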
there are currently redundancies like the need for length, encode etc instructions for both stringref and associated views
I agree that these look redundant at first glance; but even something as simple-sounding as "string length" turns out to be surprisingly complex in practice. I've found this article insightful (long read, but recommended: I for one have learned a lot). Certainly, a Wasm string system could just choose any one of the definitions for "🤦🏼♂️".length and hope that anyone translating their language to Wasm can somehow work with that choice, but that seems... rather over-optimistic to me. The stringviews are a way to offer different use cases what they need, while also minimizing overhead (as far as possible) and maximizing implementation freedom.
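As a concrete illustration of the competing "length" definitions, here is a small Python sketch (Python is used only for illustration). The emoji is spelled out with explicit escapes, assuming the usual five-code-point sequence: face palm, skin tone modifier, zero-width joiner, male sign, variation selector.

```python
# Facepalm emoji as an explicit code point sequence (assumed spelling):
# U+1F926 FACE PALM, U+1F3FC skin tone modifier, U+200D ZWJ,
# U+2642 MALE SIGN, U+FE0F VARIATION SELECTOR-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

utf8_bytes = len(s.encode("utf-8"))            # count seen by a UTF-8 based language
utf16_units = len(s.encode("utf-16-le")) // 2  # count seen by JS/Java-style .length
code_points = len(s)                           # Python's len counts code points

print(utf8_bytes, utf16_units, code_points)    # prints: 17 7 5
```

A user would still call this one character, so none of the three numbers is "the" length; each is the right answer for a different source language.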
@jakobkummerow FWIW your initial comment in this issue prompted me to do some thinking and helped me to be more comfortable with the current design. My only remaining concern would be ensuring that the cost of creating a view at each boundary doesn't get too high, but I agree that analogous and arguably worse concerns can arise in the view-less design (e.g. your get_nth_wtf16_codeunit concern above).
I hesitate to open this issue, but here goes!
When implementing stringref in V8 I have to butt up against the GC proposal and how stringref fits in. Something like this:
But I notice that it's a bit silly to have the shorthand of everything else (anyref, arrayref, stringref, etc) have a "ref" suffix but not string views.
Also, I think we can all agree that stringview_wtf8 is not a nice name for a type :)
Three possibilities:
(ref string null), (ref stringview_wtf8 null), etc, using the two-byte formulation.
Bikeshed painting time!