WebAssembly / stringref

Other
37 stars 2 forks source link

Is stringref a subtype of eqref? #20

Open eqrion opened 2 years ago

eqrion commented 2 years ago

If so, how do we determine the identity of strings that come from a JS host so that we can perform ref.eq?

My understanding is that JS strings don't have an identity that can meaningfully be spoken of. Implementations have flexibility to canonicalize and mess around with which string operations create a new pointer value.

wingo commented 2 years ago

I think that you would agree that it would be an error if we allowed WebAssembly to distinguish strings that can't be distinguished on the JS level -- that "foo" == "foo" to JavaScript, but (ref.eq "foo" "foo") -> 0 to WebAssembly. I was expecting that people would just use string.eq on strings and that strings would thus not be a subtype of eqref. Does that sound about right to you @eqrion ?

jakobkummerow commented 2 years ago

To add to that:

(+1 to stringref not being a subtype of eqref. string.eq will work for comparing JS-created strings with Wasm-created strings, with "content comparison" semantics like JS's mystring == "foo".)

eqrion commented 2 years ago

Yes, not having a stringref <: eqref and relying on string.eq would be great. Just wanted to get this clarified.

titzer commented 2 years ago

+1 to stringref not having identity.

rossberg commented 2 years ago

I don't think it's that easy. And I'm highly uncomfortable with the "because JavaScript" argument. For better or worse, there are languages that have strings with identity. Should we punish them?

jakobkummerow commented 2 years ago

@rossberg in what way do you feel that anything is getting "punished"? string.eq works for anyone, does it not?

rossberg commented 2 years ago

@jakobkummerow, having string identity implies the existence of an operation to actually observe it by comparing string identity. And that's expected to be a constant time operation.

wingo commented 2 years ago

My experience in Scheme (gosh I am an old person now) is that languages in which strings have identity lead to unspecified behavior and thus users shying away from using the feature -- pointer equality over strings then becomes a language gotcha or misfeature. Like whether (eq? "str" "str") is true or false -- it's explicitly not specified, and might even have different results in an interpreter versus a compiler, even in the same implementation. It's a footgun that people learn to avoid.

Just for information, @rossberg do you know of languages with immutable strings in which strings have identity?

Pauan commented 2 years ago

@wingo do you know of languages with immutable strings in which strings have identity?

JavaScript, PHP, Lua, and Julia.


@rossberg String equality varies quite a lot between languages. Most languages do not do 100% string interning, and so string equality is not constant time. It's quite rare for a language to have guaranteed O(1) string equality.

If ref.eq works on strings, then that implies that strings are always 100% interned, which imposes strict limits (and performance costs) on the host environment.

rossberg commented 2 years ago

@Pauan, JS strings do not have identity. I'm not sure about the others.

But you can e.g. add Scheme and OCaml to that list. For OCaml, = and == test structural and reference equality, respectively. Like for Scheme, the latter is underspecified and thus not generally recommended on strings except for algorithmic optimisations, where it is used. (Moreover, == is both ubiquitous and polymorphic, thus a compiler could hardly afford a value representation that is not a subtype of eqref.)

OO languages like Java typically also have string objects with identity, though it's unclear whether they would be able to implement those using plain stringrefs in the first place.

That structural equality is not O(1) is exactly my point. That makes it no substitute for reference equality.

Pauan commented 2 years ago

@rossberg JS strings have identity in the sense that there isn't any way to do a pointer comparison in JS, the only identity operator is ===, which does behave like an identity function. Identity isn't based on whether the operation is O(n) or not, it's based on its semantic behavior. Semantically, === in JS does behave like string identity.

I specifically excluded languages like Scheme, Python, Java, etc. because they do have ways to distinguish strings at the pointer level. Which means that two equal strings can have different pointers, and so their identity is muddled.

The definition of identity I am using is the same one used by the famous egal paper. For immutable objects, the only sane behavior is that identity is structural equality.

That structural equality is not O(1) is exactly my point. That makes it no substitute for reference equality.

Yes, but the only way to get O(1) reference equality with strings is to intern every single string, at runtime, after every string operation. Lua does that, but most languages don't.

That's a very severe performance cost, especially for large strings. As I said before, a language having O(1) string equality is very rare. I believe even JavaScript doesn't have guaranteed O(1) string equality.

If ref.eq exists for strings, then that forces all Wasm hosts to implement interning for every string operation.

I think it would be much better to have a separate intern function which can be manually called to intern a string, instead of forcing every string to be interned.

rossberg commented 2 years ago

@Pauan, I see, but usually folks say values have identity to mean that you can observe "pointer equivalence" between them. While the intent of the egal paper is of course correct (for a user-facing language!), their way of redefining the meaning of "identity" itself is perhaps less than helpful in this discussion. But if you prefer, s/identity/reference equality/g in all I (and others) said above. :)

If ref.eq exists for strings, then that forces all Wasm hosts to implement interning for every string operation.

Not all hosts, it's only an issue for a host language like JS that does not have reference equality on strings but simultaneously expects "seamless" interop. And even then we could under-specify the behaviour at the boundary, if that's what it takes.

Don't forget that our primary customers are languages compiling to Wasm, not host languages, so we ought to meet their needs first of all.

Pauan commented 2 years ago

@rossberg usually folks say values have identity to mean that you can observe "pointer equivalence" between them.

When you say "pointer equivalence" do you mean that equal strings will also be equal pointers? Or do you mean like in Python/Scheme/Java/etc. where the pointers aren't equivalent even if the string is structurally equal?

Don't forget that our primary customers are languages compiling to Wasm, not host languages, so we ought to meet their needs first of all.

Of course, but we are talking about a host string type. The stringref type is primarily intended to communicate with the host (and with other Wasm modules which accept stringref).

When it comes to a language's internal string type (like a Java String, or a Python string) they will likely continue to use linear memory (or the GC proposal) in order to get exactly the semantics that they want.

It sounds like you are expecting most languages to use stringref even for their internal string type, but it's unlikely that a stringref would have the exact right behavior, API, and performance guarantees. So I expect that in practice most languages will convert from stringref into their internal string type.

So that leads to a question: are there any languages which would desire to use stringref for their internal string type, and they are only capable of using stringref iff stringref supports ref.eq?

it's only an issue for a host language like JS that does not have reference equality on strings

But JS is expected to be the most common host (thanks to the web), and so I don't think we should ignore the JS host concerns either.

Pauan commented 2 years ago

By the way, if a language wants to use stringref internally, and it requires ref.eq, couldn't it wrap the stringref in a GC struct? Then it would use ref.eq on the wrapper struct, and string.eq on the inner stringref.

That would allow the language to fully support ref.eq on strings without requiring the host to implement ref.eq for stringref. This also gives the language complete control over how strings are compared, and also complete control over interning. I think that level of control will be needed for many languages.

jakobkummerow commented 2 years ago

OO languages like Java typically also have string objects with identity, though it's unclear whether they would be able to implement those using plain stringrefs in the first place.

Luckily we'll find out soon: this proposal's design specifically took Java's needs into account, and Java (in the form of the J2Wasm compiler) is the first user (as far as I'm aware) to target stringrefs.

if a language wants to use stringref internally, and it requires ref.eq, couldn't it wrap the stringref in a GC struct?

Exactly.

rossberg commented 2 years ago

When you say "pointer equivalence" do you mean that equal strings will also be equal pointers? Or do you mean like in Python/Scheme/Java/etc. where the pointers aren't equivalent even if the string is structurally equal?

Pointer equivalence is just equality on pointers. It's fully agnostic to the contents of the objects it is pointing to and does not imply anything about it (other than equal pointers implying equal contents).

Don't forget that our primary customers are languages compiling to Wasm, not host languages, so we ought to meet their needs first of all.

Of course, but we are talking about a host string type. The stringref type is primarily intended to communicate with the host (and with other Wasm modules which accept stringref).

I don't think that is a useful characterisation. If holding host strings was the sole purpose, then externref would already be a perfectly fine type for that.

The point of stringref is to enable sharing the same string representation between program and host, so that communicating it is cheap. Naturally, this only has any benefit if the compiled language is able to use it for its own strings. Otherwise you'd just shift the cost of boundary copies from the engine to the language runtime. That would make the whole proposal moot, AFAICS.

But JS is expected to be the most common host (thanks to the web), and so I don't think we should ignore the JS host concerns either.

With everything else being equal, I agree we should accommodate JS. But when everything else is not equal, JS interop has lower priority.

By the way, if a language wants to use stringref internally, and it requires ref.eq, couldn't it wrap the stringref in a GC struct?

Then every string would require twice the allocations and every access would be twice as expensive (and unlike with what we currently accept e.g. for GCed array objects, there would be no way forward out of that). I doubt that languages would be interested in making this choice, in particular, since exposing physical equality is primarily a means for optimising code, so requiring a more expensive implementation to support it would be self-defeating.

Pauan commented 2 years ago

Has there been any thought about pointer-equality exposing internal details of the host, and thus could be a security leak (or enable a side-channel for communication)?

@rossberg If holding host strings was the sole purpose, then externref would already be a perfectly fine type for that.

I didn't say sole purpose, just the primary purpose. Many languages simply cannot use stringref at all for their internal string type (for example, C, C++, Rust, etc.) And even if a language would like to use stringref, it might need more control over the API and performance guarantees.

So if the goal is for stringref to replace most language's internal string type, I think that's a much larger proposal than what we have right now. Perhaps that discussion can be delayed to a future proposal?

Then every string would require twice the allocations and every access would be twice as expensive (and unlike with what we currently accept e.g. for GCed array objects, there would be no way forward out of that).

I think languages will need to bear that cost regardless, because many languages attach extra metadata to strings (internal object slots, caching, methods, etc.)

So which languages do you think would be able to adopt stringref for their internal string type without any sort of extra wrapping? And which of those languages absolutely require ref.eq for stringref?

eqrion commented 2 years ago

Has there been any thought about pointer-equality exposing internal details of the host, and thus could be a security leak (or enable a side-channel for communication)?

I don't know about a security leak, but having stringref expose pointer-equality would definitely expose internal details of current web engines that have never been exposed before, and are likely very different. From discussions with JS engine folks at Mozilla, it's not something we could reasonably implement and is likely a non-starter for us.

I think there are two mismatches we have to choose from:

  1. Have host VM's (such as JS) that don't support pointer-equality of strings support pointer-equality in stringref
  2. Have languages (such as C#) that support pointer-equality of strings use stringref without pointer-equality for their strings

(1) seems unlikely to work due to the above implementation detail concerns. I wonder if (2) is easier to support by having these languages fallback to using string.eq as a proxy for reference-equality? And if that doesn't work, they could wrap their strings in a GC object that supports ref.eq.

rossberg commented 2 years ago

@Pauan:

@rossberg If holding host strings was the sole purpose, then externref would already be a perfectly fine type for that.

I didn't say sole purpose, just the primary purpose. Many languages simply cannot use stringref at all for their internal string type (for example, C, C++, Rust, etc.) And even if a language would like to use stringref, it might need more control over the API and performance guarantees.

Right, and that's one of the key problems. With my apologies to @wingo, but it seems to me that this proposal has not yet made up its mind. Either it extends core Wasm to serve language implementations, but then it needs to serve a sufficiently broad range of languages and be more flexible than what's currently offered (for example, enable ref equality, enable constant-time random byte access, a predictable cost model, etc). Or it is only meant to serve the interaction with specific hosts, but then it is out of place in the core language and the functionality should somehow be moved to the respective API.

wingo commented 1 year ago

My goodness, I just ran into this for a Scheme compiler; I am using (ref eq) as my uniform representation and it is annoying that I will have to wrap stringref in a struct, when Scheme doesn't even specify very much about when strings need to be eq? or not eq?. For once, I am wanting to agree with @rossberg, that it would be nice to allow Wasm to distinguish strings that JS cannot!