Clarity on units of string

WebAssembly / stringref

Other

37 stars 2 forks source link

Clarity on units of string #65

Open Maxdamantus opened 1 year ago

Maxdamantus commented 1 year ago

I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming languages and VMs.

This might be just a wording issue, but I think the current description of "What's a string?" could be problematic:

Therefore we define a string to be a sequence of unicode scalar values and isolated surrogates. The code units of a Java or JavaScript string can be interpreted to encode such a sequence, in the WTF-16 encoding form.

If this is taken to mean that a string is a sequence of Unicode code points (ie, "Unicode scalar values" and "isolated surrogates", basically any integer from 0x0 to 0x10FFFF), this does not correspond with "WTF-16" or JavaScript strings, since there are sequences of isolated surrogates that can't be distinctly encoded in WTF-16.

eg, the sequence [U+D83D, U+DCA9] (high surrogate, low surrogate) doesn't have an encoding form in WTF-16. Interpreting the WTF-16/UTF-16 code unit sequence <D83D DCA9> produces the sequence [U+1F4A9] (a Unicode scalar value, 💩). There are 1048576 occurrences of such sequences (one for every code point outside of the BMP), where the "obvious" encoding is already used to encode a USV.

> "\uD83D\uDCA9"
'💩'
> "\u{1F4A9}"
'💩'
> [..."\uD83D\uDCA9"].length
1
> [..."\u{1F4A9}"].length
1

Note that this is different to how strings work in Python 3, where a string is indeed a sequence of any Unicode code points:

>>> "\U0000D83D\U0000DCA9"
'\ud83d\udca9'
>>> "\U0001F4A9"
'💩'
>>> len("\U0000D83D\U0000DCA9")
2
>>> len("\U0001F4A9")
1

If this proposal is suggesting that strings work the same way as in Python 3, I think implementations will likely[0] resort to using UTF-32 in some cases, as I believe Python implementations do (I think Python implementations usually use UTF-32[1] essentially for all strings, though they will switch between using 8-bit, 16-bit and 32-bit arrays depending on the range of code points used). Other than Python, I'm not actually sure what language implementations would benefit from such a string representation.

As a side note, the section in question also lumps together Python (presumably 3) and Rust, though this might result from a misunderstanding that should hopefully be explained above. Rust strings are meant to be[2] valid UTF-8, hence they correspond to sequences of Unicode scalar values, but as explained above, Python strings can be any distinct sequence of Unicode code points.

[0] Another alternative would be using a variation of WTF-8 that preserves UTF-16 surrogates instead of normalising them to USVs on concatenation, though this seems a bit crazy.

[1] Technically this is an extension of UTF-32, since UTF-32 itself doesn't allow encoding of code points in the surrogate range.

[2] This is at least true at some API level, though technically the representation of strings in Rust is allowed to contain arbitrary bytes, where it is up to libraries to avoid emitting invalid UTF-8 to safe code: https://github.com/rust-lang/rust/issues/71033

rossberg commented 1 year ago

Other than Python, I'm not actually sure what language implementations would benefit from such a string representation.

FWIW, Haskell strings also consist of code points:

ghci> "\x1F4A9"
"\128169"
ghci> "\xD83D\xDCA9"
"\55357\56489"
ghci> putStrLn "\x1F4A9"
💩
ghci> putStrLn "\xD83D\xDCA9"
??

Pauan commented 1 year ago

Swift strings are an encoding-independent array of Unicode code points:

https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/

But the point of this proposal is that every language benefits from using stringref at the boundaries, even if they don't use stringref internally.

rossberg commented 1 year ago

every language benefits from using stringref at the boundaries, even if they don't use stringref internally.

Not specifically, unless they want to interop with a language with JS-like strings at the boundary. We should assume that embedder languages are as diverse as embedded languages, and consequently, stringref semantics generally doesn't match either side. Imagine a Python3 embedding API for Wasm.

Pauan commented 1 year ago

Not specifically, unless they want to interop with a language with JS-like strings at the boundary.

A stringref is conceptually an array of Unicode scalar values. It does not specify a particular encoding.

As the overview says:

Therefore we define a string to be a sequence of unicode scalar values and isolated surrogates.

This proposal does not require a WebAssembly implementation to use WTF-16 to represent its strings internally.

Internally the Wasm engine can represent a stringref however it wants (ropes, unions, breadcrumbs, etc.), the stringref proposal does not require a particular implementation strategy.

stringref is encoding-agnostic and generic, but it supports operations to convert into / from a variety of different encodings.

Currently the spec supports conversion for WTF-8 and WTF-16 encodings, but it would be easy to add more.

stringref is intended to be a general purpose string type which can be used at the boundary of every language, regardless of the language's internal string encoding.

consequently, stringref semantics generally doesn't match either side. Imagine a Python3 embedding API for Wasm.

The stringref proposal does not require both sides to have the same string encoding. The point is that stringref is an intermediary: languages can convert their internal string type into a stringref, and then pass that stringref to another Wasm module. That Wasm module can then convert the stringref into its own internal string type.

Why is that better than using say... array i8? Because the stringref implementation doesn't specify a particular encoding, the implementation can be optimized to be faster in some situations (perhaps even zero-copy). Using a hardcoded type like array i8 means those optimizations cannot happen.

Maxdamantus commented 1 year ago

Swift strings are an encoding-independent array of Unicode code points:

https://docs.swift.org/swift-book/documentation/the-swift-programming-language/stringsandcharacters/

As far as I can tell, Swift strings are sequences of Unicode scalar values (as mentioned in your link), not code points, so they're comparable to safe Rust strings, but the API hides away the encoding.

eg, it doesn't seem possible to construct a string in Swift containing the code point U+D83D (a high surrogate), but it is possible in Python. "\u{D83D}" in Swift is a compile-time error and String([Character(Unicode.Scalar(0xD83D)!)]) fails at runtime (particularly, Unicode.Scalar(0xD83D) returns nil).

But the point of this proposal is that every language benefits from using stringref at the boundaries, even if they don't use stringref internally.

The point I'm trying to raise in this issue is that as described, the proposal seems to require a new string representation, but it also claims "No new string implementations on the web: allow re-use of JS engine's strings" as one of the requirements.

It's possible that the intention is for stringref to correspond with JS strings, but if that's the case, it should presumably be described as a sequence of UTF-16 code units. Implementations would still be allowed to use alternative encodings (eg, WTF-8, or arrays of 32-bit integers, as long as they treat concatenation of possibly ill-formed string specially).

A stringref is conceptually an array of Unicode scalar values. It does not specify a particular encoding.

The proposal says that it can also have "isolated surrogates", which requires special encoding. As explained above, "isolated surrogates" and "Unicode scalar values" as a base unit of strings is not compatible with UTF-16/WTF-16 (or even strict WTF-8), since there are sequences of isolated surrogates where the obvious encoding is already used by UTF-16 for Unicode scalar values.

Pauan commented 1 year ago

It's possible that the intention is for stringref to correspond with JS strings, but if that's the case, it should presumably be described as a sequence of UTF-16 code units.

No, that is not the intention. externref already exists and can be used for JS strings. If the only goal was to use JS strings in Wasm, then externref already works and fulfills that goal.

The intention is for stringref to be an encoding-agnostic string representation that can be used by every language for interop at the boundaries, while also allowing efficient transferring of host strings (such as JS strings).

This can be accomplished in many ways, it is not necessary to hardcode stringref as WTF-16.

JS engines do not use WTF-16 for JS strings, instead they use multiple different internal representations (Latin1, WTF-16, ropes, etc.)

Therefore, hardcoding WTF-16 for stringref would NOT allow for zero-copy transferring of JS strings. In order to fulfill the goal of zero-copy transferring of JS strings, the internal representation of stringref must not be hardcoded.

Similar to JS engines, it is expected that Wasm engines might choose to have multiple different internal representations for stringref. The Wasm engine will transparently switch between those internal representations as needed.

And when creating views on a stringref, there are many different implementation strategies (such as Swift breadcrumbs). Making it encoding-agnostic means that the Wasm engine can choose whatever strategy works best.

Hardcoding WTF-16 forces copies of JS strings, makes stringref less useful for interop, and prevents those sorts of internal optimizations.

That's why stringref has value beyond just externref, and why it has value beyond a hardcoded encoding like WTF-8 or WTF-16.

The proposal says that it can also have "isolated surrogates", which requires special encoding. As explained above, "isolated surrogates" and "Unicode scalar values" as a base unit of strings is not compatible with UTF-16/WTF-16 (or even strict WTF-8)

Yes, which is why the views are defined as WTF-8 / WTF-16, and not UTF-8 / UTF-16. The views give bytes, not Unicode code points. This is already covered in the overview.

The stringref proposal also provides operators that make it easy for languages to validate that the bytes are valid UTF-8 / UTF-16. Those operators can be made very efficient (which isn't possible with array i8, for example).

Maxdamantus commented 1 year ago

No, that is not the intention. externref already exists and can be used for JS strings. If the only goal was to use JS strings in Wasm, then externref already works and fulfills that goal.

Correct me if I'm wrong, but with externref you'd still need to call into JS to actually access string contents or construct string values, which would likely be relatively expensive (not sure what the SOTA is for inlining JS and WASM code) compared to the native string access operations in this proposal. And of course this assumes a JS host.

Therefore, hardcoding WTF-16 for stringref would NOT allow for zero-copy transferring of JS strings. In order to fulfill the goal of zero-copy transferring of JS strings, the internal representation of stringref must not be hardcoded.

I don't see why this is the case. The contents of the stringref are not held in WebAssembly memory, so the reference can be backed by whatever string representation is available and appropriate.

Looking more closely at the Concatenation section of the proposal, the WTF-16/WTF-8 semantics are already hardcoded into stringref, which I guess suggests that this is really just a wording issue (though there are still conflicting statements relating this to Python semantics).

Similar to JS engines, it is expected that Wasm engines might choose to have multiple different internal representations for stringref. The Wasm engine will transparently switch between those internal representations as needed.

Right. JS implementations already do it, so I'd expect a Wasm implementation could do it too, just as long as it pretends that the string is made up of UTF-16 code units. This doesn't imply that the user needs to see the UTF-16 code units or UTF-16 string offsets, but it has implications around what the set of possible strings is.

Pauan commented 1 year ago

Correct me if I'm wrong, but with externref you'd still need to call into JS to actually access string contents or construct string values, which would likely be relatively expensive (not sure what the SOTA is for inlining JS and WASM code) compared to the native string access operations in this proposal.

Yes, if you want to do string operations on the JS string, then you would need to somehow import the JS string operations. Calls into JS are fast, so that's not a big deal.

And of course this assumes a JS host.

Yes, which is why stringref is nice, because it normalizes the behavior and operations across all hosts, instead of being only for JS.

I don't see why this is the case. The contents of the stringref are not held in WebAssembly memory, so the reference can be backed by whatever string representation is available and appropriate.

If the spec says that the encoding is WTF-16 (and thus provides constant-time access to WTF-16 bytes), then it would have to convert the JS string (which might not be WTF-16) into WTF-16.

If the string representation is left unspecified, then the Wasm engine can do whatever it wants, including just sending over the JS string directly without any conversion.

Looking more closely at the Concatenation section of the proposal, the WTF-16/WTF-8 semantics are already hardcoded into stringref

No it is not hardcoded, the implementation does not need to use WTF-16 bytes for the internal representation. That note is simply saying that when concatenating isolated surrogate pairs, it needs to combine the two isolated surrogates into a single surrogate pair. That does not force the Wasm engine to use a WTF-16 implementation.

You seem to be mixing up the internal representation of stringref with the observable behavior of the stringref operations. They are not the same thing.

Right. JS implementations already do it, so I'd expect a Wasm implementation could do it too, just as long as it pretends that the string is made up of UTF-16 code units.

stringref does not pretend to be any encoding, the encoding is always completely unspecified.

It is the stringref views that pretend to be a particular encoding (such as WTF-8 or WTF-16).

You can have multiple views on the same stringref (e.g. a WTF-8 and WTF-16 view on the same stringref).

When you convert from a stringref into a stringref view, this does have a (potential) performance cost. But after that conversion is done, operations on the stringref view should be fast.

So it is expected that languages will accept a stringref, convert it into a view which matches the language's internal string representation, and then do operations on that view.

e.g. Rust would accept a stringref, convert it into a WTF-8 view, then do operations on that view. But Java would accept a stringref, convert it into a WTF-16 view, and then do operations on that view.

The views is what makes stringref work for every language, because different languages can have different views of the same stringref.

Maxdamantus commented 1 year ago

You seem to be mixing up the internal representation of stringref with the observable behavior of the stringref operations. They are not the same thing.

I've assumed that when you've talked about "hardcoding", you're still only talking about the specification. The implementation is obviously allowed to do whatever it wants as long as it matches the behaviour described by the specification. This includes using alternative underlying string representations, as already happens in JavaScript implementations, where the ECMAScript specification describes strings as being sequences of 16-bit integers.

It seems that semantically, this proposal is suggesting that stringrefs be equivalent (ie, have a one-to-one correspondence) to JS strings. The Concatenation section is specified such that concatenation of stringrefs is equivalent to concatenation of the corresponding JS strings. The "What's a string?" section however does not clearly state this, and it in fact sounds like it's stating something else, as I explained in my initial post.

I think it should be fairly obvious that a stringref implementation can decide to use WTF-8 internally, in which case if a user requests a WTF-8 view using string.as_wtf8, there need be no UTF-8/UTF-16 conversion, but this really seems like an implementation matter, not a specification matter. It seems like use of WTF-8 should be irrelevant at a specification level except when iterating a WTF-8 view, where the implementation needs to convert from whatever internal representation into WTF-8 (if the implementation happens to be using WTF-8 as the underlying representation for this string, great, this conversion is a no-op).

Note that well-formed WTF-8 strings correspond directly to potentially ill-formed UTF-16 strings (aka, WTF-16 strings). WTF-16 and well-formed WTF-8 both represent an array of 16-bit integers, and it seems like it should be up to an implementation to choose which one (if any) to use for the underlying string representation, but presumably it would be simpler if the specification clearly stated that the string is a sequence of 16-bit integers.

Pauan commented 1 year ago

I've assumed that when you've talked about "hardcoding", you're still only talking about the specification.

Yes, and the point of the spec is that the encoding of stringref is not specified. If it was specified, then that would be hardcoding.

Because WebAssembly is very low level (at a similar level as CPU bytecode or VM bytecode), WebAssembly specs make certain guarantees about behavior, performance, and the exact layout of memory.

The stringref proposal intentionally does not make guarantees about the exact memory layout of a stringref, instead it makes guarantees about the behavior and performance of stringref views.

It seems that semantically, this proposal is suggesting that stringrefs be equivalent (ie, have a one-to-one correspondence) to JS strings. The Concatenation section is specified such that concatenation of stringrefs is equivalent to concatenation of the corresponding JS strings.

Internally, no, it does not require that. It only requires that when concatenating isolated surrogates, they are combined into a surrogate pair.

That does not impose any restrictions on the internal implementation of stringref, because that behavior can be trivially accomplished with any internal encoding.

For example, let's suppose that a WebAssembly engine chooses to implement stringref with WTF-8. And let's suppose that the WebAssembly program concatenates these two stringrefs together:

"\uD801"
"\uDC37"

Because it uses WTF-8 internally, those two stringrefs are represented with the following byte sequences:

00ed 00a0 0081
00ed 00b0 00b7

And when concatenating, the new stringref will have the following byte sequence:

00f0 0090 0090 00b7

This behavior is unambiguous and it follows naturally from the behavior of WTF-8 and surrogate pairs.

But this is different from WTF-16, which would simply concatenate the two byte sequences directly:

d801 dc37

I think it should be fairly obvious that a stringref implementation can decide to use WTF-8 internally, in which case if a user requests a WTF-8 view using string.as_wtf8, there need be no UTF-8/UTF-16 conversion, but this really seems like an implementation matter, not a specification matter.

Correct, which is why the specification leaves the encoding of stringref unspecified, to allow those sorts of implementations.

Note that well-formed WTF-8 strings correspond directly to potentially ill-formed UTF-16 strings (aka, WTF-16 strings).

It corresponds conceptually, but not in terms of actual memory layout. WebAssembly is low level, WebAssembly programs can inspect and modify individual bytes, WebAssembly programs and WebAssembly engines care about memory layout.

That's why the stringref spec does not specify a particular memory layout, instead the engine has the flexibility to implement stringref however it wants (as long as it fulfills the requirements in the spec).

it seems like it should be up to an implementation to choose which one (if any) to use for the underlying string representation, but presumably it would be simpler if the specification clearly stated that the string is a sequence of 16-bit integers.

Yes it is up to the implementation to decide its internal representation, and no it would not be simpler or clearer if the spec said that a string is a sequence of 16-bit integers, because that would restrict the internal implementation.

It is the views that provide guarantees about encoding, and there are multiple views, there is not a single "canonical" view.

rossberg commented 1 year ago

The point is that stringref is an intermediary: languages can convert their internal string type into a stringref

The aforementioned languages cannot, because as we saw from the counter examples, they distinguish string values that stringref can't distinguish.

The only canonical choice of intermediary would be well-formed Unicode. But stringref relaxes to WTF, which specifically covers additional string values that JS and friends can represent, but not those that some other languages have. AFAICS, that is a language-biased choice, not a neutral one.

It only requires that when concatenating isolated surrogates, they are combined into a surrogate pair.

Which does not fit the semantics of concatenation demanded by Python3 or the other languages mentioned. So again that is a biased semantics.

because that would restrict the internal implementation.

That's not correct. An implementation is always free to do whatever it wants internally as long as its observable behaviour matches the specification. And the prescribed observable behaviour arguably is most directly specified in terms of WTF-16 code units.

wingo commented 1 year ago

Really good catch, @Maxdamantus !

So if we were to define an abstract string data type in WebAssembly, we have a few options. We could say:

Strings are sequences of USVs
Strings are sequences of USVs and isolated surrogates
Strings are sequences of codepoints
Strings are sequences of something more general than that

I made a thinko in e.g. https://wingolog.org/archives/2023/10/19/requiem-for-a-stringref, considering that (3) was the same as (2). My mistake.

Backing up a bit then, is there even a reasonable choice to make here? Maybe not -- ultimately this is a question of taste and consensus and tradeoffs -- but I think that if we were forced to choose, (2) is the right option. In my humble opinion, emphasis on the H part here—happy to be corrected—Python's choice of (3) is actually a bug which could (should?) be fixed over time, and which does not need to be accommodated by a standard Wasm string data type.

My reasoning is that surrogate codepoints proceed from two places only (happy to be corrected here too): a. Treating strings as buffers of raw data b. Splitting valid UTF-16 at unfortunate places

Surrogates don't come across the wire, because they are not part of a Unicode encoding scheme: they can't be expressed in UTF-16 or UTF-8. So you can only create them in your program, and there are only these two ways that it happens.

In the case of Python 3, neither case applies: there are no UTF-16 surrogate pairs to split, and you wouldn't use a Python 3 string as raw data storage.

rossberg commented 1 year ago

@wingo, as far as I am concerned, isolated surrogates are a bug in practice no matter how they appear in a text string. The difference between (2) and (3) is that the former covers the class of bugs arising in languages of the 16-bit Unicode era, while the latter covers the cases arising in languages from the 32-bit Unicode era.

Both come from exposing the respective underlying encoding and giving read/write access to raw code units. Neither is more likely to be "fixed" in future versions of the corresponding languages. Moreover, down the road, new languages are more likely to be in camp (3) than in camp (2). (Hopefully, they'll all be in camp (1) instead.)

From my perspective, there are two defendable design choices for a "universal" Unicode string semantics (ignoring (4), which could be raw 32-bit values):

a. Support the intersection of possible values – that would be (1), maximising utility for well-defined language interop. b. Support the union of possible values – that would be (3), maximising utility for language-internal usage.

In contrast, (2) is neither here nor there, and combines the semantical disadvantages of both.

Maxdamantus commented 1 year ago

I've been wondering about another approach which seems simpler interfacially, not sure if it's already been discussed:

Just add three string types, string8ref, string16ref, string32ref and some operations for converting between them. Conversion between the different string types will replace errors with replacement characters. Each language simply picks whatever its natural string type is.

I feel like it should be feasible for an implementation to essentially create zero-copy stringXref values that are similar to views in this proposal, where eg, a string8ref might be backed by UTF-16 data from some JS string or a string16ref might be backed by UTF-8 data from some Rust string (and this view could be exposed even in JS as a regular string value).

If something like this were added, I would expect for example that converting from an ill-formed string8ref to string16ref then back to string8ref would create a well-formed string8ref, equivalent to the original string but with the errors replaced. It can still be backed by the original invalid bytes, so no copy is required.

Also note that for strings containing only ASCII data, this conversion is likely to be a no-op, since ASCII strings use the same code units and same length in UTF-8, UTF-16 and UTF-32.

The main implementation issues I see with this approach for non-ASCII strings are computing length and indexing of the zero-copy converted strings, since those would potentially be linear operations. This might be solvable by simply computing length and/or breadcrumbs lazily. I think it would still make sense to have string8.advance, string16.advance, string32.advance which would allow iteration through the code units without forcing computation of length or breadcrumbs (note that string32ref is useful if you want to iterate through code points, like stringview_iter in the current proposal).

Also regarding Python's use of code points as the unit of strings, I don't think it's possible for them to fix this[0], especially given that they make use of these ill-formed strings for encoding ill-formed UTF-8 data (note that this is similar in principle to how WTF-8 uses ill-formed UTF-8 to encode ill-formed UTF-16 data).

[0] At least not without a major Python version. I've long considered the change between Python 2 and Python 3 to be a mistake (I think this opinion is becoming less and less controversial), and PEP-0383/surrogateescape is one piece of evidence for this (my understanding is that they added this some time after Python 3 was released when they realised that actually you often do just need to pass the bytes along in a string, but this raises the question of why one ill-formed string is better than the other ill-formed string).

wingo commented 1 year ago

Just a note on https://peps.python.org/pep-0383/: this strategy appears to only ever produce high surrogate codepoints. These strings are representable in option (2).

Maxdamantus commented 1 year ago

Surrogates don't come across the wire, because they are not part of a Unicode encoding scheme: they can't be expressed in UTF-16 or UTF-8. So you can only create them in your program, and there are only these two ways that it happens.

Note that they can feasibly come across the wire in the form of JSON. JSON.stringify("\uD800") nowadays produces a well-formed string (can be converted to UTF-8) that represents a UTF-16 surrogate.