Switch AssemblyScript to UTF-8 by default?

dcodeIO commented 3 years ago

Given the amount of foregoing heated discussions on the topic, especially in context of Interface Types and GC, I am not getting the impression that anything of relevance is going to change, and we should start planning for the worst case.

So I have been thinking about what would be the implications of switching AssemblyScript's string encoding to W/UTF-8 by default again, and that doesn't look too bad if all one really wants is to get rid of WTF-16, is willing to break certain string APIs, wants it to be efficient after the breakage and otherwise is not judging.

Implications:

- String#charCodeAt would be removed in favor of String#codePointAt
- String#charAt would be removed, or changed to retain codepoint boundaries if viable
- String#[] would be removed, or changed to retain codepoint boundaries if viable, or to return numeric bytes like C
- String#length would return the length of the string in bytes
- String.fromCharCode would be removed, or deprecated and polyfilled
- String#split with an empty string separator would split at codepoints
- Ill-formed Unicode would be rejected
  - if it can be done efficiently
  - if not, we'd have to think about WTF-8 instead
- Anything returning a character offset before would return a byte offset after:
  - Most String APIs would "just work" with byte offsets instead of character offsets as well
  - Mileage may vary if one uses string APIs with constant (incremented) offsets, as that would not map well anymore
  - Example: The compiler's tokenizer would need to skip codepoints instead of += 1
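
To make the byte-offset implications concrete, here is a minimal TypeScript sketch, not AssemblyScript's actual runtime. The helper names `encodeUtf8`, `sequenceLength` and `countCodePoints` are hypothetical, and the sketch assumes lone surrogates are replaced with U+FFFD rather than rejected. It shows that a byte length differs from a UTF-16 length, and that a scanner has to advance by the width of each sequence instead of `+= 1`.

```typescript
// Hypothetical helper: encode a JS (UTF-16) string as UTF-8 bytes.
// Assumption: unpaired surrogates become U+FFFD; rejecting them is the stricter option.
function encodeUtf8(s: string): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    let cp = s.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++; // consumed the low surrogate as well
      }
    }
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd; // unpaired surrogate
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return new Uint8Array(out);
}

// Width in bytes of the UTF-8 sequence that starts with this lead byte.
// Assumes well-formed input, so the lead byte is never a continuation byte.
function sequenceLength(lead: number): number {
  if (lead < 0x80) return 1;
  if ((lead & 0xe0) === 0xc0) return 2;
  if ((lead & 0xf0) === 0xe0) return 3;
  return 4;
}

// A tokenizer-style scan: advance by whole code points, not by one unit.
function countCodePoints(bytes: Uint8Array): number {
  let count = 0;
  for (let i = 0; i < bytes.length; i += sequenceLength(bytes[i])) count++;
  return count;
}
```

With this sketch, a string made of "h", "é", "llo" and one emoji has 7 UTF-16 code units, 10 UTF-8 bytes and 6 code points, which is the kind of mismatch existing offset arithmetic would run into.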

Means we'd essentially jump on the UTF-8 train to have

- efficient calls to WASI APIs
- efficient calls to DOM APIs typically accessed with 7-bit ASCII strings (think .className = "abc")
- the same problem as everyone else where 7-bit ASCII is not enough
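
As an illustration of why the 7-bit ASCII case is the cheap one, the following TypeScript sketch widens ASCII-only UTF-8 bytes to a JS string one byte per code unit, without any real decoding. The names `isAscii` and `asciiToJsString` are hypothetical and do not correspond to an existing AssemblyScript or host API; a real boundary would still need a full UTF-8 decoder for the non-ASCII case.

```typescript
// Returns true if every byte is 7-bit ASCII (high bit clear).
function isAscii(bytes: Uint8Array): boolean {
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] & 0x80) return false;
  }
  return true;
}

// Hypothetical boundary helper: ASCII bytes map 1:1 to UTF-16 code units.
// Returns null for non-ASCII input, where a full UTF-8 decoder would be needed.
function asciiToJsString(bytes: Uint8Array): string | null {
  if (!isAscii(bytes)) return null;
  let s = "";
  for (let i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]);
  return s;
}
```

For example, the bytes 0x61 0x62 0x63 come back as "abc", while the two bytes of "é" (0xC3 0xA9) fall through to the slower path.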

Note that the proposition of switching AS to UTF-8 is different from most of what has been discussed more recently, even though it has always been lingering in the background. It hasn't been a real topic so far due to the implied breakage with JS, but unlike the alternatives it can be made efficient when willing to break with JS. Certainly, the support-most-of-TypeScript folks may disagree as it picks a definite side.

If anything, however, we should switch the entire default and make a hard cut because

- maintaining two string implementations, and ensuring that all APIs work with both, is not exactly realistic
- maintaining a single string implementation understanding both encodings would yield the problem we are trying to avoid in Wasm, but in AS
- any of the above would often double code size of string operations

Thoughts?

Qix- commented 3 years ago

There are solutions to the .length problem. As I mentioned before, complete compatibility with WTF-16 from a UTF-8 perspective would require performance overhead. That's unavoidable, but for some it might be preferable.

Use compat types for String and the like, and offer libraries to use always-available String8 and String16 types provided by AssemblyScript.

Make String map to String16Compat or something so that it works correctly across all encodings, and so that under utf-8 mode .length takes a performance hit as it will have to calculate the string length on the fly. Or cache the length under the hood if you'd like.
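
A rough TypeScript sketch of that tradeoff, under the assumption that the backing store is well-formed UTF-8: `.length` keeps its UTF-16 meaning, costs one linear scan on first use, and is cached afterwards. The class name `Utf8BackedString` is hypothetical and only stands in for the String16Compat idea; it is not an existing AssemblyScript type.

```typescript
// Hypothetical compat wrapper: UTF-8 storage, UTF-16-compatible .length.
class Utf8BackedString {
  private bytes: Uint8Array;
  private cachedLength: number = -1; // -1 means "not computed yet"

  constructor(bytes: Uint8Array) {
    this.bytes = bytes;
  }

  // Cheap: the byte length is known from the backing store.
  get byteLength(): number {
    return this.bytes.length;
  }

  // Costly on first access: scans all bytes, then serves the cached value.
  get length(): number {
    if (this.cachedLength < 0) {
      let units = 0;
      for (let i = 0; i < this.bytes.length; i++) {
        const b = this.bytes[i];
        if ((b & 0xc0) === 0x80) continue; // continuation byte, no new unit
        // 4-byte sequences encode code points above U+FFFF: a surrogate pair.
        units += b >= 0xf0 ? 2 : 1;
      }
      this.cachedLength = units;
    }
    return this.cachedLength;
  }
}
```

The cache only helps if strings are immutable, which is assumed here; indexing by UTF-16 offset would still need a scan or an additional index structure.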

I'm sure there are a handful of similar great ideas. But it is impossible to get both worlds at once. If that is a hard requirement, then AssemblyScript is doomed. A tradeoff is necessary.

My main criticism of this discussion at the meta level, however, is that it seems the committee is unwilling to make a tradeoff here where one is clearly necessary. There is no incredible, best solution here. All of the problems of interop between the two have been cleanly laid out. It's just politics at this point, about which approach is the least obtrusive.

Beyond what I've said I don't think I can contribute much else.

protheory8 commented 3 years ago

The upcoming Interface Types should bring another abstraction that will allow to interop data between wasm modules built in different languages without having to consider their specific circumstances, including the format of strings used in those languages.

I thought AssemblyScript was not going to implement the Interface Types proposal?

farteryhr commented 1 year ago

it's THE problem all kinds of programming languages face.

my suggestions, aiming for "least war" (in other words, most functionality provided):

- of course, only generate code when actually used.
- different string types (let's follow i32 u32 and use utf8 utf16 utf32. there's no way to hide. i don't think it's good to hide something here (especially with dynamic encoding like python) when we're on "assembly".)
- bracket indexing (by char code)
- all kinds of iterators (all of: by code point (both variations: yielding the string, yielding the code point value) and by grapheme cluster, all also yielding codePointIndex) (btw #2254)
- .nthCodePoint(n) .nthGraphemeCluster(n), implying it scans.
- .startsWith .endsWith .split .replace etc. accepting only arguments of their own kind, [last]indexOf that returns char code index, all three fromCodePoint and fromCharCode with check
- conversion from/to utf32 and Uint32Array in one line (because the majority won't like to bother using more memory when they get to this step)
- interop with javascript in utf16 obviously, check utf16 validity when an argument is passed in, since javascript allows invalid ones.

(if no checking is desired, replace all utf with wtf)
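
The validity check mentioned for the JavaScript boundary could look roughly like the TypeScript sketch below. The function name `isWellFormedUtf16` is hypothetical; it only looks for unpaired surrogates, which is what distinguishes well-formed UTF-16 from WTF-16. Newer JavaScript engines also provide `String.prototype.isWellFormed` for the same purpose.

```typescript
// Returns true if the string contains no unpaired surrogate code units.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate without a preceding high surrogate.
      return false;
    }
  }
  return true;
}
```

Skipping this check and passing such strings through unchanged corresponds to the "wtf" variants mentioned above.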

AssemblyScript / assemblyscript

Switch AssemblyScript to UTF-8 by default? #1653