dcodeIO opened this issue 3 years ago
There are solutions to the `.length` problem. As I mentioned before, complete compatibility with WTF-16 from a UTF-8 perspective would require performance overhead. That's unavoidable, but for some it might be preferable.

Use compat types for `String` and the like, and offer libraries to use the always-available `String8` and `String16` types provided by AssemblyScript. Make `String` map to `String16Compat` or something so that it works correctly across all encodings, and so that under UTF-8 mode `.length` takes a performance hit, as it will have to calculate the string length on the fly. Or cache the length under the hood if you'd like.
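A minimal sketch of how such a compat type could behave, written in plain TypeScript for illustration; `String16Compat` and its internals are hypothetical, not an existing AssemblyScript API:

```ts
// Sketch only: a hypothetical UTF-8-backed string whose `.length` still reports
// UTF-16 code units, computed lazily by scanning and then cached.
class String16Compat {
  private utf16Length = -1; // -1 means "not computed yet"

  constructor(private readonly bytes: Uint8Array) {}

  // Reports length in UTF-16 code units, like JS String#length (assumes valid UTF-8).
  get length(): number {
    if (this.utf16Length < 0) {
      let units = 0;
      for (let i = 0; i < this.bytes.length; ) {
        const b = this.bytes[i];
        if (b < 0x80)      { i += 1; units += 1; } // ASCII
        else if (b < 0xe0) { i += 2; units += 1; } // 2-byte sequence
        else if (b < 0xf0) { i += 3; units += 1; } // 3-byte sequence (BMP)
        else               { i += 4; units += 2; } // 4-byte sequence -> surrogate pair
      }
      this.utf16Length = units; // cache so the scan only happens once
    }
    return this.utf16Length;
  }
}

// Usage: "a😀" is 5 bytes in UTF-8 but 3 UTF-16 code units.
const s = new String16Compat(new TextEncoder().encode("a😀"));
console.log(s.length); // 3
```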
I'm sure there are a handful of similar great ideas. But it is impossible to get the best of both worlds at once. If that is a hard requirement, then AssemblyScript is doomed. A tradeoff is necessary.
My main criticism of this discussion at the meta level, however, is that it seems the committee is unwilling to make a tradeoff here where one is clearly necessary. There is no incredible, best solution here. All of the problems of interop between the two have been cleanly laid out. It's just politics at this point, about which approach is the least obtrusive.
Beyond what I've said I don't think I can contribute much else.
The upcoming Interface Types should bring another abstraction that will allow data to be passed between wasm modules built in different languages without having to consider their specific circumstances, including the format of the strings used in those languages.
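For intuition only (this is not the Interface Types mechanism itself): crossing a boundary between a UTF-8 module and JS's WTF-16 strings today implies a copy plus re-encode, roughly like the hypothetical helpers below. Interface Types would standardize where such adapters run, so each language keeps its own internal representation.

```ts
// Illustration only: crossing an encoding boundary means copying + re-encoding.

// "Lifting" UTF-8 bytes coming out of a wasm module's memory into a JS string.
function liftString(memory: Uint8Array, ptr: number, byteLength: number): string {
  return new TextDecoder("utf-8").decode(memory.subarray(ptr, ptr + byteLength));
}

// "Lowering" a JS string into UTF-8 bytes for a module that expects UTF-8.
function lowerString(s: string): Uint8Array {
  return new TextEncoder().encode(s); // lone surrogates are replaced with U+FFFD
}

const bytes = lowerString("héllo");        // 6 bytes: 'é' takes 2 bytes in UTF-8
const round = liftString(bytes, 0, bytes.length);
console.log(round === "héllo");            // true
```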
I thought AssemblyScript was not going to implement the Interface Types proposal?
It's THE problem all kinds of programming languages face. My suggestions, aiming for the "least war" (in other words, the most functionality provided):

- Like `i32`/`u32`, use `utf8`/`utf16`/`utf32` explicitly. There's no way to hide it, and I don't think it's good to hide something here (especially with dynamic encoding like Python) when we're on "assembly".
- `.nthCodePoint(n)` and `.nthGraphemeCluster(n)`, with names implying that they scan.
- `.startsWith`, `.endsWith`, `.split`, `.replace` etc. accepting only arguments of their own kind.
- `[last]indexOf` returning a char code index.
- All three of `fromCodePoint` and `fromCharCode`, with `checkUtf16` obviously checking UTF-16 validity when the argument is passed in, since JavaScript allows invalid values. (If no checking is desired, replace all "utf" with "wtf".)
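A rough sketch of the explicit-encoding idea above, in TypeScript; `StringUTF8` and its methods are illustrative names, not proposed AssemblyScript APIs:

```ts
// Sketch only: an explicitly UTF-8 string type whose positional accessors make
// the O(n) scan visible in the API instead of hiding it.
class StringUTF8 {
  constructor(private readonly bytes: Uint8Array) {}

  // Returns the n-th code point by scanning from the start (O(n)), or -1 if
  // the string has fewer than n + 1 code points (assumes valid UTF-8).
  nthCodePoint(n: number): number {
    let i = 0;
    let seen = 0;
    while (i < this.bytes.length) {
      const b = this.bytes[i];
      let cp: number;
      let size: number;
      if (b < 0x80)      { cp = b;        size = 1; }
      else if (b < 0xe0) { cp = b & 0x1f; size = 2; }
      else if (b < 0xf0) { cp = b & 0x0f; size = 3; }
      else               { cp = b & 0x07; size = 4; }
      for (let k = 1; k < size; k++) cp = (cp << 6) | (this.bytes[i + k] & 0x3f);
      if (seen === n) return cp;
      seen++;
      i += size;
    }
    return -1;
  }

  // Operations like startsWith accept only arguments of the same kind,
  // so no implicit re-encoding can sneak in.
  startsWith(other: StringUTF8): boolean {
    if (other.bytes.length > this.bytes.length) return false;
    for (let i = 0; i < other.bytes.length; i++) {
      if (this.bytes[i] !== other.bytes[i]) return false;
    }
    return true;
  }
}

const s = new StringUTF8(new TextEncoder().encode("héllo"));
console.log(s.nthCodePoint(1).toString(16)); // "e9" (é is U+00E9)
console.log(s.startsWith(new StringUTF8(new TextEncoder().encode("hé")))); // true
```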
Given the amount of foregoing heated discussion on the topic, especially in the context of Interface Types and GC, I am not getting the impression that anything of relevance is going to change, and we should start planning for the worst case.
So I have been thinking again about what the implications of switching AssemblyScript's string encoding to W/UTF-8 by default would be, and it doesn't look too bad if all one really wants is to get rid of WTF-16, is willing to break certain string APIs, wants it to be efficient after the breakage, and otherwise is not judging.
Implications:

- `String#charCodeAt` would be removed in favor of `String#codePointAt`
- `String#charAt` would be removed, or changed to retain codepoint boundaries if viable
- `String#[]` would be removed, or changed to retain codepoint boundaries if viable, or to return bytes numerically, like in C
- `String#length` would return the length of the string in bytes
- `String.fromCharCode` would be removed, or deprecated and polyfilled
- `String#split` with an empty string separator would split at codepoints

Means we'd essentially jump the UTF-8 train to have
.className = "abc
)Note that the proposition of switching AS to UTF-8 is different from most of what has been discussed more recently, even though it has always been lingering in the background. Hasn't been a real topic so far due to the implied breakage with JS, but unlike the alternatives it can be made efficient when willing to break with JS. Certainly, the support-most-of-TypeScript folks may disagree as it picks a definite site.
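For concreteness, a small illustration of how the semantics listed above would differ from today's, assuming a UTF-8-backed default; none of this is current AssemblyScript behavior:

```ts
// Sketch only: what the listed implications would mean for "héllo😀".
// Today (WTF-16 semantics, like JS):
//   "héllo😀".length        == 7      (UTF-16 code units; 😀 is a surrogate pair)
//   "héllo😀".charCodeAt(5) == 0xd83d (a lone surrogate half)
// Under a UTF-8 default as described above, length would report bytes,
// and codePointAt would replace charCodeAt.

const bytes = new TextEncoder().encode("héllo😀");
console.log(bytes.length);          // 10 — 'é' is 2 bytes, 😀 is 4 bytes in UTF-8

// Splitting at code points rather than code units keeps 😀 intact:
console.log([..."héllo😀"].length); // 6 code points
```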
If anything, however, we should switch the entire default and make a hard cut because
Thoughts?