dcodeIO opened this issue 3 years ago
You know my opinion. I agree with everything except the point that it should be the default, and suggest creating a new string class using UTF-8 with a stricter / more limited API under a shorter name, `str`. Also add language-level conversions for `str` <-> `string`.
The problem I see with maintaining two string classes is that we are essentially making WebAssembly's problems ours, as part of the language. Of course, providing a separate `String8` class is not necessarily bad in itself, but any attempt to integrate it with the rest of the standard library is most likely doomed from the start. To make proper use of that second class with external APIs, for instance, without inflicting the burden of re-encoding before the string's final destination, we would have to provide two abstractions around WASI, or the DOM, or whatnot. Plus, everyone building a library for AS would have to do the same if they wanted to support both as input without re-encoding where possible. As far as I am concerned, that's not feasible.
Apologies if anything I say is irrelevant to this, or if I've misunderstood anything.
https://github.com/WebAssembly/interface-types/issues/13#issuecomment-769289863 https://github.com/WebAssembly/interface-types/issues/13#issuecomment-770271864
According to the interface-types proposal itself, UTF-16 will not be at any performance disadvantage compared to UTF-8 or even ASCII.
Btw, WTF-8 shouldn't be a viable target, as out of range Unicode characters would trap, which UTF-8 does not enforce.
`String#[]` would be removed, or changed to retain codepoint boundaries if viable, or to return bytes numeric like C

This could simply be made an O(n) operation by iterating. Unfortunate, but it would keep TS/JS compatibility. (Also, unindexable strings lose a whole lot of value.)
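For illustration, a minimal sketch (a hypothetical helper, not existing stdlib code) of what such an O(n) index-by-iteration could look like over a UTF-8 buffer:

```ts
// Hypothetical helper: returns the n-th code point of a UTF-8 buffer, or -1 if
// n is out of range. O(n), because it has to walk from the start of the string.
function codePointAtIndex(bytes: Uint8Array, n: number): number {
  for (let i = 0, seen = 0; i < bytes.length; seen++) {
    const b = bytes[i];
    // Sequence length from the lead byte (assumes well-formed UTF-8).
    const len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    if (seen === n) {
      if (len === 1) return b;
      let cp = b & (0x7F >> len);
      for (let k = 1; k < len; k++) cp = (cp << 6) | (bytes[i + k] & 0x3F);
      return cp;
    }
    i += len;
  }
  return -1;
}
```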
The performance disadvantage I worry about is when chaining modules. Say for instance you have a webapp written in Rust, a form validator written in AssemblyScript, a RegEx engine written in C++ and a database written in Java. Then a user inputs "something".
webapp: utf8
to validator: utf8->wtf16
to regex engine: wtf16->utf8
capture groups to validator: utf8->wtf16 (for each)
to database: utf8->wtf16
store to file: wtf16->utf8
Now, that's a somewhat contrived example, but it helps to illustrate the general problem we'll eventually be dealing with. At any point, in any module graph, there may be an AssemblyScript module in between two Rust modules or vice-versa, just passing a string forward while not even necessarily doing something with it, producing two unnecessary re-encodings on the string's way through. My expectation is that this will only become worse once Interface Types become available, as they make small modules more viable, while currently most binaries are monoliths because Interface Types aren't available yet.
To fully solve the problems of true polyglot WebAssembly, we'd either need
@dcodeIO just passing a string forward while not even necessarily doing something with it, producing two unnecessary re-encodings on the string's way through.
That is not the case: because of laziness and adapter fusing, it is possible to optimize away the encoding. This technique is already used extensively in Haskell and Rust, so it should be possible to apply it to Wasm engines as well.
Note that because interface adapters work lazily, they are basically like a JavaScript Iterator, so they do not require intermediate memory.
Encoding might still be necessary if the intermediate modules use different encodings and need to actually use the strings... but that is not a Wasm-specific problem... that is just unavoidable in general unless all the modules use the same encoding.
Interface types smooth things over by providing a universal encoding-agnostic exchange format (`list char`) which is lazily computed and fused (to avoid intermediate copies).
Note that the most recent interface types proposal is far more general than your Universal Strings proposal: since `list char` can handle any encoding efficiently without intermediate copies, there are no hardcoded encodings.
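As a rough analogy in JS terms (not how engines implement adapter fusing, just to illustrate "lazy, no intermediate buffer"): an adapter can be a generator that converts while streaming, so chaining a producer and a consumer never materializes the intermediate representation.

```ts
// Analogy only: lazily turn a stream of code points into UTF-16 code units.
// A consumer pulling from this never needs a full intermediate copy.
function* utf16Units(codePoints: Iterable<number>): Generator<number> {
  for (const cp of codePoints) {
    if (cp <= 0xFFFF) {
      yield cp;
    } else {
      // Split supplementary code points into a surrogate pair.
      const v = cp - 0x10000;
      yield 0xD800 | (v >> 10);
      yield 0xDC00 | (v & 0x3FF);
    }
  }
}
```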
I do agree that the current state of IT is based on a clever idea, I'd just not agree that it leads to significant improvements in practice. For instance, as soon as a string goes into a linear memory based language, the string goes into memory more often than not. So, unfortunately, whenever that happens, or the string is otherwise "used", there's re-encoding, leading us to
that is just unavoidable in general unless all the modules use the same encoding
as the common case (and in the future even more so, because the proposed canon encoding effectively biases towards linear memory languages), which is unfortunate for AssemblyScript and is what this issue is about. Let me know if you'd agree, but I'd appreciate it if we could discuss the more general challenges within the other issue.
For instance, as soon as a string goes into a linear memory based language, the string goes into memory more often than not.
Yes, but unfortunately there's no way around that. I am unaware of any technical solution other than "languages must agree on the same representation format if they want to pass strings without any performance penalty". So there are only two solutions:
You force all languages to use a Wasm-specific GC string type (which means languages cannot use their own native string types, which means you can't use native string APIs or libraries, since they won't work with the Wasm-specific type).
Or you accept that converting to/from a language's native format incurs some minor cost (though that cost is mitigated and minimized in various ways).
This problem isn't just limited to strings, it also happens when passing records/tuples/lists between languages. It's unfeasible to require all languages to have the same string/list/tuple/record representation, since Wasm is not an abstract runtime like the JVM, and it does not try to be like the JVM.
Also don't forget that Wasm is intentionally designed around a shared-nothing model, where modules do not share their memory with each other, so each module has a separate linear memory. So a certain amount of copying is unavoidable with that model (even with `lift_canon`).
GC fixes that copying problem, but only for languages which choose to use GC (and which have the same GC representation). So non-GC languages will still need to use the adapter system forever.
And so interface types gives a solution which works for linear memory now, and which can be upgraded to GC later, and which supports interop between GC and non-GC languages.
the proposed canon encoding effectively biases towards linear memory languages
Why do you say that? Yes interface types are intentionally designed so they can be used without a GC (and thus work well for "linear memory languages"), but interface types also work with GC, and GC strings can be added in the future, they are orthogonal features.
If anything, it's biased in the opposite way: GC languages will be able to use GC strings (and thus can pass strings with no performance penalty, since they're just passing pointers), whereas "linear memory languages" are always forced to copy the full string into linear memory (even with `lift_canon`/`lower_canon`), and thus are at an inherent disadvantage.
GC has the same problem if languages use different encodings, or somehow different object layouts, or otherwise require side-effects when constructing a string, yeah. The story just continues forever, unless we start to at least think about alternatives. Universal Strings is my attempt to come up with an idea that minimizes the problem by introducing an encoding-agnostic universal string type. It has a graceful fallback for non-GC languages (plus a graceful outlook to define a `wasm::string` for example in Rust), and solves some problems IT does not. Of course it has its own problems, but that's the alternative we would have to at least talk about if converging all languages to UTF-8 isn't an option. I don't see anyone discussing that, so here we are.
Other than that, I cannot express how tired I am of repeating myself while achieving exactly nothing. Probably not an unexpected outcome.
Universal Strings is my attempt to come up with an idea that minimizes the problem by introducing an encoding-agnostic universal string type.
That proposal is basically just GC strings, which are great, but that doesn't replace interface types, and GC strings can be layered on top of interface types.
Since AssemblyScript is supposed to mirror TypeScript, I think it makes sense for it to use GC strings (when they become available). And in the meantime I don't think it matters too much what encoding is used.
The only benefit of choosing UTF-8 is that it can be slightly more efficient (if `lift_canon` and `lower_canon` are standardized), but that advantage is temporary and will go away after GC strings exist.
You could also choose to make AssemblyScript encoding-agnostic (like Python), so the encoding would be an internal implementation detail which is hidden from the user. The idea is that (from the user's perspective) strings would just be arrays of code points.
if converging all languages to UTF-8 isn't an option.
Converging to a single GC string type (e.g. Universal Strings) is also not an option. You're never going to convince all languages to converge toward a single string representation. Languages will have different string representations, that's just a fact we have to accept, a JVM-style universal representation is not feasible.
This isn't about Rust, or C++, it's about all languages. There are dozens of ways of handling strings, and each language makes a different choice:
This problem is a lot trickier than choosing between UTF-8 and UTF-16 (or choosing between linear memory and GC). Which is why interface types are designed to be shared-nothing, to support all encodings, and to support different representations. This has some performance costs, but it's the only feasible way to support different languages.
Perhaps we can agree on the observation that Interface Types is a beautiful solution to a subset of the problem, while Universal Strings may or may not be the basis of yours truly's messy compromise. On that premise, we are all right, and can use this issue to discuss AS switching to UTF-8 by default again. What do you think? Is AS considering joining the UTF-8 bandwagon, for example for efficient interop with Rust? Is that something you'd welcome?
@dcodeIO What do you think? Is AS considering joining the UTF-8 bandwagon, for example for efficient interop with Rust? Is that something you'd welcome?
Like I said, I don't think it matters too much. I think the performance gains from `lift_canon`/`lower_canon` are minimal, and any gains will go away after GC strings exist.
So I think the choice should be primarily determined by what makes the most sense in terms of functionality / user API. So I see three choices:
List-of-codepoints (e.g. Python) is good because the user doesn't need to worry about encodings, instead the language will automatically choose UTF-8, UTF-16, or UTF-32 depending on the situation. And the user also doesn't need to worry about codepoint boundaries, and lookup is O(1).
The downside of list-of-codepoints is that the language must automatically re-encode the string when adding non-ASCII characters. And it must use 2 bits to keep track of which encoding each string is using, so there's a minor performance cost on every string access.
UTF-8 is good if you want to have a single explicit encoding. The benefit is that UTF-8 is very efficient in space and time, it is the optimal encoding, and you will always be able to use `lift_canon`/`lower_canon`. The downside is that now the user must think about codepoint boundaries, so lookup is no longer O(1).
WTF-16 is good if you want to maintain strict compatibility with JS, but WTF-16 is less efficient than UTF-8 and it doesn't have O(1) lookup.
However, it is possible to use `lift_canon`/`lower_canon` with WTF-16, as long as the string is ASCII-only. This can be done as an optimization which is not noticeable to the user.
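A sketch of the kind of check this implies (the names are made up; the "fast path" stands in for whatever `lift_canon`-style mechanism the adapter provides):

```ts
// Hypothetical boundary code: only take the cheap byte-narrowing path when the
// WTF-16 string is ASCII-only; otherwise fall back to real transcoding.
function isAsciiOnly(units: Uint16Array): boolean {
  for (let i = 0; i < units.length; i++) {
    if (units[i] >= 0x80) return false;
  }
  return true;
}

function exportAsUtf8(units: Uint16Array): Uint8Array {
  if (isAsciiOnly(units)) {
    // Fast path: each 16-bit unit narrows to a single identical byte.
    const out = new Uint8Array(units.length);
    for (let i = 0; i < units.length; i++) out[i] = units[i];
    return out;
  }
  // Slow path: full WTF-16 -> UTF-8 transcoding (TextEncoder used here for
  // brevity; it replaces unpaired surrogates with U+FFFD).
  return new TextEncoder().encode(String.fromCharCode(...units));
}
```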
For AssemblyScript I would personally choose list-of-codepoints, since it is the nicest user API, it has O(1) lookup, and it can still use `lift_canon`/`lower_canon` for ASCII strings. I think if JavaScript were designed today it would use list-of-codepoints instead of WTF-16.
Note that the authors of UTF-8 Everywhere disagree with me; they believe that UTF-8 is superior to list-of-codepoints.
They make some good points, and I agree with them that grapheme clusters are really what most people should be using, and so I think Rust's approach (UTF-8 + an iterator of grapheme clusters) is ideal.
The difference is whether you care a lot about O(1) indexing/length. If you're willing to give up O(1) indexing, then UTF-8 is indeed faster (especially for web servers).
However, JavaScript programmers expect O(1) indexing of code points, which is why I suggested list-of-codepoints as a compromise.
Here are some anecdotes about UTF-8 usage:
So there could be advantages to UTF-8 if you want to create an HTTP server or UWP application with AssemblyScript. And there could be performance gains to using UTF-8 when using WASI.
This is largely a choice between "good user API" (list-of-codepoints), "maximum performance" (UTF-8), and "strict compatibility and familiarity" (WTF-16).
All three options have merits, so it's up to the AssemblyScript community to decide which of the three options it wants.
Perhaps a couple additional observations, given AssemblyScript's goals:
Performance: With multiple string representations, and abstracting over them, we may end up in a situation where re-encoding happens implicitly within the language's standard library. Depending on the use case that may be more or less often. More complex solutions may also be slower in general due to doing more work. As such, picking the most reasonable default and sticking with it has certain advantages.
Binary size: One difference from non-Wasm Java, Python and JS, for example, is that AS typically ships its string functions (and the related standard library) with each Wasm binary, while in traditional VMs these functions are installed on the machine. This may become less significant as binaries grow bigger, but it doesn't necessarily make the "just a simple and small Wasm module" use case less important. The more complex our string handling becomes, the more these small modules will tank.
The list-of-codepoints idea may be related to both since, if I understood correctly, it for example requires maintaining a separate lookup structure of "breadcrumbs" (I once read something over at Swift about it; essentially caching the stride one already knows) to minimize the impact of random access. Of course I agree that this would be nice API-wise, as it would essentially just "fix" WTF-16 by eliminating surrogate pairs (incl. making the underlying encoding irrelevant), but then there remains the question of how far we can or want to go for a nice API without re-targeting to a different audience. Given that, Rust's approach may be a more reasonable choice. As would be staying with WTF-16 for JS compat.
That's the unfortunate reality we are looking at. AS is essentially torn between the approaches for one reason or another, and much of the heated discussion so far originated in Wasm not (yet) helping. Some sort of universal strings, not necessarily mine, could for example move all the heavy machinery to the VM instead of shipping it with every binary, to give just one example. I still hope we can get something like that eventually (for all languages), it's just that GC strings aren't currently moving in that direction either, which equally motivates this issue.
Perhaps an improvement to universal strings could be exactly that. Make them an encoding-independent list of codepoints, with an encoding-specific Swift-like breadcrumb mechanism to enable fast random access. Hmm...
Interestingly, AS could then opt to just fix WTF-16 and expose strings as a list of codepoints, with some performance cost but without any binary size cost. Perhaps a nice compromise.
The list-of-codepoints idea may be related to both since, if I understood correctly, it for example requires maintaining a separate lookup structure of "breadcrumbs" (I once read something over at Swift about it; essentially caching the stride one already knows) to minimize the impact of random access.
No, it should not require a separate lookup structure. It just requires strings to have 2 bits at the beginning which mark whether it is ASCII, UTF-16, or UTF-32.
When concatenating strings together, it always uses the largest encoding (e.g. concatenating an ASCII string with a UTF-32 string returns a UTF-32 string).
When indexing into a string, it looks at the 2 bits and multiplies the index by the code-unit width they indicate (1, 2, or 4 bytes). Then it just does a byte lookup at that offset, so this is all very fast.
When sending strings to other modules, it checks the 2 bits. If it's ASCII, it uses `lift_canon`; if it's UTF-16 or UTF-32, it does the usual `list.lift` encoding.
When receiving strings from other modules, it determines whether the string is ASCII/UTF-16/UTF-32 and prepends the 2 bits accordingly.
The only cost to this is looking up the 2 bits when indexing, and also some code bloat because it supports 3 encodings.
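A sketch of the representation being described (the layout and field names are assumptions, not what Python or AS actually does internally):

```ts
// Assumed layout: a code-unit width tag (the "2 bits") plus a byte buffer of
// fixed-width code points with no surrogate pairs.
interface CpString {
  width: 1 | 2 | 4;   // ASCII/Latin-1 = 1, <= U+FFFF = 2, otherwise = 4
  bytes: Uint8Array;  // length is a multiple of `width`
}

// O(1) indexing: multiply the index by the width and read that many bytes.
function codePointAt(s: CpString, index: number): number {
  const base = index * s.width;
  let cp = 0;
  for (let i = 0; i < s.width; i++) {
    cp |= s.bytes[base + i] << (8 * i); // little-endian read
  }
  return cp;
}

// Length in code points is O(1) as well.
function codePointLength(s: CpString): number {
  return s.bytes.length / s.width;
}
```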
This may become less significant as binaries grow bigger, but it doesn't necessarily make the "just a simple and small Wasm module" use case less important. The more complex our string handling becomes, the more these small modules will tank.
Yes, code size is a good reason to pick a single encoding rather than using list-of-codepoints.
Some sort of universal strings, not necessarily mine, could for example move all the heavy machinery to the VM instead of shipping it with every binary, to give just one example.
From reading your posts in that other thread, it sounds like you think a universal string type can work for 90+% of languages, but I don't think that's true at all. Instead it would only work for a small handful of languages. Like I said before, this issue is so much bigger than just UTF-8 vs UTF-16.
The JVM and CLR are great examples of that: languages which are ported to them end up with significant restrictions and incompatibilities (e.g. Clojure doesn't have tail calls, Jython uses WTF-16, Kotlin is stuck with using WTF-16 because that's what the JVM uses, Eta uses WTF-16 because that's what the JVM uses, etc.)
It's simply impossible to support languages as diverse as Scheme, Haskell, OCaml, Python, Ruby, Clojure, JS, Rust, Go, C++, etc. with a single string type. No matter what representation you choose, you will be excluding dozens and dozens of languages from the Wasm ecosystem, which is not acceptable.
So let's step back a bit... the goal of having One String To Rule Them All is not going to happen, languages will never agree on a single string type.
So there are three (achievable) goals:
`list char`, thus fully integrating JS with interface types.
So you can already use `anyref` to use JS strings inside Wasm and efficiently pass them around. This can be done today without any changes to Wasm.
And if another language happens to also use JS strings (e.g. Java or Clojure or C#) then you can efficiently pass the JS string to them as well.
And after interface types are ready, you can then integrate the `anyref` into interface types, and now you get full interop with all languages (regardless of whether they are using JS strings or not). If that other language uses JS strings it will cheaply pass it over, and if it doesn't then it will copy (which is unavoidable).
Given the above situation, are there any remaining problems for AssemblyScript?
(Also note that it is possible for the host or WASI to provide helper functions for encoding/decoding/validation, which will remove a lot of code bloat, so I don't think code bloat is a good justification for choosing a particular string representation)
I am primarily interested in evaluating what would be best for AssemblyScript at this point (i.e. switch to UTF-8), under the assumption that for instance a universal string type is not going to happen (i.e. GC strings are likely going to have the same problem). The relevant challenges to your points are, in short:
Everything else is merely me thinking out loud about improving upon the options we have. The "won't happen" and "can't be done" sentiment is still worrying me, though.
The encoder/decoder/validator aren't the biggest size problems, btw; these are single functions. Adapting the entire standard library to do something sophisticated, like exposing a WTF-16 API with UTF-8 encoding underneath with breadcrumbs, or supporting both WTF-16 and UTF-8 with a single string class, is the size problem. Either just WTF-16 or just UTF-8, with IT as-is, is not a size problem. But list-of-codepoints, to for example "fix" WTF-16 at the occasion (also in the context of the other thread I linked), needs machinery that I'd wish Wasm could help with. Of course we can also just call it a day and say it won't happen anyway, if you prefer, but then we are really only back to square one, picking either because nobody gives a damn.
Note that WTF-16 and list-of-codepoints can use `lift_canon` and `lower_canon` for ASCII strings, so in practice the amount of re-encoding might be a lot less than you think.
Also note that if your goal is to have efficient interop with the browser, then your only option is to use JS strings, and so the entire UTF-8 vs UTF-16 argument becomes moot.
And if your goal is to have efficient interop with the maximum number of languages, JS strings are also probably your best option, since other languages also want efficient interop with the browser, so JS strings are the de facto universal "wasm string".
Breadcrumbs are not necessary for list-of-codepoints or optimizing WTF-16. I explained in my previous post how to implement them.
Breadcrumbs are used in Swift exclusively for calculating indexes when calling Objective-C APIs, that's not really relevant for AssemblyScript.
Here are a couple good articles on Swift strings:
https://swift.org/blog/utf8-string/
https://forums.swift.org/t/piercing-the-string-veil/21700
Adapting the entire standard library to do something sophisticated, like exposing a WTF-16 API with UTF-8 encoding underneath with breadcrumbs, or supporting both WTF-16 and UTF-8 with a single string class, is the size problem.
I don't think it would be a lot of code, the implementation strategy I explained is quite simple, certainly much simpler than breadcrumbs.
Note that WTF-16 and list-of-codepoints can use lift_canon and lower_canon for ASCII strings
What do you mean? ASCII strings are not two-way compatible with WTF-16, unless stored with the compaction trick, but then it isn't WTF-16 anymore but ASCII. Perhaps this misunderstanding explains...
Breadcrumbs [...] I explained in my previous post how to implement them
I don't think you explained breadcrumbs, at all. I see some basic bit flags there, explaining something that is kinda simple and obvious (multiple encodings need a flag somewhere and can use compaction, surprise). Perhaps actually read this section of the document you linked.
Breadcrumbs are used in Swift exclusively for calculating indexes when calling Objective-C APIs, that's not really relevant for AssemblyScript
They are used to abstract the difference in underlying encoding away, i.e. if the API is designed to do indexed access in a way that does not map linearly to the underlying encoding. This would be the case in AS if it kept a JS-compatible API, but would use UTF-8 (or multiple string representations, with one of them being UTF-8) under the hood. How is that not relevant to this issue, which is about UTF-8? What are you even talking about?
I don't think it would be a lot of code
You don't think? Have you implemented the AS standard library, or what, to come to this conclusion, with its underlying libc-like helpers and related APIs (like those in arrays) assuming a specific string encoding, etc.?
Are you even interested in improving AssemblyScript? Or what is this? Why are you here? I just don't understand. I think I remember again why I became so frustrated about this.
I appreciate the creative problem solving in the comment you linked me to, but I don't have an immediate reaction to it. "fix WTF-16 and expose strings as a list of codepoints" makes it sound like it's not that different from what AS currently has, but I assume I'm lacking context.
My thumbs up was just for the idea that WASI can provide helper functions. That's a specific idea that I can understand without a lot of context here, and is something I generally think is a good idea, independently of other concerns here.
I see, thank you for explaining. Providing helpers sounds good to me as well as these will be useful independently. Sorry for the noise.
"fix WTF-16 and expose strings as a list of codepoints" makes it sound like it's not that different from what AS currently has
And yeah, that's why I came to think about it as well, as it fixes two problems people are interested in at once, with little API impact on AS. The underlying encoding could then be UTF-8, but it would benefit significantly from Wasm providing that breadcrumbs mechanism for where non-trivial indexed access is performed, so the language doesn't have to ship it itself where strings are exposed as a list of codepoints (or UTF-16 for that matter), and inter-language communication can take advantage of it as well. Sounds like a neat compromise to me, but it requires some sort of specified string type to enable the mechanism. So that'd be where Wasm could come into play. Not saying we should do it, but I am interested in thinking about it more and hearing your thoughts.
@dcodeIO What do you mean? ASCII strings are not two-way compatible with WTF-16, unless stored with the compaction trick, but then it isn't WTF-16 anymore but ASCII.
If the string uses 1 bit to mark whether it's ASCII or WTF-16, then it is trivial (and fast) to use `lift_canon` and `lower_canon` for ASCII-only strings. This is a very common technique in many languages (including Swift), and it is separate from breadcrumbs.
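A sketch of that 1-bit variant (under the assumption that the flag is computed once when the string is created, so the boundary check is a single branch):

```ts
// Hypothetical representation: WTF-16 code units plus an ASCII-only flag that
// is computed once at construction time.
class TaggedString {
  readonly units: Uint16Array;
  readonly asciiOnly: boolean;

  constructor(units: Uint16Array) {
    this.units = units;
    let ascii = true;
    for (let i = 0; i < units.length; i++) {
      if (units[i] >= 0x80) { ascii = false; break; }
    }
    this.asciiOnly = ascii;
  }

  // A concatenation is ASCII-only iff both inputs are; a real implementation
  // could simply AND the two flags instead of re-scanning in the constructor.
  static concat(a: TaggedString, b: TaggedString): TaggedString {
    const units = new Uint16Array(a.units.length + b.units.length);
    units.set(a.units, 0);
    units.set(b.units, a.units.length);
    return new TaggedString(units);
  }
}

// At a module boundary, deciding between a canonical fast path and a real
// re-encode is then just a check of `s.asciiOnly`.
```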
I don't think you explained breadcrumbs, at all.
Yes, exactly, I was explaining how Python implements list-of-codepoints, which does not require breadcrumbs at all. It is just 2 bits marking which encoding the string is using (ASCII, UTF-16, or UTF-32).
The downside is that it needs some extra code to handle the multiple encodings (but that shouldn't be too complicated, since each encoding is fixed length, so it just needs to multiply the index, it doesn't need breadcrumbs).
Perhaps actually read this section of the document you linked.
Yes... I did read it (and I even explained what breadcrumbs are used for). You seem to have poor reading comprehension, since you've multiple times misunderstood what multiple other people are saying. I think your frustration comes mostly from your own misunderstandings and ignorance.
This would be the case in AS if it kept a JS-compatible API, but would use UTF-8 (or multiple string representations, with one of them being UTF-8) under the hood.
I don't recommend that, I think that would be worse performance than simply using WTF-16, which is why I assumed that wasn't an option.
Are you even interested in improving AssemblyScript? Or what is this? Why are you here? I just don't understand. I think I remember again why I became so frustrated about this.
All I have ever done is given factually correct information and explained various possible options and tradeoffs. I am not a part of the AssemblyScript community, so it's not my place to make decisions, I can only give information. It seems you aren't interested in that, so this will be my last message to you.
The downside is that it needs some extra code to handle the multiple encodings (but that shouldn't be too complicated, since each encoding is fixed length, so it just needs to multiply the index, it doesn't need breadcrumbs).
It's actually very complicated, even for compact strings (when we have only a latin1 bit). With three different encodings we would have to handle 8 different implementations for binary methods like string concat, compare, replace, includes / indexOf, etc. It will increase the runtime significantly.
@MaxGraey You would not need whole separate implementations, as you couldn't know, at compile-time, if a string is in one encoding or another. More likely, there would just be a simple switch case or match inside of every method, slowing down runtime performance.
Dealing with `ASCII.concat(UTF8, WTF16, Latin1)` won't be fun for anyone involved.
How about, just leaving the current WTF-16, and taking the alleged performance drop incurred by using interface type strings?
@MaxGraey With three different encodings we would have to handle 8 different implementations for binary methods like string concat, compare, replace, includes / indexOf, etc. It will increase the runtime significantly.
No, because the key point is that all string operations are normalized to the highest encoding, and the only difference between the encodings is the number of bytes (there are no surrogate pairs).
That means you only need 3 conversions (ASCII -> UTF-16, ASCII -> UTF-32, and UTF-16 -> UTF-32), and re-encoding is very easy, since it is simply adding zero bytes as padding (re-encoding doesn't need to worry about surrogate pairs), and it means that most of the code can be shared between the different encodings.
For example, here is some pseudo-code for a concat implementation:
function concat(x, y) {
// Get the byte size of the strings
const xSize = x[0];
const ySize = y[0];
const size = Math.max(xSize, ySize);
// Preallocate the precise amount of memory
const output = malloc(1 + (x.length * size) + (y.length * size));
// The output byte size is always the biggest of the input byte sizes
output[0] = size;
if (xSize >= ySize) {
// Standard string concatenation
memcpy(x, output, 1, x.length * size);
} else {
// Same as memcpy except it adds padding to each byte
memcpyPad(x, output, 1, x.length * size, xSize, size);
}
if (ySize >= xSize) {
// Standard string concatenation
memcpy(y, output, 1 + (x.length * size), y.length * size);
} else {
// Same as memcpy except it adds padding to each byte
memcpyPad(y, output, 1 + (x.length * size), y.length * size, ySize, size);
}
return output;
}
You only need that one implementation, not a separate implementation per encoding.
Compared to a standard string concat, the above implementation only has a tiny bit of extra overhead: looking up the byte size, doing `Math.max` on the byte sizes, doing two branches, and padding the bytes if the encodings are not the same.
The same principle applies to other functions like `split`, etc.
function split(str, needle) {
// Get the byte size of the strings
const sSize = str[0];
const nSize = needle[0];
// If this is true then we know the needle can never match, so we return early.
if (nSize > sSize) {
return str;
}
// If the needle encoding is different then we have to add byte padding.
// This can be avoided with some tricky logic, but it's probably not worth it.
if (nSize < sSize) {
needle = reencode(needle, sSize);
}
// Do standard split algorithm...
// Since both strings are guaranteed to be the same encoding, and we know the byte
// size, we can use the same algorithm for all encodings.
return realSplit(str, needle, sSize);
}
As you can see, `split` only requires a handful of instructions at the beginning, and everything afterwards is the same for all encodings.
If you decide to reject list-of-codepoints, that's fine, but reject it for the right reasons, not based on misinformation.
That means you only need 3 conversions (ASCII -> UTF-16, ASCII -> UTF-32, and UTF-16 -> UTF-32), and re-encoding is very easy, since it is simply adding zero bytes as padding (re-encoding doesn't need to worry about surrogate pairs), and it means that most of the code can be shared between the different encodings.
I don't see any benefit at all in the ASCII -> UTF-32 and UTF-16 -> UTF-32 conversions. UTF-32 is only used in Python; nobody else uses this encoding at all because of its huge memory consumption.
Your implementation doesn't make sense at all. You can't just memcpy with some padding. That's like converting an `Array<u8>` into an `Array<u32>` by just copying bytes. No, this will not work! You need lowering or widening for each codepoint. For UTF-16 (if it's not UCS-2) you also need decoding. That's why you need special scenarios for `concat(utf8, utf16)` and `concat(utf16, utf8)`. And concat is one of the simplest string methods.
Just look at the Unicode comparison in Python: https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L11228 And the special method for Latin1-only strings: https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L11387
Languages like Java and Python have a fat runtime that they don't include with every program, so they can afford anything. In addition, Java / C# also have JIT compilation, which allows them to speed the whole thing up by eliminating most of the extra runtime checks. We cannot afford to complicate the runtime this way.
Maybe there is insight to be had: https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/
They use a hashmap to convert from the internal u32 for pandas interop, and note in the blog: "Hashing data is not free, but counterintuitively it can be faster in addition to being vastly more memory efficient...with many instances of the same string values." Which may be the case not only with categorical data columns, but also with templating strings (HTML and CSS). Although the Arrow format might even be reasonable for the AS runtime. I'd guess the workloads for Arrow-formatted data and Wasm processing are probably pretty similar.
@MaxGraey I don't see any benefit at all in the ASCII -> UTF-32 and UTF-16 -> UTF-32 conversions.
UTF-32 is critical to making it work; it is not optional.
Nobody else uses this encoding at all because of its huge memory consumption.
UTF-32 is necessary in order to ensure that the string is a fixed length encoding (and to ensure that algorithms are efficient since they only need to pad bytes).
Your implementation doesn't make sense at all. You can't just memcpy with some padding. No, this will not work! You need lowering or widening for each codepoint. For UTF-16 (if it's not UCS-2) you also need decoding.
No... as I have explained many times, this system does not have surrogate pairs, that is the secret to why it is efficient and easy to implement. And that's also why it requires UTF-32 (for code points above U+FFFF).
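To make the difference concrete (just Unicode arithmetic, using U+1F600 as the example):

```ts
// U+1F600 (😀) lies above U+FFFF.
// In the list-of-codepoints scheme it is stored as one 32-bit unit:
const asUtf32 = [0x1F600];

// In WTF-16/UTF-16 the same code point needs a surrogate pair:
const v = 0x1F600 - 0x10000;      // 0x0F600
const high = 0xD800 | (v >> 10);  // 0xD83D
const low = 0xDC00 | (v & 0x3FF); // 0xDE00
const asUtf16 = [high, low];      // [0xD83D, 0xDE00]

// Because the fixed-width scheme never has pairs, widening a string really is
// just padding each unit with zero bytes, as described above.
```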
That's why you need special scenarios for concat(utf8, utf16) and concat(utf16, utf8).
No, you don't, my implementation handles both perfectly fine.
For example, here is Python's implementation of concat (you'll note it is very similar to the implementation I posted before, except mine is shorter, since I'm using `Math.max` and two loops, rather than branching + 1 loop):
It calls the `_PyUnicode_CONVERT_BYTES` macro, which just adds padding to the bytes:
Just look at the Unicode comparison in Python
They are handling each case individually with `switch` (presumably for performance), but it is easy to handle the cases with much less code:
function compare(x, y) {
const xSize = x[0];
const ySize = y[0];
  const minLength = Math.min(x.length, y.length);
  if (xSize === ySize) {
    // Same encoding: compare the code units directly, then fall through to the
    // length comparison below if the common prefix is equal.
    const result = memcmp(x, y, 1, 1, minLength);
    if (result !== 0) return result;
  } else {
for (let i = 0; i < minLength; ++i) {
const xChar = readChar(x, i, xSize);
const yChar = readChar(y, i, ySize);
if (xChar < yChar) {
return -1;
} else if (xChar > yChar) {
return 1;
}
}
}
if (x.length === y.length) {
return 0;
} else if (x.length < y.length) {
return -1;
} else {
return 1;
}
}
And the special method for Latin1-only strings
You are misreading the code. That isn't for Latin1 strings (since those are already handled by `unicode_compare`). The `PyUnicode_CompareWithASCIIString` function is for comparing with C strings, not Python strings.
It calls the _PyUnicode_CONVERT_BYTES macro, which just adds padding to the bytes:
It's not a copy! It's a conversion (lowering or widening per element) from `from_type` to `to_type`, with unrolling by a factor of 4. That's a different thing. That's why it's a macro, not just a simple memcpy. And this should be done for all pair combinations.
And that's also why it requires UTF-32 (for code points above U+FFFF).
You would need an extra bit to determine whether it's two U+FFFF units or a single U+{FFFFFFFF}. See how this is done in LEB128 and UTF-8: usually the MSB is reserved. So you would actually need to mask this bit with 0x7FFF, extract it, and check it. So it's actually no better than decoding UTF-16.
If you're suggesting to use fixed encodings instead of variable ones, it will be even worse, because we would have to check and re-encode to the wider char type after every string operation.
@MaxGraey It's a conversion (lowering or widening per element) from from_type to to_type with unrolling by a factor of 4.
Yes... it's adding a zero byte to the beginning of each code unit (i.e. padding).
The loop unrolling is just a micro-optimization, it's not necessary and it's not relevant to this discussion. And in any case the loop unrolling would be done in a function, so it wouldn't bloat up the code size.
Here is a single implementation which would work for all the encodings:
// Widens code units from `fromSize` bytes each to `toSize` bytes each while
// copying: the extra leading bytes of every unit are written as zero padding.
function memcpyPad(input, output, index, length, fromSize, toSize) {
const diff = toSize - fromSize;
while (index < length) {
const fromIndex = index * fromSize;
const toIndex = index * toSize;
for (let i = 0; i < diff; ++i) {
output[toIndex + i] = 0;
}
for (let i = 0; i < fromSize; ++i) {
output[toIndex + i + diff] = input[fromIndex + i];
}
++index;
}
}
That's why it's a macro, not just a simple memcpy.
Yes... because memcpy does not pad bytes, so it uses a loop, just like my implementation.
And this should be done for all pair combinations
No, it is not necessary, look at my implementation, it handles all possible combinations.
You would need an extra bit to determine whether it's two U+FFFF units or a single U+{FFFFFFFF}.
No... because the byte size is stored as the first byte of the string, as I explained, so it always knows the byte size.
See how this is done in LEB128 and UTF-8.
But this is not a UTF-8 encoding, it is list-of-codepoints.
So you would actually need to mask this bit with 0x7FFF, extract it, and check it. So it's actually no better than decoding UTF-16.
No, none of that is necessary.
If you're suggesting to use fixed encodings instead of variable ones, it will be even worse, because we would have to check and re-encode to the wider char type after every string operation.
Yes, as I have clearly stated multiple times, I am explaining how list-of-codepoints is implemented (which is a fixed length encoding, as I have stated multiple times).
And yes, it must re-encode one of the strings if the encodings are different and it is returning a new string. But in exchange for that you get lower memory usage, `lift_canon`/`lower_canon` support, O(1) indexing, and you get to keep the same JavaScript API.
And in practice the APIs which return new strings must loop over the string anyways (e.g. `split`), so it's easy to fuse the re-encoding into the loop, and re-encoding is just padding bytes, so it's very fast.
If you are willing to give up O(1) indexing (and break compatibility with JS), then pure UTF-8 is better than list-of-codepoints.
And if you want to have fast interop with the browser, then your only option is to use JS strings (with `anyref`).
Yes... it's adding a zero byte to the beginning of each code unit (i.e. padding).
No, it's called widening and narrowing conversions. Copying with padding is an absolutely different operation.
You're suggesting adding extra amortized time and space to all string methods in order to avoid O(N) time for string length and to get O(1) random index access. Are you sure it's faster in the general case? Because even if it is fast enough, it has a lot of disadvantages: huge memory footprint, huge code size footprint, huge maintenance overhead. And for some reason it seems to me that it will not be fast, it will be very mediocre, since this approach is only used in Python, which never had performance goals.
Are you sure it's faster in the general case?
Of course you would have to run benchmarks to be sure of the speed, but similar systems are used all over the place (including Java, and JS engines, which care about performance a lot).
And it's complicated further because of `lift_canon`/`lower_canon`, so you really would have to run benchmarks to get a clear picture of the speed.
huge memory footprint
How is it a huge memory footprint? Most strings are ASCII, so they use 1 byte, and anything less than U+FFFF uses 2 bytes (same as UTF-16), so in practice only emojis would cause an inflation to 4 bytes.
Of course there are pathological cases where it would use 4 bytes a lot (e.g. if you're processing emojis a lot), but in that case UTF-8 and UTF-16 will also use more memory (since they both use 4 bytes per emoji).
Rather than guessing, it's better to use benchmarks on real programs. The results are often not what you expect.
huge code size footprint
As I showed, the implementation only adds a fairly small amount of extra code to each function.
huge maintenance overhead
I agree it is more complex than using a single encoding, and that's a valid reason to prefer a single encoding.
You're suggesting adding extra amortized time and space to all string methods in order to avoid O(N) time for string length and to get O(1) random index access.
From a performance perspective, yes, I agree pure UTF-8 is generally better; however it breaks compatibility with JS (and JS programmers' expectations).
JS programmers are used to having fast indexing, and they're used to ignoring Unicode and assuming that each character is a code point. That assumption works for characters below U+FFFF and JS programmers rarely test for characters above that.
All of that would fall apart with UTF-8, now every API needs to be changed to be Unicode-aware, and JS programmers can no longer just process strings in a simple loop, instead they have to take into account the variable length of every character. In other words, this sort of thing would no longer work (despite being a very common idiom):
function foo(str) {
for (let i = 0; i < str.length; ++i) {
processCharacter(str[i]);
}
}
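For contrast, a sketch of what that idiom tends to become once strings are walked by code point rather than indexed by code unit (using the standard JS string iterator; `processCharacter` is the same placeholder as above):

```ts
// Code-point-aware version of the same loop: the string is consumed through
// its iterator instead of being indexed by integer positions.
function foo(str: string): void {
  for (const ch of str) { // iterates by code point, not by UTF-16 code unit
    processCharacter(ch);
  }
}
```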
Since AssemblyScript is specifically catering and marketing to JS programmers, that is a big problem.
So if you want to maintain the same API, and maintain JS programmer's expectations, I think list-of-codepoints is a good system that makes good tradeoffs (at the cost of a small amount of performance and code size).
But if instead you want to break from JS and make AssemblyScript into its own separate thing, with a different API, then yes I would recommend UTF-8.
UTF-32 is in most cases unnecessary. UTF-16 based languages like Java and C# use string compaction. It's pretty similar to your approach, but uses only one boolean flag to indicate a latin1 / UTF-16 string. Btw, in Java, string compaction mode is turned off by default and can be enabled via the `-XX:+CompactStrings` flag. That's because it is not always preferable, however it reduces GC overhead and memory footprint for latin1-only strings.
Unlike Compressed Strings, the new solution does not contain any String repacking, thus should be much more performant. In addition to significant memory footprint reduction, it should provide a performance gain when processing 1-byte Strings as there is much less data to be processed. Garbage Collection overhead will be reduced as well. Processing of 2-byte string does mean a minor performance hit because there is some additional logic for handling both cases for Strings. But overall, performance should be improved as 2-byte Strings should represent just a minority of all String instances. In case there are performance issues in situations, where the majority of Strings are 2-byte, the feature can always be disabled.
UTF-16 based languages like Java and C# use string compaction.
Yes, that is correct, but then you (and the users) have to deal with surrogate pairs. And if you're already adding in that complexity, you might as well fix the problem with surrogate pairs.
Btw, in Java, string compaction mode is turned off by default and can be enabled via the -XX:+CompactStrings flag.
No, since Java 9 it is on by default, because in general it improves the performance of strings.
https://www.baeldung.com/java-9-compact-string#1-difference-in-performance
https://openjdk.java.net/jeps/254
https://www.javagists.com/compact-strings-java-9
And of course it is always enabled in JS engines (also because it improves performance in general).
And of course it is always enabled in JS engines (also because it improves performance in general).
As I mentioned before, we can't do this in our case. Standalone VMs can use rope data structures for string concatenation, hashing, and many other clever tricks because they're not limited by size. We need to ship a small runtime.
Before I comment, I want to mention that I'm not as well versed in AS internals nor the WASM specification as you all are. Apologies if I'm misunderstanding something.
Throwing in a +1 for UTF-8 by default (or at least by configuration) due to conversions in other languages being trickier to do than in Javascript (which can be achieved, albeit messily, using `decodeURIComponent(escape(s))`). Converting for other environments (such as C/C++ within wasm3) means extra allocations even for the happy path.
While performance is critical to most people in most cases, and someone will have to pay for the encoding discrepancies between platforms (since Javascript will most likely always be UTF-16 specified), it makes more sense to me that if you're running WASM payloads in a native environment, you probably have a more performance-critical application than e.g. running in Electron, a browser, or some other web-based environment.
Thus, being able to avoid allocations in native environments would be more preferable than in the web, where such performance is indeed a concern but not as much relative to native environments - in my opinion, at least.
For context, I intend to have my WASM payload execute in both web and native environments; as such, I very much prefer my native environment to be the more performant of the two.
I realize this is subjective, but I'm here in this issue simply because of the headache of having to convert all of my string parameters between UTF-16 and UTF-8 manually, for all of the import bindings that work with strings.
Doing this conversion in the Javascript world instead would be less of a headache due to the syntactic and semantic sugars afforded by the Javascript language.
If nothing else, being able to specify that I want the strings to use UTF-8 at compile time would be a killer feature - unless I'm missing something obvious, this is not currently the case. I can see the ability to switch between the two depending on my application being very, very useful.
The main problem with UTF-8 strings is that it will be a completely different String API and DX. No `charAt` or `charCodeAt`, a different `.length`, etc. It will also require iterators, I guess. So UTF-8 strings increase the learning curve and the incompatibility with TS very significantly.
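For example, the same string gives three different answers depending on what `.length` is defined to mean (counts checked against standard JS semantics; the UTF-8 number is simply the encoded byte count):

```ts
const s = "héllo😀";

// WTF-16 / JS semantics: length counts 16-bit code units.
console.log(s.length);                           // 7 (😀 is a surrogate pair)

// Code points:
console.log([...s].length);                      // 6

// UTF-8: length would be in bytes.
console.log(new TextEncoder().encode(s).length); // 10 (é = 2 bytes, 😀 = 4 bytes)
```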
@MaxGraey are strings immutable under the hood? If so, this solves the `.length` problem if you're willing to use pascal string length encoding.
C Pseudo-code:
typedef uint64_t strlen_t;
// allocate
char *ptr = malloc(sizeof(strlen_t) + string_length + 1);
*(strlen_t *)ptr = string_length;
ptr += sizeof(strlen_t);
memcpy(ptr, src_string, string_length + 1); // remove +1 if src_string is NOT null terminated
ptr[string_length] = 0; // remove entirely if src_string IS null terminated
// free
free(ptr - sizeof(strlen_t));
// get length
strlen_t len = ((strlen_t *)ptr)[-1];
I would also argue that the `charAt` and `charCodeAt` tradeoffs (going from constant -> linear time) are agreeable as long as they're well documented. This is anecdotal, I understand, but I've not used either of those very often in the ~15 years I've been writing Javascript almost full time. I know use cases differ, but I am in the thought-camp that there are almost always better ways of doing something if you're using those methods.
For the most part, I use those to convert from character codes to ordinals for single characters at a time, e.g. with the index parameter almost always being 0. I can't speak for everyone, but I would imagine this is an OK tradeoff.
Another thing to consider, though I don't know if it's possible - if you go the configuration route, then have two different string types (or one for however many string types you want to be configurable) and change an alias for the main `string` type (instead of making it its own independent type). This would mean that, for each string function in the standard library, you'd have different overloads that would "just work" regardless of the configuration.
This might place more burden on library vendors but in the case of AS I think it makes a lot of sense - we already do this in other languages without much of a problem.
@MaxGraey are strings immutable under the hood
Yes, but `length` in UTF-16 bytes (which is what we currently use) and in UTF-8 bytes is always different. It may cause problems during porting. Also, it would need an extra length in code points (which has O(n) complexity), because `length` in JavaScript more or less represents that length as well, which makes it very useful.
This might place more burden on library vendors but in the case of AS I think it makes a lot of sense - we already do this in other languages without much of a problem.
Hmm. Which languages? All the UTF-16 based languages like Java, C# and Dart still use UTF-16 strings. Only Swift migrated to UTF-8, but it uses a special mechanism for compatibility with NSString and completely changed its string API.
Sure, though there are ways to encode this (both bytes and codepoints, using all-1-bits to signal "not yet calculated" and doing a lazy calculation, etc.). I understand the concern though - it's not cheap.
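A sketch of that lazy-calculation idea (the field names and the all-ones sentinel are just illustrative choices):

```ts
// Hypothetical string header: the byte length is always known, the code-point
// length is cached lazily, with all 1 bits meaning "not yet calculated".
const NOT_CALCULATED = 0xFFFFFFFF;

class LazyLengthString {
  readonly bytes: Uint8Array; // UTF-8 payload
  private cpLength: number = NOT_CALCULATED;

  constructor(bytes: Uint8Array) {
    this.bytes = bytes;
  }

  get lengthInBytes(): number {
    return this.bytes.length; // O(1)
  }

  get lengthInCodePoints(): number {
    if (this.cpLength === NOT_CALCULATED) {
      // O(n) walk, done at most once: count lead bytes (everything that is
      // not a 10xxxxxx continuation byte).
      let count = 0;
      for (let i = 0; i < this.bytes.length; i++) {
        if ((this.bytes[i] & 0xC0) !== 0x80) count++;
      }
      this.cpLength = count;
    }
    return this.cpLength;
  }
}
```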
The problem with u32, on the other hand, is indeed static string space consumption. Given that WASM payloads are already going to be larger than we'd all like, and that they're going to be on the wire much more often than other binary application payloads have historically been (installers, MSIs, etc.) then it's important to trim sizes where it matters. I think u32 works against that. I also second the sentiment of usefulness to developers - the only time I've personally used u32 is when writing unicode-aware parsers.
Which languages?
C (namely WinAPI, which has both `W` and `A` suffixes for most calls) and C++, which has implementations of all sorts of string libs (including `codecvt`), but `std::string` defaults to `std::basic_string<char>` (8-bit, though the encoding is not specified). I don't know of a single *nix API that requires "wide strings" or the like, and it's generally assumed that UTF-8 is accepted in all cases where strings are accepted (with certain, well-documented edge cases, such as forbidden characters in pathnames).
Zig is a new and upcoming one, and all strings are UTF-8 encoded.
Rust uses `Vec<u16>` for "16 bit encoding" (not sure what that means exactly, I'd have to research more), but by default `std::str` is UTF-8, as specified by the standard.
Go also uses UTF-8 encoded strings.
Keep in mind NSString came before a time of utf-8 proliferation, and was largely unchanged from the original NextStep days if I understand correctly. This would explain its encoding not matching newer languages', along with Java's and Python's.
As far as other VM languages go, Lua uses 8-bit under the hood (though its encoding is unspecified and generally considered to be ASCII, it supports UTF-8 storage - the `string` library is not UTF-8 aware, I believe).
EDIT: honorable mention PHP, where strings have no specified encoding and thus literals are encoded depending on the file encoding (lol). I know it seems like "lol PHP who needs that" but I can guarantee you there will be a PECL module that exposes a WebAssembly engine at some point. Admittedly, those users will have the conversion done for them under the hood, so I wouldn't worry about them as much.
Sorry, I meant: do you know of languages which had UTF-16 strings in the past and switched to UTF-8 without problems, other than Swift?
Languages which were already designed with UTF-8 and latin1 in mind (like C/C++) are not good examples for us.
Oh, no. Which is why I suggest a compilation option, defaulted to whatever it is now (utf-16). :)
This shifts the migration scope to the individual developer, not the entire community - meaning, if someone wants to change the string encoding from UTF-16 to UTF-8, it only affects their project, not all of the users of AssemblyScript. Of course it's something they will have to bear the cost of migrating, but at least it's a conscious decision.
And for the rest of the users who are starting today or next week or whatever, they can choose this from the start. I agree with both sides of the issue and can see a lot of valid usecases for both. Why not give the developer the option? As long as it's well-documented that changing the encoding away from the default, you change the underlying data structures passed around, that's fine.
To better detail what a compilation flag would look like:

- `.charCodeAt(n)` is O(1) if `--encoding=utf-16` or `utf-32`, O(n) if `--encoding=utf-8`.
- … `string == string16` or `string == string8`, etc. However, if the interface between all two (or three) string types is the same (just with different complexities), then it shouldn't make any difference to 99% of library authors, I would think.
- `--expose-encoding` would expose a runtime function that could be called to return a single number (`8`, `16` or `32`) that indicates the encoding used at compile time. This is in the same vein as other `--expose-X` functions that AS already provides.

It's a fair bit of work, but if the problem of encoding is a consistent issue, this could be one approach.
EDIT: Also, if you ask me, this would be a really impressive and attractive feature to have for those evaluating AS.
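If something like the proposed `--expose-encoding` existed (it does not today; the builtin below and its name are purely hypothetical), library code might branch on it roughly like this:

```ts
// Hypothetical builtin exposed by the flag: returns 8, 16 or 32.
declare function encoding(): number;

// Example: computing a byte length portably across the configured encodings,
// assuming `length` counts code units of whatever encoding was chosen.
function byteLength(s: string): number {
  switch (encoding()) {
    case 8:  return s.length;     // UTF-8: length is already in bytes
    case 16: return s.length * 2; // (W)UTF-16 code units are 2 bytes each
    default: return s.length * 4; // UTF-32 code points are 4 bytes each
  }
}
```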
Languages which were already designed with UTF-8 and latin1 in mind (like C/C++) are not good examples for us.
This ignores those demographics, though. I think they should be taken into consideration - after all, WASM is a format specifically designed to be run in different environments as a sort of machine code "lingua franca". The fact it's web-first only means that it has inherent compatibility where other VMs do not. The fact is, WASM will be utilized from those languages and as such the DX should be considered just as you're considering the DX in Java, Swift, C#, etc.
Just my opinion, however. I think that ultimately, string-heavy languages or environments are going to benefit more from a fixed-width encoding such as UTF-16, whereas more imperative/binary/network environments are going to benefit more from UTF-8. This ultimately comes down to the application. I understand if AS wants to be opinionated in this regard, and it's certainly not going to deter me personally from using AS, but it's still something to be considered from my point of view.
Oh, no. Which is why I suggest a compilation option, defaulted to whatever it is now (utf-16). :)
This shifts the migration scope to the individual developer, not the entire community - meaning, if someone wants to change the string encoding from UTF-16 to UTF-8, it only affects their project, not all of the users of AssemblyScript. Of course it's something they will have to bear the cost of migrating, but at least it's a conscious decision.
Right now, only the end application developer compiles their application (including all the libraries it imports). This is like using Webpack or Rollup to make a final bundle that you run in your web app. This means the user has a single `.wasm` file that they load in their app.
This implies that the app developer makes the compiler argument choices. Now suppose the app dev chooses to compile with UTF-16 (or perhaps WTF-16).
Now assume that app dev installed a library A whose author wrote code expecting UTF-8, and app dev also installed some other library B whose author wrote their library assuming UTF-16.
Note, the library authors do not compile their code, they simply ship TypeScript source code on npmjs.com, and their source code simply uses `string` in places where type annotations are valid.
Considering that the app dev compiled the whole app using the option `--encoding=utf-16` (or `--encoding=wtf-16`), what happens if the app dev imports library A then passes a `string` generated by library A to library B? And vice versa? What about a `string` from library A to a function made by the app dev?
Will that work seamlessly the way you are imagining it?
what happens if the app dev imports library A then passes a string generated by library A to library B
In which case would this happen where the compiled code is not all compiled under the same compilation session?
If you're interfacing two separate WASM payloads, it makes sense that you're going to have to know the datatypes and formats between them. We do that already with e.g. the Windows API and other system APIs, where (at least historically) we had to write stubs that converted Windows' `wchar_t` encoding to UTF-8 (application) and back when working with string data.
This is no different.
As far as I understand, there is no way to determine the underlying byte structure from the Javascript/Typescript end, and the API functions don't make that distinction either (except for, of course, time complexity).
I guess I don't see the "library A -> library B" problem since everything is compiled at once.
If you mean e.g. some sort of intermediate file, such as a COFF object file equivalent, it makes sense that two different ABIs cannot be sensibly compiled together. This is also the case in the C world. If I compile a source file with `__stdcall` and another one tries to call those functions with `__cdecl` or the like, they're not going to be happy when the program segfaults or worse.
When entering the realm of compiled languages, these are things programmers have to care about - there's really no way around it. The environment in which you run your WASM will always matter, just like the environment you run any other system code matters.
When entering the realm of compiled languages, these are things programmers have to care about - there's really no way around it. The environment in which you run your WASM will always matter, just like the environment you run any other system code matters.
One of the most important goals of WebAssembly is to abstract from the CPU (and it really is). The main goal of WASI is to abstract from the operating system and it really is, although this only happens for standalone runtimes for now, browsers do not yet support WASI and require polyfill or work directly with the browser API. The upcoming Interface Types should bring another abstraction that will allow to interop data between wasm modules built in different languages without having to consider their specific circumstances, including the format of strings used in those languages. This is the power and philosophy of the WebAssembly ecosystem.
But this isn't in regards to CPU specifics, this is string encoding. Calling conventions have little to do with CPUs too, which was the only analogy I mentioned.
What you're asking for is a magic way to have both worlds, which isn't possible. The next best thing is giving the developer the power to do what they need to do to solve their problems.
I'm reading between the lines here a bit, but it sounds like a "magic" solution is being sought, which I personally think is the absolute wrong approach.
AssemblyScript is not as simple and easy as a Typescript runtime. It looks similar but in reality you're working with a slightly lower level set of tools and concepts. I think it's fine to allow users to specify which string encoding implementation they want to use during compilation. Understanding string encodings and the tradeoffs therein is a common part of software engineering, after all. If it's a gap in an individual's knowledge, it'll only benefit them to understand the difference anyway, and it won't limit the usefulness of the tool for those that already do.
I think it's fine to allow users to specify which string encoding implementation they want to use during compilation
The same thing over and over again =) You cannot make strings with the same interface for UTF-8 and UTF-16 / WTF-16. They will be two absolutely different APIs; even a thing like `str.length` will produce two different values for UTF-8 and UTF-16 strings. Not to mention the fact that it's going to be quite hard to maintain (because you will have to debug and test both modes). This has been discussed many times here, on Twitter, and in related threads.
Given the amount of foregoing heated discussions on the topic, especially in context of Interface Types and GC, I am not getting the impression that anything of relevance is going to change, and we should start planning for the worst case.
So I have been thinking about what would be the implications of switching AssemblyScript's string encoding to W/UTF-8 by default again, and that doesn't look too bad if all one really wants is to get rid of WTF-16, is willing to break certain string APIs, wants it to be efficient after the breakage and otherwise is not judging.
Implications:

- `String#charCodeAt` would be removed in favor of `String#codePointAt`
- `String#charAt` would be removed, or changed to retain codepoint boundaries if viable
- `String#[]` would be removed, or changed to retain codepoint boundaries if viable, or to return bytes numeric like C
- `String#length` would return the length of the string in bytes
- `String.fromCharCode` would be removed, or deprecated and polyfilled
- `String#split` with an empty string separator would split at codepoints

Means we'd essentially jump the UTF-8 train to have … (`.className = "abc"`)

Note that the proposition of switching AS to UTF-8 is different from most of what has been discussed more recently, even though it has always been lingering in the background. It hasn't been a real topic so far due to the implied breakage with JS, but unlike the alternatives it can be made efficient when willing to break with JS. Certainly, the support-most-of-TypeScript folks may disagree, as it picks a definite side.
If anything, however, we should switch the entire default and make a hard cut because
Thoughts?