sunfishcode opened this issue 4 years ago
I agree. One way to avoid the breaking change is to deprecate the current string API with a message outlining the move to String.JS and a link to documentation explaining why.
Also, what is the preferred way of getting the proper length in JS?
In JavaScript, it's surprising that "🤦🏼‍♂️".length == 7
Honestly, this is not only a problem of JavaScript, but of Rust, Swift, Java, and all other languages which measure string length in code units or code points instead of graphemes. Such languages expose a separate API for iterating over graphemes. For example, Rust:
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    for g in "नमस्ते्".graphemes(true) { // Hindi
        println!("grapheme - {}", g);
    }
}
JavaScript can do the same with Intl.Segmenter (which is supported only in Chrome Canary and Firefox Nightly for now):

const segmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });
for (const { segment: g } of segmenter.segment("नमस्ते्")) {
  console.log("grapheme: ", g);
}
I agree. One way to avoid the breaking change is to deprecate the current string API with a message outlining the move to String.JS and a link to documentation explaining why.
I don't think we should rework the existing String or expose something new that is fully Unicode-aware (a Str type, for example). That would blow up the runtime significantly. See the comment by the author of swiftwasm, who is thinking of switching back to the weaker UTF-16 variant of strings because the Swift runtime is up to 5-6 MB just for hello world.

Also, in most cases we don't really care about fully Unicode-aware strings. If you really need that, you can include something like Intl.Segmenter.
By the way, no language supports all the special cases of Unicode case mapping, just the most popular ones like Greek final sigma (which is context-dependent): https://github.com/rust-lang/rust/issues/26035

We also support this: https://github.com/AssemblyScript/assemblyscript/pull/1113

But SpecialCasing.txt contains many more complex special cases which nobody handles (except languages which ship the full ICU library, which increases the size of their runtime significantly).
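For illustration, JS itself does implement that particular context-dependent rule in toLowerCase (a quick check, assuming a spec-conformant engine):

// Greek capital sigma lowercases to ς at the end of a word and to σ elsewhere;
// this is one of the context-dependent rules from SpecialCasing.txt.
console.log("ΟΔΥΣΣΕΥΣ".toLowerCase()); // "οδυσσευς" -- final ς, medial σ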
The Unicode standard is really a mess currently: it has a lot of redundant glyphs and a very large number of special cases which keep growing from version to version.
@sunfishcode
Similarly, function names borrowed from JS using the term "Char", such as fromCharCode, are confusing to programmers coming from non-JS languages, since code units aren't always characters.
Since ES6, JS also supports code points via Array.from, str.codePointAt, String.fromCodePoint, and string iterators. Other methods like toUpperCase are Unicode-aware as well, except split. A simple example:
function reverseStringNaive(str) {
  return str.split("").reverse().join("");
}

function reverseStringUnicodeAware(str) {
  return Array.from(str).reverse().join("");
  // or [...str].reverse().join("");
}
console.log(reverseStringNaive("foo 𝌆 bar"));
// will print:
//> rab �� oof
console.log(reverseStringUnicodeAware("foo 𝌆 bar"));
// will print:
//> rab 𝌆 oof
@willemneal

Also, what is the preferred way of getting the proper length in JS?

// calc length in O(1)
console.log("🤦🏼‍♂️".length); // 7
// calc length in O(N)
console.log([..."🤦🏼‍♂️"].length); // 5, which aligns with Rust's "🤦🏼‍♂️".chars().count()
Iterate by code point (Unicode-aware):
for (const c of "foo 𝌆 bar") {
  console.log(c);
}
So JS/TS/AS contain all the instruments for handling Unicode strings and even graphemes, but they aren't needed in 95% of cases.
Here's what the official ICU documentation says about library size:
ICU includes a standard library of data that is about 16 MB in size. Most of this consists of conversion tables and locale information. The data itself is normally placed into a single shared library.
Update: as of ICU 64, the standard data library is over 20 MB in size. We have introduced a new tool, the ICU Data Build Tool, to replace the makefiles explained below and give you more control over what goes into your ICU locale data file.
So some languages (like Swift) either use a reduced small_icu mode of ICU, or they use their own custom implementations (Go, Rust, C#, C++) which contain only the truly necessary Unicode property tables, compressed as static tries (three-stage indirect lookups). We follow this way as well.
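For a rough idea of how such staged lookups work, here is a sketch of a two-stage variant in TS (real tables typically add a third stage and are generated from the Unicode Character Database; the tiny tables below are made up):

// Sketch: INDEX maps the high bits of a code point to a block number,
// DATA holds one 256-entry block of property values per distinct block.
// Identical blocks are shared, which is what keeps the tables small.
const SHIFT = 8;
const INDEX = new Uint16Array(0x110000 >> SHIFT); // mostly zeros thanks to sharing
const DATA = new Uint8Array(2 << SHIFT);          // two illustrative blocks

function lookupProperty(codePoint: number): number {
  const block = INDEX[codePoint >> SHIFT];
  return DATA[(block << SHIFT) | (codePoint & ((1 << SHIFT) - 1))];
}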
Regarding string interfaces: what can we do to make them safer for users? That's a good question. I guess we could declare some methods as deprecated and not recommended for use (like charCodeAt, String.fromCharCode, split and others). TypeScript definition files support the @deprecated JSDoc tag now, and we have our own definitions for the stdlib, so this is possible. Secondly, we could ban these methods, with suggested alternatives, in --pedantic mode. @dcodeIO wdyt?
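A minimal sketch of what that could look like in the stdlib definition file (the signature and message here are illustrative, not the actual stdlib source):

// Hypothetical excerpt from a stdlib .d.ts; editors that understand
// the @deprecated JSDoc tag would then flag call sites.
declare class String {
  /** @deprecated Operates on UTF-16 code units; consider codePointAt. */
  charCodeAt(index: i32): i32;
}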
I am not sure about deprecating these, like if one wants to deal with a WTF-16 string, then these are still the way to go I think? For instance, we use them for small strings in the loader because that's faster than piping through a TextDecoder. Of course we can add more details to the documentation, with a neutral link for everyone interested to learn more?
I meant deprecating some string methods inside AssemblyScript in the future, once Array.from and iterators have landed.
I am not so sure about that, unless JS itself officially discourages their use. Otherwise we are just creating unnecessary barriers, aren't we?
How about deprecating them only in --pedantic mode?
Btw, Rust also requires a special counting API for code points:
println!("utf8 units (bytes): {}", "🤦🏼‍♂️".len()); // 17 -> 17 bytes
println!("code points: {}", "🤦🏼‍♂️".chars().count()); // 5

JS:

console.log("utf16 units (ushorts):", "🤦🏼‍♂️".length); // 7; "🤦🏼‍♂️".length * 2 -> 14 bytes
console.log("code points:", [..."🤦🏼‍♂️"].length); // 5
So Rust's len() and JS's .length just retrieve the encoded length in different unit spaces (like ft and m), which can be trivially converted to bytes, and this makes sense. If you need the number of code points, you should call a different method; if you want to count visible/writing units (graphemes), you have to deal with a totally different API.
Really not sure. If the broader ecosystem goes that route, perhaps a diagnostic message about a potential pitfall, but I don't see a reason for asc to spearhead something like that. At this point in time, I'd expect that most users would complain about it.
@willemneal
Also, what is the preferred way of getting the proper length in JS?
The preferred way is to ask a more specific question :wink:. Are you asking for the visual width, the number of user-perceived characters, the number of Unicode code points, or the number of bytes of storage used?
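For one string, those questions have different answers. A sketch (Intl.Segmenter availability varies, and visual width has no built-in answer at all):

const s = "🤦🏼‍♂️";
console.log(s.length);      // 7  -> UTF-16 code units
console.log(s.length * 2);  // 14 -> bytes of UTF-16 storage
console.log([...s].length); // 5  -> Unicode code points
const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...seg.segment(s)].length); // 1 -> user-perceived characters
// Visual width has no JS API; it depends on the font and renderer.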
@MaxGraey
I don't think we should rework the existing String or expose something new that is fully Unicode-aware
I agree. I'm not looking to add new functionality in this issue, but just to present the current functionality in a different way.
console.log(reverseStringNaive("foo 𝌆 bar"));
// will print:
//> rab �� oof
This is a good example -- it's hard to see how this behavior helps anyone, except via bug-for-bug compatibility with JS.
@dcodeIO
I am not so sure about that, unless JS itself officially discourages their use. Otherwise we are just creating unnecessary barriers, aren't we?
I agree; deprecation feels too strong here. In particular, for functions like split, it is possible to use them correctly, and there's currently no simple alternative.
Renaming split to String.JS.split would mean it remains available, and if in the future someone wanted to add a new split to AssemblyScript which did respect code-point boundaries, there'd be an obvious place in the namespace for it. And that seems like a door worth keeping open -- such a thing would still be familiar to JS programmers, it just wouldn't have surprising edge-case behavior.
We could do even better: we could actually fix String#split and probably other methods which don't handle surrogate pairs. We already do similar fixes for array.sort() with the default comparator and a couple of other methods. It's legitimate because we don't care about legacy code compatibility. All these methods would still behave identically to JS/TS up to the 0x10FFFF code point, and only s.charCodeAt(i) and String.fromCharCode(...) would keep the old behaviour.
But your fix is someone else's bug. The Array#sort problem is a little different in that its behavior is just odd and doesn't do anything useful, but String#split is something people got used to and might expect to function exactly that way. What if we provide non-standard but safe alternatives instead, like String#splitCodePoints, String#lengthCodePoints?
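Hypothetical shapes for those helpers, sketched as plain functions (the names come from the suggestion above; the exact semantics are assumptions):

// lengthCodePoints counts code points; splitCodePoints is assumed to
// split into code points rather than code units, so surrogate pairs
// stay intact ("𝌆" stays one element, not two).
function lengthCodePoints(str: string): number {
  let n = 0;
  for (const _ of str) n++; // string iteration visits code points
  return n;
}

function splitCodePoints(str: string): string[] {
  return Array.from(str);
}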
but String#split is something people got used to and might expect to function exactly that way
If people use String#split, it means they don't care about surrogate pairs at all. Otherwise they're using something like:
let unsafe = 'Emoji 🤖'.substr(0,7); // Emoji �
let safe = [...'Emoji 🤖'].slice(0,7).join(''); // Emoji 🤖
// or do this via regexps
or third-party libraries like https://github.com/mathiasbynens/esrever, https://www.npmjs.com/package/stringz, https://www.npmjs.com/package/unicode-substring, https://www.npmjs.com/package/unicode-string, https://www.npmjs.com/package/runes2, etc.
And I don't think anybody will deliberately exploit the existing surrogate-breaking behaviour for some purpose. Even if they did, it would be bad practice, like exploiting UB in C++ to speed something up "only for MSVC or ICC on an Intel Core 2 Duo", for example.
And lastly, we could add --legacy or some such flag which cancels all these fixes and reverts the behaviour to JS/TS. But in my opinion it's unnecessary; almost all Unicode-aware libraries try to mimic the original String interface.
This is the kind of issue where if something is going to be done, it's easier to do it sooner rather than later. So I'm posting here to make another appeal. AssemblyScript calls itself "A language made for WebAssembly". WebAssembly seeks to make sense on its own terms, rather than behaving like JavaScript for JavaScript's sake.
So, I propose to rename the functions which work in terms of the underlying code-unit concept, such as charCodeAt, into a String.JS namespace, alongside the existing String.UTF16 namespace. Alternatively, perhaps call the namespace String.WTF16. Either way, the main goal is to make a visual separation between functions that expose a specific underlying encoding and functions that don't.
During the last conversation on this topic, we came to the conclusion that it makes sense to create a new, additional string class ("Str"/"str") which would be completely Unicode-aware, possibly even use UTF-8 encoding, and at the same time have an interface as close as possible to classic strings, however without random access like charCodeAt, charAt, etc.
The whole UTF-8 vs WTF-16 discussion is super unfortunate for us. AssemblyScript just so happens to be torn in between the two worlds as it both aims to be a language for WebAssembly, with a majority of stakeholders apparently trying to get rid of 16-bit unicode, and a language that looks and feels pretty much like TypeScript. As such it is based upon, and works best with, WebAPIs that are specified and designed for WTF-16, yet it compiles to WebAssembly. Our options are:
It seems there is simply no good decision we can make here, and whatever we do, we'll get ourselves into trouble. Path of least resistance might be 3., but that'll put AS at a disadvantage exactly where it currently excels, so morale to re-implement half of stdlib just to build something suboptimal isn't exactly high. As I said, it's all so unfortunate.
There are indeed several related conversations, but I think the specific issue here can be considered in isolation. Let's forget UTF-8 vs WTF-16 here for a moment, and just focus on "encoding-independent" vs. "encoding-specific" APIs.
The specific change I'm proposing here is just to rename encoding-specific functions so that they're explicit about it.
It's a simple change. It can help users understand when they need to be aware of encoding-level functions, since these functions can be error-prone in a way that encoding-independent functions aren't. And it can give you more flexibility in the future, no matter what you end up deciding to do about encodings in general.
Alright, let's play this through, using String#charCodeAt as an example. Currently, one would write
var str = "some string";
if (str.charCodeAt(0) == 0x73) {
  // ...
}
Can you give me an example of what you are envisioning with namespaces there? In particular I worry that not having a .charCodeAt anymore will feel alien to TS devs, as would non-standard APIs replacing it.
There are probably multiple ways to do it; I was imagining something similar to the existing String.UTF16.byteLength. So perhaps it would look like this:
var str = "some string";
if (String.JS.charCodeAt(str, 0) == 0x73) {
  // ...
}
Of course, users coming from TS may find this alien or more verbose. However, this is also a great moment to point out that a better way to write this code would be:
var str = "some string";
if (str.startsWith("s")) {
  // ...
}
Encoding-independent, easier to read, and more robust in the case where the string is empty :smile:. Of course, this is just a simple example; however, it generalizes -- a lot of seeming uses for charCodeAt have better alternatives, and part of the point here is to encourage programmers to use these better alternatives when appropriate. Of course, charCodeAt can still be available for when people really need it, but by putting it in context and making users aware of what they're doing when they use it, you can gain more flexibility for the future.
We can't break compatibility for existing strings, but we could create a new kind of string which reimplements all methods in a fully Unicode-aware way and removes the remaining unsafe operations. It would also make it possible to almost seamlessly convert one class of strings into the other simply by changing the type declaration. Like:
var strWTF16: str = ...
var isSChar = strWTF16.charCodeAt(0) == 0x73; // compilation error

var strWTF16: str = ...
// If "str" is WTF-16, this conversion costs nothing; if it is UTF-8, it has to call String.UTF8.decode
var isSChar = (strWTF16 as string).charCodeAt(0) == 0x73; // ok
But with that reasoning, wouldn't .substring be equally problematic? While there's .codePointAt to substitute a missing .charCodeAt, I find it hard to imagine a String class that doesn't provide a .substring member but instead enforces the use of String.JS.substring(str, a, b) everywhere. At least I wouldn't know how to communicate that to users.
Yes; substring and a few others also expose the encoding.

I'm not deeply familiar with AssemblyScript; I expect there's room for some creativity here. And @MaxGraey's idea of introducing new types looks like it would make some different options available as well.
Another thing that may help is looking at what people are using substring for. The existing split can handle a lot of the cases that come to mind. Some additional functions that may be useful might be:
function stripPrefix(prefix: string): string | null
function stripSuffix(suffix: string): string | null
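Possible implementations, sketched on top of the encoding-independent startsWith/endsWith (the names come from the signatures above; written as free functions here for brevity):

// Both helpers only ever cut at a full-match boundary, so they can
// never split a surrogate pair regardless of the underlying encoding.
function stripPrefix(str: string, prefix: string): string | null {
  return str.startsWith(prefix) ? str.substring(prefix.length) : null;
}

function stripSuffix(str: string, suffix: string): string | null {
  return str.endsWith(suffix) ? str.substring(0, str.length - suffix.length) : null;
}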
My general feeling there is that restricting or changing access to .substring most likely goes too far, as it can be used safely and folks expect it to be there. It typically isn't used with arbitrary constants that may split a surrogate pair either; a common pattern is to get the start and end of a region with .indexOf and then cut it out with .substring, independent of the substring's actual encoded length. And if we talk about .substring, we aren't too far away from talking about .length as well. Not sure how .split would help.
A deprecation warning on just .charCodeAt and .charAt may be justifiable, but everything else, oof, is likely to leave us with a useless string class, or make AS a different language. This is really asking for a lot -- an API change one probably wouldn't ask Java or .NET to make, both of which will eventually face the same challenge.
That indexOf example is something you can also do with split. Instead of "give me the index where this other string appears, and then I'll do a substring there", split lets you do "split the string where this other string appears".
I can't quite follow how split can help there universally. For example, the result will exclude the separator and produce an intermediate garbage array. Say one doesn't know what ??? is and wants the region from hello to world inclusive:
var str = "???hello???world???";
var p1 = str.indexOf("hello");
if (~p1) {
  let p2 = str.indexOf("world", p1 + "hello".length);
  if (~p2) {
    return str.substring(p1, p2 + "world".length);
  }
}
return null;
What would be a safer but still efficient alternative to this code sample?
The conclusion drawn in https://github.com/AssemblyScript/assemblyscript/issues/1653#issuecomment-773427339 may be of interest in the context of what's being discussed here.
"hello" + str.split("hello", 2)[1].split("world", 2)[0] + "world"
Perhaps slightly less efficient due to doing two string concatenations instead of one, but if this is a common case, a function to do this kind of substring in the standard library could fix that.
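A hypothetical shape for such a stdlib helper (name and signature made up); it only cuts at full-match boundaries, so it stays safe while doing a single substring:

// Hypothetical helper: the region from the start of `from` to the
// end of `to`, inclusive, or null if either marker is missing.
function substringBetweenInclusive(str: string, from: string, to: string): string | null {
  const p1 = str.indexOf(from);
  if (p1 === -1) return null;
  const p2 = str.indexOf(to, p1 + from.length);
  if (p2 === -1) return null;
  return str.substring(p1, p2 + to.length);
}

console.log(substringBetweenInclusive("???hello???world???", "hello", "world"));
// "hello???world"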
A more realistic scenario, which works for any case and looks quite simple:
function capitalize(str: string): string {
  return str.charAt(0).toUpperCase() + str.substring(1);
}
In Rust or Go, for example, which don't have random access:
pub fn capitalize(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
    }
}
So it requires using iterators, which is pretty unnatural for strings in JS/TS. I don't know how to reproduce this better with a new UTF-8 str API without using iterators, and anyway it would be a completely different experience.
Yeah; as I said, match can handle a lot of cases, but not everything.
The AssemblyScript version there doesn't work correctly in various cases. For example, 𑣙 (U+118D9) should be capitalized to 𑢹 (U+118B9), but the AssemblyScript code above doesn't do that. The Rust code you gave does handle that case. I expect there's room for some designing here.
And to be clear, in this issue I'm not suggesting removing charCodeAt and substring; I'm suggesting presenting them in a way which communicates their association with a particular encoding. This could perhaps help programmers be aware of where they need to think about encoding-specific concerns like lone surrogates, create space for people to design new string functions which aren't associated with a particular encoding to handle common use cases, and help set programmer expectations when running AS code in other environments.
For example, 𑣙 (U+118D9) should be capitalized to 𑢹 (U+118B9), but the AssemblyScript code above doesn't do that
Yeah, that's a good point. So it could be solved via the iterator implicitly:
function capitalizeUnicode(str: string): string {
  const [firstChar = '', ...rest] = str; // string destructuring iterates by code point
  return firstChar.toUpperCase() + rest.join('');
}

capitalizeUnicode('𑣙𑣙𑣙');
// > 𑢹𑣙𑣙
So probably this is the best way we could handle strings with UTF-8 without introducing some special API.
Treating Unicode strings as arrays often leads to bugs where code processes text in some languages correctly but not others. In JavaScript, it's surprising that "🤦🏼‍♂️".length == 7, and the advice to programmers often is: you usually don't want to look at .length, because it isn't reliably what end users think of as characters, it isn't reliably the number of codepoints, and it isn't reliably related to the display width of the string.

Similarly, function names borrowed from JS using the term "Char", such as fromCharCode, are confusing to programmers coming from non-JS languages, since code units aren't always characters.

So, what if AssemblyScript moved functions which work in terms of the underlying code-unit concept, such as charCodeAt, into a String.JS namespace, similar to the String.UTF16 namespace? They'd all be available, and easily accessible. But they'd be visually distinguished from the other string functions, making it clear where code-unit assumptions are being made. It would also leave more conceptual room in the base String namespace for new features in the future.

Another effect of the name String.JS could be to signal to programmers that these functions won't necessarily always be optimal or natural in non-JS embeddings of Wasm, which may give AssemblyScript as a language more implementation flexibility in non-JS environments.

All that said, I don't know where AssemblyScript stands on standard library API stability at this time. If breaking changes are out of scope, perhaps some of the above goals could at least be advanced through documentation.