AssemblyScript / assemblyscript

A TypeScript-like language for WebAssembly.
https://www.assemblyscript.org
Apache License 2.0

Encourage effective use of Unicode #1487

Open sunfishcode opened 4 years ago

sunfishcode commented 4 years ago

Treating Unicode strings as arrays often leads to bugs where code processes text in some languages correctly but not others. In JavaScript, it's surprising that "🤦🏼‍♂️".length == 7, and the advice to programmers often is: you usually don't want to look at .length, because it isn't reliably what end users think of as characters, it isn't reliably the number of codepoints, and it isn't reliably related to the display width of the string.
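Concretely, here is a small sketch of the three different answers one can get for the same string in plain JS (this assumes Intl.Segmenter is available, which at the time of writing is not the case in all engines):

```javascript
// One emoji, three different "lengths" (sketch; assumes Intl.Segmenter
// is available, as in modern browsers and recent Node.js).
const s = "🤦🏼‍♂️";
console.log(s.length);       // 7 -> UTF-16 code units
console.log([...s].length);  // 5 -> Unicode code points
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(s)].length); // 1 -> user-perceived characters
```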

Similarly, function names borrowed from JS using the term "Char", such as fromCharCode, are confusing to programmers coming from non-JS languages, since code units aren't always characters.

So, what if AssemblyScript moved functions which work in terms of the underlying code-unit concept, such as charCodeAt, into a String.JS namespace, similar to the String.UTF16 namespace? They'd all be available, and easily accessible. But, they'd be visually distinguished from the other string functions, making it clear where code-unit assumptions are being made. It would also leave more conceptual room in the base String namespace for new features in the future.
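A minimal sketch of the idea in plain JS (the String.JS namespace is hypothetical; the shape and names are illustrative only, loosely mirroring the free-function style of the existing String.UTF16 namespace):

```javascript
// Hypothetical sketch only: String.JS does not exist in AssemblyScript
// or JS today. The point is to make code-unit access visually explicit.
const StringJS = {
  charCodeAt: (s, i) => s.charCodeAt(i),              // code-unit access
  fromCharCode: (...units) => String.fromCharCode(...units),
};
console.log(StringJS.charCodeAt("some string", 0)); // 115 (0x73, 's')
```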

Another effect of the name String.JS could be to signal to programmers that these functions won't necessarily always be optimal or natural in non-JS embeddings of Wasm, which may give AssemblyScript as a language more implementation flexibility in non-JS environments.

All that said, I don't know where AssemblyScript stands on standard library API stability at this time. If breaking changes are out of scope, perhaps some of the above goals could at least be advanced through documentation.

willemneal commented 4 years ago

I agree. One way to avoid the breaking change is to deprecate the current string API with a message outlining the move to String.JS and a link to documentation explaining why.

Also, what is the preferred way of getting the proper length in JS?

MaxGraey commented 4 years ago

In JavaScript, it's surprising that "🤦🏼‍♂️".length == 7

Honestly, this is not only a problem of JavaScript, but of Rust, Swift, Java, and all the other languages which measure string length in code units or, at best, code points instead of graphemes. All of these languages have exposed some new API for iterating over graphemes. For example, Rust:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    for g in "नमस्ते्".graphemes(true) { // hindi
        println!("grapheme - {}", g);
    }
}

JavaScript will do this with Intl.Segmenter (which is supported only in Chrome Canary and Firefox Nightly for now):


const segmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });
for (const { segment: g } of segmenter.segment("नमस्ते्")) {
  console.log("grapheme: ", g);
}

MaxGraey commented 4 years ago

I agree. One way to avoid the breaking change is to deprecate the current string API with a message outlining the move to String.JS and a link to documentation explaining why.

I don't think we should rework the existing String or expose something new that is fully Unicode-aware (Str, for example). This would blow up the runtime significantly. See the comment by the author of swiftwasm, who is thinking of switching back to a weaker UTF-16 variant of strings because the Swift runtime is up to 5-6 MB just for hello world.

Also, in most cases we don't really care about fully compliant Unicode strings. But if you really need that, you could include something like Intl.Segmenter.

MaxGraey commented 4 years ago

By the way, no language supports all the special cases of Unicode case mapping, just the most popular ones like the Greek final sigma (which is context-dependent): https://github.com/rust-lang/rust/issues/26035

We also support this: https://github.com/AssemblyScript/assemblyscript/pull/1113

But SpecialCasing.txt contains a lot more complex special cases which nobody handles (except languages which bundle the full ICU library, which increases their runtime size really significantly).
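For illustration, a couple of SpecialCasing.txt mappings that JS engines do implement, since the ECMAScript spec requires the locale-insensitive mappings from SpecialCasing.txt in toUpperCase/toLowerCase:

```javascript
// German sharp s: one character maps to two on uppercasing.
console.log("ß".toUpperCase());    // "SS"
// Greek capital sigma lowercases to the final form ς (U+03C2) at word end.
console.log("ΟΔΟΣ".toLowerCase()); // "οδος" with final sigma as last char
console.log("ΟΔΟΣ".toLowerCase().charCodeAt(3).toString(16)); // "3c2"
```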

MaxGraey commented 4 years ago

The Unicode standard is really a mess currently: it has a lot of redundant glyphs and a really large number of special cases, which constantly grow from version to version.

MaxGraey commented 4 years ago

@sunfishcode

Similarly, function names borrowed from JS using the term "Char", such as fromCharCode, are confusing to programmers coming from non-JS languages, since code units aren't always characters.

Since ES6, JS also supports code points via the Array.from, str.codePointAt, and String.fromCodePoint methods. Other methods like toUpperCase are Unicode-aware, except split. Simple example:

function reverseStringNaive(str) {
  return str.split("").reverse().join("");
}

function reverseStringUnicodeAware(str) {
  return Array.from(str).reverse().join("");
  // or [...str].reverse().join("");
}

console.log(reverseStringNaive("foo 𝌆 bar")); 
// will print:
//> rab �� oof
console.log(reverseStringUnicodeAware("foo 𝌆 bar"));
// will print:
//> rab 𝌆 oof

Also, what is the preferred way of getting the proper length in JS?

@willemneal

// calc length in O(1)
console.log("🤦🏼‍♂️".length);  // 7

// calc length in O(N)
console.log([..."🤦🏼‍♂️"].length); // 5, which aligns with Rust's "🤦🏼‍♂️".chars().count() result

Iterate by code point (Unicode-aware):

for (const c of "foo 𝌆 bar") {
  console.log(c);
}

So JS/TS/AS contain all the instruments for handling Unicode strings and even graphemes. But this isn't needed in 95% of cases.

MaxGraey commented 4 years ago

Here's what the official ICU documentation says about library size:

ICU includes a standard library of data that is about 16 MB in size. Most of this consists of conversion tables and locale information. The data itself is normally placed into a single shared library.

Update: as of ICU 64, the standard data library is over 20 MB in size. We have introduced a new tool, the ICU Data Build Tool, to replace the makefiles explained below and give you more control over what goes into your ICU locale data file.

So some languages (like Swift) either use the small_icu mode of ICU or use their own custom implementations (Go, Rust, C#, C++) which contain only the really necessary Unicode property tables, compressed as static tries (three-stage indirect lookups). We follow this approach as well.
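A toy sketch of the staged-table idea (illustrative only; real tables are generated from the Unicode Character Database, use more stages, and share identical blocks to compress further):

```javascript
// Two-stage lookup of a toy Unicode-style property table.
// Stage 1 maps the code point's high bits to a block index;
// stage 2 holds per-block property values. Identical blocks
// would be shared (stored once) in a real generated table.
const BLOCK_SHIFT = 4;        // toy block size: 16 code points per block
const stage1 = [0, 1, 0];     // block index for ranges 0-15, 16-31, 32-47
const stage2 = [
  new Uint8Array(16),         // block 0: property = 0 everywhere
  new Uint8Array(16).fill(1), // block 1: property = 1 everywhere
];
function lookup(cp) {
  const block = stage1[cp >> BLOCK_SHIFT];
  return stage2[block][cp & ((1 << BLOCK_SHIFT) - 1)];
}
console.log(lookup(5));  // 0
console.log(lookup(20)); // 1
```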

MaxGraey commented 4 years ago

Regarding string interfaces: what can we do to make them safer for users? That's a good question. I guess we could declare some methods as deprecated and not recommended for usage (like charCodeAt, String.fromCharCode, split, and others). TypeScript definition files support the @deprecated directive comment now, and we have our own definitions for the stdlib, so this is possible. Secondly, we could ban these methods, with alternative suggestions, in --pedantic mode. @dcodeIO wdyt?

dcodeIO commented 4 years ago

I am not sure about deprecating these, like if one wants to deal with a WTF-16 string, then these are still the way to go I think? For instance, we use them for small strings in the loader because that's faster than piping through a TextDecoder. Of course we can add more details to the documentation, with a neutral link for everyone interested to learn more?

MaxGraey commented 4 years ago

I meant deprecating some string methods inside AssemblyScript in the future, once Array.from and iterators have landed.

dcodeIO commented 4 years ago

I am not so sure about that, unless JS itself officially discourages their use. Otherwise we are just creating unnecessary barriers, aren't we?

MaxGraey commented 4 years ago

How about deprecating them only in --pedantic mode?

MaxGraey commented 4 years ago

Btw, Rust also requires a special counting API for code points:

println!("utf8 units (bytes): {}", "🤦🏼‍♂️".len()); // 17  ->  17 bytes
println!("code points: {}", "🤦🏼‍♂️".chars().count()); // 5

JS:

console.log("utf16 units (ushorts):", "🤦🏼‍♂️".length); // 7    "🤦🏼‍♂️".length * 2  ->  14 bytes
console.log("code points:", [..."🤦🏼‍♂️"].length); // 5

So Rust's len() and JS's .length just retrieve the encoded length in different unit spaces (like ft and m), which can simply be converted to bytes, and this makes sense. If you need the count of code points, you should call a different method; if you want to count visible/writing units (graphemes), you have to deal with a totally different API.
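The unit-to-byte conversions above can be checked directly in JS (a sketch; TextEncoder gives the UTF-8 byte length, matching Rust's len()):

```javascript
const s = "🤦🏼‍♂️";
console.log(s.length * 2);                       // 14 -> UTF-16 bytes (7 units * 2)
console.log(new TextEncoder().encode(s).length); // 17 -> UTF-8 bytes, same as Rust len()
console.log([...s].length);                      // 5  -> code points, encoding-independent
```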

dcodeIO commented 4 years ago

Really not sure. If the broader ecosystem goes that route, perhaps a diagnostic message about a potential pitfall, but I don't see a reason for asc to spearhead something like that. At this point in time, I'd expect that most users would complain about it.

sunfishcode commented 4 years ago

@willemneal

Also, what is the preferred way of getting the proper length in JS?

The preferred way is to ask a more specific question :wink: . Are you asking for the visual width, the number of user-perceived characters, the number of Unicode code points, or the number of bytes of storage used?

@MaxGraey

I don't think we should rework existing String or expose something new which fully unicode aware

I agree. I'm not looking to add new functionality in this issue, but just to present the current functionality in a different way.

console.log(reverseStringNaive("foo 𝌆 bar")); // will print: //> rab �� oof

This is a good example -- it's hard to see how this behavior helps anyone, except via bug-for-bug compatibility with JS.

@dcodeIO

I am not so sure about that, unless JS itself officially discourages their use. Otherwise we are just creating unnecessary barriers, aren't we?

I agree; deprecation feels too strong here. In particular, for functions like split, it is possible to use them correctly, and there's currently no simple alternative.

Renaming split to String.JS.split would mean it remains available, and if in the future someone wanted to add a new split to AssemblyScript which did respect code-point boundaries, there'd be an obvious place in the namespace for it. And that seems like a door worth keeping open -- such a thing would still be familiar to JS programmers, it just wouldn't have surprising edge-case behavior.

MaxGraey commented 4 years ago

We could do even better: we could actually fix String#split and probably the other methods which don't handle surrogate pairs. We already do similar fixes for array.sort() with the default comparator and a couple of other methods. It's legitimate because we don't care about legacy code compatibility. Also, all these methods would behave identically to JS/TS up to code point 0x10FFFF; only s.charCodeAt(i) and String.fromCharCode(...) would keep the old behaviour.
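For context, the surrogate-pair breakage in question is easy to reproduce in plain JS (a sketch; how a fixed AS split would behave exactly is the open design question here):

```javascript
// split("") cuts between UTF-16 code units, breaking surrogate pairs;
// the string iterator splits between code points instead.
const s = "a𝌆b"; // 𝌆 is U+1D306, an astral code point (surrogate pair)
console.log(s.split("")); // ["a", "\ud834", "\udf06", "b"] -> pair broken
console.log([...s]);      // ["a", "𝌆", "b"]                -> pair intact
```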

dcodeIO commented 4 years ago

But your fix is someone else's bug. The Array#sort problem is a little different in that its behavior is just odd and doesn't do anything useful, but String#split is something people got used to and might expect to function exactly that way. What if we provide non-standard but safe alternatives instead, like String#splitCodePoints and String#lengthCodePoints?

MaxGraey commented 4 years ago

but String#split is something people got used to and might expect to function exactly that way

If people use String#split, it means they don't care about surrogate pairs at all. Otherwise, they're using something like:

let unsafe = 'Emoji 🤖'.substr(0,7);              // Emoji �
let safe   = [...'Emoji 🤖'].slice(0,7).join(''); // Emoji 🤖
// or do this via regexps

or third-party libraries like https://github.com/mathiasbynens/esrever https://www.npmjs.com/package/stringz https://www.npmjs.com/package/unicode-substring https://www.npmjs.com/package/unicode-string https://www.npmjs.com/package/runes2 etc.

And I don't think anybody will deliberately exploit the existing UTF-16-broken behaviour for some purpose. Even if they did, it would be bad practice, like exploiting UB in C++ to speed up some process "only for MSVC or ICC on an Intel Core 2 Duo", for example.

MaxGraey commented 4 years ago

And finally, we could add a --legacy flag or similar which cancels all these fixes and reverts the behaviour to JS/TS. But in my opinion it's unnecessary; most Unicode-aware libraries try to mimic the original String interface anyway.

sunfishcode commented 3 years ago

This is the kind of issue where if something is going to be done, it's easier to do it sooner rather than later. So I'm posting here to make another appeal. AssemblyScript calls itself "A language made for WebAssembly". WebAssembly seeks to make sense on its own terms, rather than behaving like JavaScript for JavaScript's sake.

So, I propose to rename the functions which work in terms of the underlying code-unit concept, such as charCodeAt, into a String.JS namespace, alongside the existing String.UTF16 namespace. Alternatively, perhaps call the namespace String.WTF16. Either way, the main goal is to make a visual separation between functions that expose a specific underlying encoding and functions that don't.

MaxGraey commented 3 years ago

During the last conversation on this topic, we came to the conclusion that it makes sense to create a new, additional string class "Str / str" which would be completely Unicode-aware and possibly even use UTF-8 encoding, while keeping an interface as similar as possible to classic strings, but without random access like charCodeAt, charAt, etc.

dcodeIO commented 3 years ago

The whole UTF-8 vs WTF-16 discussion is super unfortunate for us. AssemblyScript just so happens to be torn in between the two worlds as it both aims to be a language for WebAssembly, with a majority of stakeholders apparently trying to get rid of 16-bit unicode, and a language that looks and feels pretty much like TypeScript. As such it is based upon, and works best with, WebAPIs that are specified and designed for WTF-16, yet it compiles to WebAssembly. Our options are:

  1. Supporting just WTF-16 favors the TypeScript side, but gets us into trouble with the WebAssembly side of things. For instance, I totally regret getting into the Wasm spec discussions already.
  2. Supporting just UTF-8 favors the WebAssembly side, but gets us into trouble with the TypeScript side of things. Like, our users expect us to become actually better at TS support, not worse.
  3. Supporting both encodings in one API bloats binary size, which just so happens to be important on the Web, because most string methods, or APIs consuming strings, must provide two mechanisms to do the same thing.
  4. Supporting both encodings in separate APIs doubles what we have to support, and will probably never be seamless. Can't call one API with a String.WTF16, or another with a String.UTF8.

It seems there is simply no good decision we can make here, and whatever we do, we'll get ourselves into trouble. Path of least resistance might be 3., but that'll put AS at a disadvantage exactly where it currently excels, so morale to re-implement half of stdlib just to build something suboptimal isn't exactly high. As I said, it's all so unfortunate.

sunfishcode commented 3 years ago

There are indeed several related conversations, but I think the specific issue here can be considered in isolation. Let's forget UTF-8 vs WTF-16 here for a moment, and just focus on "encoding-independent" vs. "encoding-specific" APIs.

The specific change I'm proposing here is just to rename encoding-specific functions so that they're explicit about it.

It's a simple change. It can help users understand when they need to be aware of encoding-level functions, since these functions can be error-prone in a way that encoding-independent functions aren't. And it can give you more flexibility in the future, no matter what you end up deciding to do about encodings in general.

dcodeIO commented 3 years ago

Alright, let's play this through, using String#charCodeAt as an example. Currently, one would write

var str = "some string";
if (str.charCodeAt(0) == 0x73) {
  // ...
}

Can you give me an example of what you are envisioning with namespaces there? In particular I worry that not having a .charCodeAt anymore will feel alien to TS devs, as would non-standard APIs replacing it.

sunfishcode commented 3 years ago

There are probably multiple ways to do it; I was imagining something similar to the existing String.UTF16.byteLength. So perhaps it would look like this:

var str = "some string";
if (String.JS.charCodeAt(str, 0) == 0x73) {
  // ...
}

Of course, users coming from TS may find this alien or more verbose. However, this is also a great moment to point out that a better way to write this code would be:

var str = "some string";
if (str.startsWith("s")) {
  // ...
}

Encoding-independent, easier to read, and more robust in the case where the string is empty :smile: . Of course, this is just a simple example, however it generalizes -- a lot of seeming uses for charCodeAt have better alternatives, and part of the point here is to encourage programmers to use these better alternatives when appropriate. Of course, charCodeAt can still be available, for when people really need it, but by putting it in context and making users aware of what they're doing when they use it, you can gain more flexibility for the future.

MaxGraey commented 3 years ago

We can't break compatibility for existing strings, but we could create a new subset of string which reimplements all methods as WTF-16-aware and removes the remaining unsafe operations. It would also make it possible to almost seamlessly convert one class of strings into the other simply by changing the type declaration. Like:

var strWTF16: str = ...
var isSChar = strWTF16.charCodeAt(0) == 0x73; // compilation error

var strWTF16: str = ...
// If "str" is WTF-16, this conversion costs nothing. But if it's UTF-8, it has to call String.UTF8.decode
var isSChar = (strWTF16 as string).charCodeAt(0) == 0x73; // ok

dcodeIO commented 3 years ago

But with that reasoning, wouldn't .substring be equally problematic? While there's .codePointAt to substitute a missing .charCodeAt, I find it hard to imagine a String class that doesn't provide a .substring member but instead enforces the use of String.JS.substring(str, a, b) everywhere. At least I wouldn't know how to communicate that to users.

sunfishcode commented 3 years ago

Yes; substring and a few others also expose the encoding.

I'm not deeply familiar with AssemblyScript; I expect there's room for some creativity here. And @MaxGraey's idea of introducing new types looks like it would make some different options available as well.

Another thing that may help is looking at what people are using substring for. The existing split can handle a lot of the cases that come to mind. Some additional functions that may be useful might be:

dcodeIO commented 3 years ago

My general feeling there is that restricting or changing access to .substring most likely goes too far, as it can be used safely and folks expect it to be there. It typically isn't used with arbitrary constants that may split a surrogate pair either; something like getting the start and end of a region with .indexOf and then cutting it out with .substring, for example, is independent of the substring's actual encoding length. And if we talk about .substring, we aren't too far away from talking about .length as well. Not sure how .split would help.

A deprecation warning on just .charCodeAt and .charAt may be justifiable, but everything else, oof, is likely to leave us with a useless string class, or make AS a different language. This is really asking for a lot: an API change one probably wouldn't ask Java or .NET to make, and both will eventually face the same challenge.

sunfishcode commented 3 years ago

That indexOf example is an example of something you can also do with split. Instead of "give me the index where this other string appears, and then I'll do a substring there", split lets you do "split the string where this other string appears".

dcodeIO commented 3 years ago

I don't think I can quite follow how split can help there universally. For example, the result will exclude the separator and produce an intermediate garbage array. Say one doesn't know what ??? is and wants the region from hello to world inclusive:

var str = "???hello???world???";
var p1 = str.indexOf("hello");
if (~p1) {
  let p2 = str.indexOf("world", p1 + "hello".length);
  if (~p2) {
    return str.substring(p1, p2 + "world".length);
  }
}
return null;

What would be a safer but still efficient alternative to this code sample?

dcodeIO commented 3 years ago

The conclusion drawn in https://github.com/AssemblyScript/assemblyscript/issues/1653#issuecomment-773427339 may be of interest in context of what's being discussed here.

sunfishcode commented 3 years ago

"hello" + str.split("hello", 2)[1].split("world", 2)[0] + "world"

Perhaps slightly less efficient due to doing two string concatenations instead of one, but if this is a common case, a function to do this kind of substring in the standard library could fix that.
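Such a standard-library function could be a thin wrapper over indexOf/substring; a hypothetical sketch (regionInclusive is not a real API, just an illustration of the helper being suggested):

```javascript
// Hypothetical helper: return the region from the first occurrence of
// `from` to the first following occurrence of `to`, inclusive, or null
// if either marker is missing.
function regionInclusive(str, from, to) {
  const p1 = str.indexOf(from);
  if (p1 < 0) return null;
  const p2 = str.indexOf(to, p1 + from.length);
  if (p2 < 0) return null;
  return str.substring(p1, p2 + to.length);
}
console.log(regionInclusive("???hello???world???", "hello", "world")); // "hello???world"
console.log(regionInclusive("no markers here", "hello", "world"));     // null
```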

MaxGraey commented 3 years ago

A more realistic scenario. This works in any case and looks quite simple:

function capitalize(str: string): string {
  return str.charAt(0).toUpperCase() + str.substring(1);
}

In Rust or Go, for example, which don't have random access:

pub fn capitalize(s: &str) -> String {
  let mut c = s.chars();
  match c.next() {
     None => String::new(),
     Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
  }
}

so it requires using iterators, which is pretty unnatural for strings in JS/TS. I don't know how to reproduce this better with a new UTF-8 str API without using iterators. And in any case, it will be a completely different experience.

sunfishcode commented 3 years ago

Yeah; as I said, split can handle a lot of cases, but not everything.

The AssemblyScript version there doesn't work correctly in various cases. For example, 𑣙 (U+118D9) should be capitalized to 𑢹 (U+118B9), but the AssemblyScript code above doesn't do that. The Rust code you gave does handle that case. I expect there's room for some designing here.

And to be clear, in this issue, I'm not suggesting removing charCodeAt and substring; I'm suggesting presenting them in a way which communicates their association to a particular encoding. This could perhaps help programmers be aware of where they need to think about encoding-specific concerns like lone surrogates, create space for people to design new string functions which aren't associated with a particular encoding to handle common use cases, and help set programmer expectations when running AS code in other environments.

MaxGraey commented 3 years ago

For example, 𑣙 (U+118D9) should be capitalized to 𑢹 (U+118B9), but the AssemblyScript code above doesn't do that

Yeah, that's a good point. So it could be solved via the iterator implicitly:

function capitalizeUnicode(str: string): string {
  const [firstChar, ...rest] = str;
  return firstChar.toUpperCase() + rest.join('');
}

capitalizeUnicode('𑣙𑣙𑣙')
// > 𑢹𑣙𑣙

So this is probably the best way we could handle strings with UTF-8 without introducing a special API.