Dart Strings should support Unicode grapheme cluster operations

lrhn commented 6 years ago

The Dart String class is a sequence of UTF-16 code units (aka. 16-bit integers). It has a runes getter which provides a way to iterate the string as code points. However, that is not sufficient to perform operations which treat the string content as human readable Unicode text, because the unit of representation for that is an extended grapheme cluster which can be more than one code unit. The most traditional example is the string "e\u0301" which contains only one grapheme cluster (the U+0301 code point is the [combining acute accent](accent aigu combining mark) which combines with the prior e to designate the glyph é). More complicated examples include combining emojis or country-codes (flag emojis).

Users currently cannot work with strings at the grapheme cluster level. This leads to tricky bugs where tests work for simple examples, but the program fails badly when it encounters real-life text.

The Dart String class should, at the very least, provide a way to iterate the string as a sequence of grapheme clusters. There should probably also be other operations on the grapheme cluster sequence, so users won't have to do everything manually. The exact operations and API will need to be designed.

It might also be useful to make some changes to the String class, or add other related functionality in separate libraries or packages.

I've collected a number of ideas, wishes and concerns about such changes in a document.

A minimal solution to this issue would be a graphemeClusters getter on String which provides an iterable over "grapheme clusters". We believe this to be practically possible, even when compiled to JavaScript.

lrhn commented 6 years ago

We hope to decide on an approach for this during Q4 2018, so inputs are welcome.

There are many different possible approaches, but we will need a design that is consistent and usable, efficiently implementable (even in JavaScript), that it is possible to migrate existing code to, and that isn't too blocking for other possible future language features (for example, let's not use all remaining ASCII characters for a new string syntax, we want some left for other things too!)

If we need breaking language changes to fully use new string features that we want to introduce, one way to make that breakage migratable could be to make the change start as opt-in on a per library basis, as long as non-migrated and migrated code can still co-exist in the same program. (This kind of language-change opt-in feature is something we will likely need for other changes too, so it's on our roadmap anyway).

leafpetersen commented 6 years ago

cc @rakudrama

sffc commented 6 years ago

My vote: make a new class called Text, brand it as the main interface for human-readable text processing, and rebrand String as a low-level data type.

Why: correct Unicode processing must be easier than incorrect Unicode processing. If grapheme cluster processing is anything less than a first-class citizen, the default for developers will be to perform incorrect, UTF-16 processing, and we will continue to see preventable bugs for years to come.

Hixie commented 6 years ago

Text is a high-profile class name in the Flutter framework, so I'd rather we didn't use Text.

rakudrama commented 6 years ago

One problem that will have to be addressed with a brand-new class is how does it work with JavaScript interop. It would be pain to require user to wrap and unwrap strings and it is not clear we could do that automatically.

gspencergoog commented 6 years ago

I like the idea of a low level and high level interface for strings, and I agree with @sffc that it has to be easier to do things correctly than incorrectly.

Maybe the low level parts of the current String interface (all the code unit calls, runes, etc) could move to a class called string or str (named along the lines of int and double), and the String class would only deal with grapheme clusters in its API, but have a low level string accessor. Since actions on the String class operate on a string internally, converting to JavaScript should be possible.

This would definitely be a breaking change, but the vast majority of call sites wouldn't have to stop using String, only those that used API that moved to the low level string class.

sffc commented 6 years ago

If we're comfortable making a breaking change, then the low-hanging fruit would be to eliminate the [] operator on the String class. This is the easiest place for a programmer to make an error.

If we did this, then adding the graphemeClusters() getter on String would be fine, essentially putting grapheme cluster iteration into the same category as code point iteration ("runes" in Dart) and code unit iteration. A new class would not be needed.

This being said, although [] is in my opinion the worst offender on the current String class, there are other issues, too, that we may want to consider solving.

padLeft/padRight: Do these operate in terms of code units? Maybe make them operate in grapheme clusters instead?
split, replace, and similar functions: Can we make sure these methods won't create substrings that split grapheme clusters apart?

rakudrama commented 6 years ago

I think we need a much shorter phrase than s.graphemeClusters()... - something like s.runes.... If we can't think if anything better than s.characters... we should use that.

indexOf and substring are worse than [] since they are used more, and often with arithmetic that makes unwarranted assumptions, e.g. afterABC = s.substring(s.indexOf('abc') + 3). Indexes are bad, and all operations taking or producing an index are suspect. This is why I experimented with an index-free API .

padLeft and padRight can be operations on any view of the string, the kind of view determines how to measure the number of elements in the input. e.g. s.characters.padLeft s.runes.padLeft s.codeUnits.padLeft s.padLeft - deprecated, currently an alias for s.codeUnits.padLeft, could be changed to be an alias for s.characters.padLeft. ('ab'.padLeft(5, 'xx') already produces 'xxxxxxab' since this is more versatile, see documentation)

With operations that match elements (split, replace, etc), we can put the extended grapheme cluster versions on s.characters as suggested for padLeft. One issue here is how to do the matching - do we assume that the arguments all have the same normalization?

munificent commented 6 years ago

If we're comfortable making a breaking change, then the low-hanging fruit would be to eliminate the [] operator on the String class. This is the easiest place for a programmer to make an error.

Do we have any data on how many uses of the existing code-point based [] are incorrect? Many strings are known by design to not contain arbitrary human text. Declaring all of those uses wrong sounds a lot like declaring all uses of + wrong because Dart doesn't have arbitrary-precision integers.

Do we know what fraction of strings in an application contain human text versus other things (database column names, JSON map keys, enum names, HTML tag names, etc.)?

adriancmurray commented 6 years ago

As a flutter developer I personally make heavy use of trim, substring, and more methods many of you have labeled as "wrong" with many strings in my app so the words "breaking change" make me nervous to read; but I'm all for getting grapheme clusters to work. Also, @munificent, for whatever my anecdote is worth, most of the strings I deal with are human readable. Thanks for your efforts! I'm eagerly waiting for a solution for this as it's required for me to release. Breaking change or not.

gspencergoog commented 6 years ago

I agree with @rakudrama: graphemeClusters is a horrible name, but I disagree that all code with indices is suspect. Sure, indexing operations might be slower with grapheme clusters, but they will solve the problem the way developers expect. A lot can be done behind the scenes to make it less of an impact on the performance to make the common case relatively fast (detecting and special-casing when strings don't contain extended sequences or are 7-bit ASCII), and it's how people expect to interact with strings.

Code like afterABC = s.substring(s.indexOf('abc') + 3) needs to be allowed: it's the expected way to manipulate strings. A breaking change is one thing, but to require a complete rewrite of every string manipulation routine to not use indices is a non-starter for most developers. I think I should be able to manipulate a string by grapheme clusters with an similar or identical API to a "fast" string, even if it's O(n) instead of O(1) to access a character.

The way to handle the issue of performance is not to limit the API drastically, but to give the developer the choice: Use a string type/view that is appropriate for the task. For example, AsciiString for pure ASCII manipulation, Utf16String for codepoint manipulation, and CharacterString (grapheme clusters) for when you must work at a semantic level of human-understood characters.

We don't stop people from doing a List.contains(), when a Set or Map would be faster: sometimes it's the right choice given all the available information. Why should we make that decision for people who need grapheme clusters?

munificent commented 6 years ago

I agree with everything Greg said. We should get users using best practices by making those practices easier not by deliberately making other practices harder. Deliberately lowering usability of an API almost never works out well.

There was a time in Dart's history where List didn't have removeAt(index) because removing a single element from a list is O(n) and slow and you shouldn't do that. Instead you had to do removeRange(index, 1). It was a bad design and users rightly chastised us for it until the API was fixed. See also: disallowing + on strings.

A key virtue of Dart is approachability. The idioms you know from other mainstream languages often work in Dart too so you can get up and running quickly without having to learn the "Dart way" of doing something. We should be very cautious about sacrificing that.

suragch commented 5 years ago

Although it can be a pain to give up integer indexing for strings, the Swift language did seem to get grapheme clusters right, and it would be worth considering following the same approach. The advantage of being able to iterate grapheme clusters and avoid bugs related to integer indexing is well worth the pain in my opinion. See https://oleb.net/blog/2017/11/swift-4-strings/

mit-mit commented 4 years ago

https://github.com/dart-lang/language/issues/685 is our current attempt at supporting this via a package

sffc commented 4 years ago

One other thing worth mentioning: the definition for code point is well-defined and very unlikely to change, but the definition for grapheme cluster changes from Unicode release to Unicode release. So, different versions of Dart would have different behaviors depending on which Unicode version was bundled.

In my opinion, it would not be unreasonable for Dart strings to obey code point boundaries by default, which handles many use cases, and defer to third-party libraries to handle grapheme cluster boundaries.

lrhn commented 4 years ago

We now have a package:characters which implement grapheme cluster based operations on strings.

dart-lang / language

Dart Strings should support Unicode grapheme cluster operations #34