Closed lrhn closed 1 year ago
We hope to decide on an approach for this during Q4 2018, so inputs are welcome.
There are many different possible approaches, but we will need a design that is consistent and usable, efficiently implementable (even in JavaScript), that it is possible to migrate existing code to, and that isn't too blocking for other possible future language features (for example, let's not use all remaining ASCII characters for a new string syntax, we want some left for other things too!)
If we need breaking language changes to fully use new string features that we want to introduce, one way to make that breakage migratable could be to make the change start as opt-in on a per library basis, as long as non-migrated and migrated code can still co-exist in the same program. (This kind of language-change opt-in feature is something we will likely need for other changes too, so it's on our roadmap anyway).
cc @rakudrama
My vote: make a new class called Text
, brand it as the main interface for human-readable text processing, and rebrand String
as a low-level data type.
Why: correct Unicode processing must be easier than incorrect Unicode processing. If grapheme cluster processing is anything less than a first-class citizen, the default for developers will be to perform incorrect, UTF-16 processing, and we will continue to see preventable bugs for years to come.
Text
is a high-profile class name in the Flutter framework, so I'd rather we didn't use Text
.
One problem that will have to be addressed with a brand-new class is how does it work with JavaScript interop. It would be pain to require user to wrap and unwrap strings and it is not clear we could do that automatically.
I like the idea of a low level and high level interface for strings, and I agree with @sffc that it has to be easier to do things correctly than incorrectly.
Maybe the low level parts of the current String interface (all the code unit calls, runes
, etc) could move to a class called string
or str
(named along the lines of int
and double
), and the String class would only deal with grapheme clusters in its API, but have a low level string
accessor. Since actions on the String class operate on a string internally, converting to JavaScript should be possible.
This would definitely be a breaking change, but the vast majority of call sites wouldn't have to stop using String, only those that used API that moved to the low level string
class.
If we're comfortable making a breaking change, then the low-hanging fruit would be to eliminate the []
operator on the String class. This is the easiest place for a programmer to make an error.
If we did this, then adding the graphemeClusters()
getter on String would be fine, essentially putting grapheme cluster iteration into the same category as code point iteration ("runes" in Dart) and code unit iteration. A new class would not be needed.
This being said, although []
is in my opinion the worst offender on the current String class, there are other issues, too, that we may want to consider solving.
I think we need a much shorter phrase than s.graphemeClusters()...
- something like s.runes...
.
If we can't think if anything better than s.characters...
we should use that.
indexOf
and substring
are worse than []
since they are used more, and often with arithmetic that makes unwarranted assumptions, e.g. afterABC = s.substring(s.indexOf('abc') + 3)
.
Indexes are bad, and all operations taking or producing an index are suspect.
This is why I experimented with an index-free API .
padLeft
and padRight
can be operations on any view of the string, the kind of view determines how to measure the number of elements in the input. e.g.
s.characters.padLeft
s.runes.padLeft
s.codeUnits.padLeft
s.padLeft
- deprecated, currently an alias for s.codeUnits.padLeft
, could be changed to be an alias for s.characters.padLeft
.
('ab'.padLeft(5, 'xx')
already produces 'xxxxxxab'
since this is more versatile, see documentation)
With operations that match elements (split, replace, etc), we can put the extended grapheme cluster versions on s.characters
as suggested for padLeft
. One issue here is how to do the matching - do we assume that the arguments all have the same normalization?
If we're comfortable making a breaking change, then the low-hanging fruit would be to eliminate the [] operator on the String class. This is the easiest place for a programmer to make an error.
Do we have any data on how many uses of the existing code-point based []
are incorrect? Many strings are known by design to not contain arbitrary human text. Declaring all of those uses wrong sounds a lot like declaring all uses of +
wrong because Dart doesn't have arbitrary-precision integers.
Do we know what fraction of strings in an application contain human text versus other things (database column names, JSON map keys, enum names, HTML tag names, etc.)?
As a flutter developer I personally make heavy use of trim, substring, and more methods many of you have labeled as "wrong" with many strings in my app so the words "breaking change" make me nervous to read; but I'm all for getting grapheme clusters to work. Also, @munificent, for whatever my anecdote is worth, most of the strings I deal with are human readable. Thanks for your efforts! I'm eagerly waiting for a solution for this as it's required for me to release. Breaking change or not.
I agree with @rakudrama: graphemeClusters
is a horrible name, but I disagree that all code with indices is suspect. Sure, indexing operations might be slower with grapheme clusters, but they will solve the problem the way developers expect. A lot can be done behind the scenes to make it less of an impact on the performance to make the common case relatively fast (detecting and special-casing when strings don't contain extended sequences or are 7-bit ASCII), and it's how people expect to interact with strings.
Code like afterABC = s.substring(s.indexOf('abc') + 3)
needs to be allowed: it's the expected way to manipulate strings. A breaking change is one thing, but to require a complete rewrite of every string manipulation routine to not use indices is a non-starter for most developers. I think I should be able to manipulate a string by grapheme clusters with an similar or identical API to a "fast" string, even if it's O(n) instead of O(1) to access a character.
The way to handle the issue of performance is not to limit the API drastically, but to give the developer the choice: Use a string type/view that is appropriate for the task. For example, AsciiString
for pure ASCII manipulation, Utf16String
for codepoint manipulation, and CharacterString
(grapheme clusters) for when you must work at a semantic level of human-understood characters.
We don't stop people from doing a List.contains()
, when a Set
or Map
would be faster: sometimes it's the right choice given all the available information. Why should we make that decision for people who need grapheme clusters?
I agree with everything Greg said. We should get users using best practices by making those practices easier not by deliberately making other practices harder. Deliberately lowering usability of an API almost never works out well.
There was a time in Dart's history where List
didn't have removeAt(index)
because removing a single element from a list is O(n) and slow and you shouldn't do that. Instead you had to do removeRange(index, 1)
. It was a bad design and users rightly chastised us for it until the API was fixed. See also: disallowing +
on strings.
A key virtue of Dart is approachability. The idioms you know from other mainstream languages often work in Dart too so you can get up and running quickly without having to learn the "Dart way" of doing something. We should be very cautious about sacrificing that.
Although it can be a pain to give up integer indexing for strings, the Swift language did seem to get grapheme clusters right, and it would be worth considering following the same approach. The advantage of being able to iterate grapheme clusters and avoid bugs related to integer indexing is well worth the pain in my opinion. See https://oleb.net/blog/2017/11/swift-4-strings/
https://github.com/dart-lang/language/issues/685 is our current attempt at supporting this via a package
One other thing worth mentioning: the definition for code point is well-defined and very unlikely to change, but the definition for grapheme cluster changes from Unicode release to Unicode release. So, different versions of Dart would have different behaviors depending on which Unicode version was bundled.
In my opinion, it would not be unreasonable for Dart strings to obey code point boundaries by default, which handles many use cases, and defer to third-party libraries to handle grapheme cluster boundaries.
We now have a package:characters
which implement grapheme cluster based operations on strings.
The Dart String class is a sequence of UTF-16 code units (aka. 16-bit integers). It has a
runes
getter which provides a way to iterate the string as code points. However, that is not sufficient to perform operations which treat the string content as human readable Unicode text, because the unit of representation for that is an extended grapheme cluster which can be more than one code unit. The most traditional example is the string"e\u0301"
which contains only one grapheme cluster (the U+0301 code point is the [combining acute accent](accent aigu combining mark) which combines with the priore
to designate the glyphé
). More complicated examples include combining emojis or country-codes (flag emojis).Users currently cannot work with strings at the grapheme cluster level. This leads to tricky bugs where tests work for simple examples, but the program fails badly when it encounters real-life text.
The Dart
String
class should, at the very least, provide a way to iterate the string as a sequence of grapheme clusters. There should probably also be other operations on the grapheme cluster sequence, so users won't have to do everything manually. The exact operations and API will need to be designed.It might also be useful to make some changes to the
String
class, or add other related functionality in separate libraries or packages.I've collected a number of ideas, wishes and concerns about such changes in a document.
A minimal solution to this issue would be a
graphemeClusters
getter onString
which provides an iterable over "grapheme clusters". We believe this to be practically possible, even when compiled to JavaScript.