dart-lang / language

Design of the Dart language

Add syntax for grapheme clusters literals. #1432

Open Cat-sushi opened 3 years ago

Cat-sushi commented 3 years ago

Currently, grapheme clusters (Characters) are the only way to manipulate natural-language text correctly. So I propose a syntax for grapheme cluster literals, such as g"𠮷野". This might also entail a proposal that the characters extension become part of dart:core.

Cat-sushi commented 3 years ago

This proposal is derived from the closed proposal #1428. g"𠮷野".length would return 2 (grapheme clusters), not 3 (UTF-16 code units).
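To make the counts above concrete, here is a minimal sketch using only dart:core. The helper names codeUnitCount and codePointCount are hypothetical and exist only for illustration. Grapheme clusters are not counted here because that requires package:characters; the comments only note what that package would be expected to report.

```dart
// Hypothetical helpers, for illustration only.

/// Number of UTF-16 code units, which is what String.length reports.
int codeUnitCount(String s) => s.length;

/// Number of Unicode code points, via the Runes view.
int codePointCount(String s) => s.runes.length;

void main() {
  const text = '𠮷野';
  // '𠮷' (U+20BB7) lies outside the BMP and needs a surrogate pair.
  print(codeUnitCount(text)); // 3
  print(codePointCount(text)); // 2

  // 'e' followed by U+0301 (combining acute accent): two code points,
  // but a single user-perceived character. A grapheme-cluster view such
  // as Characters would be expected to count it as 1.
  const accented = 'e\u0301';
  print(codeUnitCount(accented)); // 2
  print(codePointCount(accented)); // 2
}
```

For '𠮷野' the code point count and the grapheme cluster count happen to agree; the second string shows why code points alone are not sufficient.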

Cat-sushi commented 3 years ago

I think it should be constant, but I'm not sure it is a good idea. So, I changed the title.

Cat-sushi commented 3 years ago

The naming scheme for the prefix must be coordinated with #886 and any other related proposals.

AKushWarrior commented 3 years ago

I don't know that the g"str" syntax is necessarily in line with Dart style conventions to this point, though there is precedent in Rust's byte string literal syntax b"str". I might prefer to simply be able to access "words".characters or "words".clusters; that's pretty much how it's handled now with codeUnits and runes.

I agree that the characters package should be included as a core package; it provides a fundamental functionality, and it's a lot easier to import "dart:characters" than go to pubspec.yaml, include characters, come back to my file, import the package, and remember why I needed it in the first place.

Cat-sushi commented 3 years ago

> I might prefer to simply be able to access "words".characters or "words".clusters; that's pretty much how it's handled now with codeUnits and runes

There is a proposal by a core team member to introduce a constant for a single code point (but not a sequence of code points) with a similar syntax. See #886, which discusses why a literal is needed. "words".characters already exists; it returns an Iterable view of the String. A sequence of code units is the default representation of a String, and String natively provides a code-unit-based API. On the other hand, String.codeUnits produces a List<int> in which every 16-bit code unit is represented as a 64-bit int, which serves quite a different purpose from that of Characters.

As you said, grapheme clusters are fundamental, so I think they deserve a literal.
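As a rough illustration of the integer-based views mentioned above, the following sketch prints what String.codeUnits and String.runes contain for the example string. It uses only dart:core; the hex helper is a hypothetical name introduced here for readable output.

```dart
/// Hypothetical helper: formats integers as upper-case hex, comma-separated.
String hex(Iterable<int> values) =>
    values.map((v) => '0x${v.toRadixString(16).toUpperCase()}').join(', ');

void main() {
  const text = '𠮷野';

  // Each UTF-16 code unit becomes one int in the list.
  final List<int> units = text.codeUnits;
  print(hex(units)); // 0xD842, 0xDFB7, 0x91CE

  // Runes combines the surrogate pair into one code point.
  final List<int> points = text.runes.toList();
  print(hex(points)); // 0x20BB7, 0x91CE
}
```

Both views yield plain integers, whereas Characters yields substrings, which is one way to see that they serve different purposes.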

Cat-sushi commented 3 years ago

Characters cs = '𠮷野'; // lint: omit_local_variable_types

can be rewritten as

var cs = g'𠮷野';

lrhn commented 3 years ago

If we move Characters into the platform libraries, then adding a literal for creating (effectively) const Characters(stringLiteral) seems reasonable.

I'm also sure that some will argue that Characters should be the default string literal, and you'd have to write u16"...." to get the current string. (Then u8"...." could be UTF-8 encoded). That's a tough sell, though.

Cat-sushi commented 3 years ago

@lrhn

> I'm also sure that some will argue that Characters should be the default string literal, and you'd have to write u16"...." to get the current string. (Then u8"...." could be UTF-8 encoded). That's a tough sell, though.

I know. I'm not asking for that much.

dnfield commented 3 years ago

It might be nice to have a lint discouraging people from using String.length too. It's almost never what they really want.

lrhn commented 3 years ago

I can assure you, as someone who's written quite a lot of small parsers, that String.length is exactly what I want when I traverse the code units of a string. Parsing JSON, or integer literals, or URLs, or XML, or any other structured textual input which is commonly stored as a String, is quite different from handling user-written text. The Dart String class contains both. The API just happens to be better suited for the former.

A Dart String is a sequence of code units. Any abstraction on top of that is a separate class (Runes, Characters). You can, and should, choose the abstraction you need, but sometimes "sequence of code units" is the abstraction level you need.

A String is not only for text - words and phrases intended to be displayed as such. It supports that as well.
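The kind of code-unit traversal described above might look like the following sketch, which scans an unsigned decimal integer using String.length and codeUnitAt. The function parseUnsignedInt is a hypothetical example written for this illustration, not part of any Dart library, and it ignores overflow for simplicity.

```dart
/// Hypothetical example: parses an unsigned decimal integer starting at
/// [start]. Returns null if there is no digit at that position.
int? parseUnsignedInt(String source, [int start = 0]) {
  const zero = 0x30; // code unit of '0'
  const nine = 0x39; // code unit of '9'
  var value = 0;
  var i = start;
  // String.length is the number of code units, which is exactly the
  // bound needed when indexing with codeUnitAt.
  while (i < source.length) {
    final unit = source.codeUnitAt(i);
    if (unit < zero || unit > nine) break;
    value = value * 10 + (unit - zero);
    i++;
  }
  return i == start ? null : value;
}

void main() {
  print(parseUnsignedInt('1432 open')); // 1432
  print(parseUnsignedInt('#886', 1)); // 886
  print(parseUnsignedInt('abc')); // null
}
```

Because every character the parser cares about is a single ASCII code unit, working at the code-unit level is both correct and cheap here; grapheme segmentation would add nothing.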

Cat-sushi commented 3 years ago

@dnfield @lrhn

> String.length is exactly what I want when I traverse the code units of a string

Yes. The problem is that String.length is too exposed to average programmers. Deprecating String.length and introducing String.size might be a solution, but that was the discussion in #1428.

This proposal is only about the literal and making Characters part of dart:core.