dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Suggestion] Introduce more easy-to-use ways to work with grapheme clusters #77200

Open Gnbrkm41 opened 1 year ago

Gnbrkm41 commented 1 year ago

(Not marked as an API proposal because I think it definitely needs some more designing before putting one up)

When handling Unicode strings, you quickly realise that a string can be split into several different units. There are the raw code units of the encoding in use (char for UTF-16, byte for UTF-8, int for UTF-32), there are codepoints (System.Rune), and then there are grapheme clusters ("text elements"). In many cases these all coincide (e.g. normalised strings consisting only of BMP characters, encoded in UTF-16), but not always.
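To make the three units concrete, here is a minimal sketch that counts each of them for a string. The `TextUnits` helper is a hypothetical name used only for illustration; the calls it makes (`string.Length`, `string.EnumerateRunes`, `StringInfo.LengthInTextElements`) are existing APIs. It assumes .NET 5 or later, where text elements follow extended grapheme cluster rules; older runtimes may report different text element counts for emoji sequences.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class TextUnits
{
    // Hypothetical helper: returns the number of UTF-16 code units,
    // runes (Unicode scalar values) and text elements in a string.
    public static (int Chars, int Runes, int TextElements) Count(string s) =>
        (s.Length, s.EnumerateRunes().Count(), new StringInfo(s).LengthInTextElements);
}
```

For plain ASCII all three counts agree. For "e" followed by U+0301 (combining acute accent), there are two chars and two runes but a single text element.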

The widespread use of emojis (among other things) in modern applications means developers may run into situations where they need to work with individual grapheme clusters, especially when handling arbitrary user-supplied strings for display in a UI.

For example, one may want to truncate long user-supplied strings for display in a UI and let users decide whether to see the full text by clicking a "Read more" button. A naive implementation would just take a substring of the full string. However, this proves problematic in certain cases, for example those involving emojis, as shown below (not an endorsement or anything, just something I ran into a few days ago):

[Two screenshots: Screenshot_20221018-234858_YouTube_1 and Screenshot_20221018-234901_YouTube_1]

In this example (the YouTube application on Android), the 👉🏻 emoji, which consists of U+1F449 White Right Pointing Backhand Index and U+1F3FB Emoji Modifier Fitzpatrick Type-1-2, each encoded as a surrogate pair, ends up truncated, which results in U+FFFD Replacement Character being displayed instead.

While the above example can be solved somewhat by using EnumerateRunes, grapheme clusters are still not taken into account, so you may run into other cases such as 🧑🏽‍👦🏻‍🧒 which, without proper handling of grapheme clusters, may end up truncated inappropriately. Emojis are not the only case where multiple codepoints combine into a single grapheme cluster: there are also Hangul syllables composed from parts, letters with combining marks, Indic consonant clusters, etc. (although I'm not exactly sure how commonly those are used).
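As a rough illustration of the truncation problem, the sketch below cuts a string after a given number of text elements instead of a given number of chars. `SafeTruncate.ByTextElements` is a hypothetical helper, not an existing .NET API. It relies on the static `StringInfo.GetNextTextElementLength(string, int)` overload and assumes .NET 5 or later.

```csharp
using System;
using System.Globalization;

public static class SafeTruncate
{
    // Hypothetical helper: returns the longest prefix of `text` that contains
    // at most `maxTextElements` text elements, never cutting through one.
    public static string ByTextElements(string text, int maxTextElements)
    {
        int index = 0;
        for (int i = 0; i < maxTextElements && index < text.Length; i++)
        {
            // Length, in chars, of the text element that starts at `index`.
            index += StringInfo.GetNextTextElementLength(text, index);
        }
        return text.Substring(0, index);
    }
}
```

With the input "ab" followed by U+1F449 U+1F3FB and then "cd", `text.Substring(0, 3)` ends in a lone high surrogate, whereas `ByTextElements(text, 3)` keeps the whole emoji.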

Another issue on this topic outlines more examples where better handling of grapheme clusters would be required: https://github.com/dotnet/runtime/issues/31642

The above demonstrates how, in most user-facing situations, the grapheme cluster is the most logical unit to operate on, as it is meant to be displayed to and understood by humans as a single 'character' / 'glyph'. .NET currently has decent support for handling the raw code units and individual codepoints; however, I feel that it does not expose enough APIs to manipulate grapheme clusters effectively.

Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with StringInfo, such as StringInfo.GetTextElementEnumerator and StringInfo.GetNextTextElementLength.

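However, given that those methods have existed since .NET Framework 1.1, they are less than ideal:

* GetTextElementEnumerator
  * Is a non-generic IEnumerator, so you can't use foreach with it, use LINQ, etc.
    * (It may be possible to implement IEnumerable on the enumerator itself, however: https://github.com/dotnet/runtime/issues/19423 )
  * Has an ugly object Current property (which is just GetTextElement but typed as object)
  * The only way to access an individual cluster is GetTextElement, which returns a new string. Using it could therefore allocate a new string for every single char in the worst case (if the original string consists only of non-combining, non-surrogate characters).
* GetNextTextElementLength
  * You can obtain the lengths of the individual grapheme clusters, so you could write your own non-allocating grapheme cluster enumerator with it; however, this involves extra steps and is less intuitive to use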
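For reference, this is roughly what enumerating text elements with the existing `StringInfo.GetTextElementEnumerator` API looks like. The wrapper class and method names are hypothetical; the sketch only shows the manual `MoveNext` loop and the per-element string allocation.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LegacyEnumeration
{
    // Hypothetical wrapper: collects every text element of `s` as a string.
    public static List<string> TextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        // The enumerator is non-generic, so it has to be driven by hand.
        while (e.MoveNext())
        {
            // GetTextElement allocates a new string for each element.
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```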

More generally, these methods for working with grapheme clusters are (somewhat) hidden away in the System.Globalization namespace, when IMO they really belong in the String class.

Dart has a characters package that allows you to access the sequence of grapheme clusters like str.characters, and, as an extreme example, Swift makes individual grapheme clusters (named Character) the default unit of manipulation. I believe that .NET should make handling grapheme clusters easier and more accessible.

Rust: https://crates.io/crates/unicode-segmentation

Initially, I was thinking of essentially providing something like EnumerateRunes that yields grapheme clusters instead of runes, but I'm not sure whether that alone would be enough to handle grapheme clusters effectively and easily.
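One possible shape for such an EnumerateRunes-style API is sketched below. Both `EnumerateTextElements` and `TextElementSpanEnumerator` are hypothetical names invented for this example and are not part of .NET. The sketch builds on the existing `StringInfo.GetNextTextElementLength(ReadOnlySpan<char>)` overload (assuming .NET 5 or later) and yields slices of the original string rather than new strings.

```csharp
using System;
using System.Globalization;

public static class TextElementExtensions
{
    // Hypothetical extension method; no such member exists on string today.
    public static TextElementSpanEnumerator EnumerateTextElements(this string s) =>
        new TextElementSpanEnumerator(s.AsSpan());
}

// Hypothetical non-allocating enumerator over text elements.
public ref struct TextElementSpanEnumerator
{
    private ReadOnlySpan<char> _remaining;

    public TextElementSpanEnumerator(ReadOnlySpan<char> text)
    {
        _remaining = text;
        Current = default;
    }

    // The current text element, as a slice of the original text.
    public ReadOnlySpan<char> Current { get; private set; }

    // Lets the type be used directly in a foreach statement.
    public TextElementSpanEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_remaining.IsEmpty) return false;
        int length = StringInfo.GetNextTextElementLength(_remaining);
        Current = _remaining.Slice(0, length);
        _remaining = _remaining.Slice(length);
        return true;
    }
}
```

Because the enumerator is a ref struct it cannot be used with LINQ or stored on the heap, which is one instance of the "limited in capabilities" trade-off of a span-based design.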

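Two approaches I can think of are as follows:

* Come up with a new string-like / string wrapper type that treats a single grapheme cluster as the default unit of manipulation (basically, what Swift did with its strings)
  * Do we introduce a new type that represents individual grapheme clusters?
* Come up with multiple helper methods that operate on string / ReadOnlySpans, treating those strings / spans as sequences of grapheme clusters
  * Basically providing EnumerateRunes-like methods on string as members / as extension methods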

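Here's a list of operations on strings, roughly taken from the string documentation:

* Substring
  * Substring that takes grapheme clusters into consideration
* Comparison
  * Comparison between normalised characters and non-normalised characters? Although it could be argued that invariant culture comparison is enough
* Enumeration
  * Sequential enumeration using enumerators - should be non-allocating if we use ReadOnlySpan, but is limited in capabilities
  * Returning a list of indices / Ranges - might allocate, but not as much
  * Returning actual strings
* Indexing
  * Not exactly sure about this. It would not quite be random access unless we pre-calculate where the boundaries are...
* IsNormalized
* Normalize
* Insert
* IsNullOrEmpty / IsNullOrWhitespace
* Join
* Searching
  * Contains - might be enough with invariant culture comparison
  * IndexOf / LastIndexOf (Any) - Should we return the char index or the index of the grapheme cluster?
  * StartsWith / EndsWith
* Splitting
* PadLeft/Right
* Remove
* Replace
* ReplaceLineEndings
* ToUpper / ToLower
* Trim (Start/End)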
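Some of these operations can already be approximated with existing `StringInfo` members. As a sketch of the "list of indices / Ranges" idea, the hypothetical `TextElementRanges.Get` helper below turns the start indices returned by `StringInfo.ParseCombiningCharacters` into one `Range` per text element, expressed in char offsets. It assumes .NET 5 or later for extended grapheme cluster behaviour.

```csharp
using System;
using System.Globalization;

public static class TextElementRanges
{
    // Hypothetical helper: one Range (in UTF-16 code unit offsets) per text element.
    public static Range[] Get(string s)
    {
        // Existing API: start index of every text element in `s`.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        var ranges = new Range[starts.Length];
        for (int i = 0; i < starts.Length; i++)
        {
            // An element ends where the next one starts, or at the end of the string.
            int end = i + 1 < starts.Length ? starts[i + 1] : s.Length;
            ranges[i] = starts[i]..end;
        }
        return ranges;
    }
}
```

Indexing the string with one of the returned ranges, as in `s[ranges[1]]`, yields the same text as `new StringInfo(s).SubstringByTextElements(1, 1)`. Computing the ranges once also gives the pre-calculated boundaries that random access by grapheme cluster would need.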

Something I haven't considered is the UTF-8 encoded ReadOnlySpan<byte> strings and ReadOnlySpan<char>s. If we are going to provide those grapheme cluster APIs to string, it might be worthwhile to provide similar APIs for those span-based strings as well.

Any opinions are welcome.

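Also, one open question: how should we refer to "grapheme clusters" in API names and such?

* TextElement
  * Existing APIs already call it that
  * However, aren't there other types also named TextElement in some UI frameworks?
  * "Text element" can really mean anything IMO, ranging from individual characters to words / sentences etc.
* GraphemeCluster
  * The most correct wording, as it is the term used in the Unicode documents
  * Kind of a long name?
* Character
  * Some other language libraries refer to it as Character
  * But we have char, which is very different from a grapheme cluster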

I have mixed feelings about choosing between TextElement and GraphemeCluster, but I'm personally fine with going with TextElement.

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Issue Details
(Not marked as an API proposal because I think it definitely needs some more designing before putting one up)

When handling Unicode strings, you quickly realise that a string can be split into several different units. There are the raw code units of the encoding in use (`char` for UTF-16, `byte` for UTF-8, `int` for UTF-32), there are codepoints (`System.Rune`), and then there are grapheme clusters ("text elements"). In many cases these all coincide (e.g. normalised strings consisting only of BMP characters, encoded in UTF-16), but not always.

The widespread use of emojis (among other things) in modern applications means developers may run into situations where they need to work with individual grapheme clusters, especially when handling arbitrary user-supplied strings for display in a UI.

For example, one may want to truncate long user-supplied strings for display in a UI and let users decide whether to see the full text by clicking a "Read more" button. A naive implementation would just take a substring of the full string. However, this proves problematic in certain cases, for example those involving emojis, as shown below (not an endorsement or anything, just something I ran into a few days ago):

![Screenshot_20221018-234858_YouTube_1](https://user-images.githubusercontent.com/42944058/196609936-82536c13-96e3-4f0d-aa11-cd0ff7fee6b8.jpg) ![Screenshot_20221018-234901_YouTube_1](https://user-images.githubusercontent.com/42944058/196609944-d041ae50-31b4-40fe-8ffd-1c0107c9b503.jpg)

In this example (the YouTube application on Android), the 👉🏻 emoji, which consists of `U+1F449 White Right Pointing Backhand Index` and `U+1F3FB Emoji Modifier Fitzpatrick Type-1-2`, each encoded as a surrogate pair, ends up truncated, which results in `U+FFFD Replacement Character` being displayed instead.

While the above example can be *solved* somewhat by using `EnumerateRunes`, grapheme clusters are still not taken into account, so you may run into other cases such as 🧑🏽‍👦🏻‍🧒 which, without proper handling of grapheme clusters, may end up truncated inappropriately. Emojis are not the only case where multiple codepoints combine into a single grapheme cluster: there are also Hangul syllables composed from parts, letters with combining marks, Indic consonant clusters, etc. (although I'm not exactly sure how commonly those are used).

Another issue on this topic outlines more examples where better handling of grapheme clusters would be required: https://github.com/dotnet/runtime/issues/31642

The above demonstrates how, in most user-facing situations, the grapheme cluster is the most logical unit to operate on, as it is meant to be displayed to and understood by humans as a single 'character' / 'glyph'. .NET currently has decent support for handling the raw code units and individual codepoints; however, I feel that it does not expose enough APIs to manipulate grapheme clusters effectively.

Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with `StringInfo`, such as `StringInfo.GetTextElementEnumerator` and `StringInfo.GetNextTextElementLength`.

However, given that those methods have existed since .NET Framework 1.1, they are less than ideal:

* `GetTextElementEnumerator`
  * Is a non-generic `IEnumerator`, so you can't use `foreach` with it, use LINQ, etc.
    * (It may be possible to implement IEnumerable on the enumerator itself, however: https://github.com/dotnet/runtime/issues/19423 )
  * Has an ugly `object Current` property (which is just `GetTextElement` but typed as `object`)
  * The only way to access an individual cluster is `GetTextElement`, which returns a new `string`. Using it could therefore allocate a new `string` for every single `char` in the worst case (if the original string consists only of non-combining, non-surrogate characters).
* `GetNextTextElementLength`
  * You can obtain the lengths of the individual grapheme clusters, so you could write your own non-allocating grapheme cluster enumerator with it; however, this involves extra steps and is less intuitive to use

More generally, these methods for working with grapheme clusters are (somewhat) hidden away in the System.Globalization namespace, when IMO they really belong in the String class.

[Dart has a `characters` package](https://pub.dev/packages/characters) that allows you to access the sequence of grapheme clusters like `str.characters`, and, as an extreme example, Swift makes individual grapheme clusters (named `Character`) the default unit of manipulation. I believe that .NET should make handling grapheme clusters easier and more accessible.

Rust: https://crates.io/crates/unicode-segmentation

Initially, I was thinking of essentially providing something like `EnumerateRunes` that yields grapheme clusters instead of runes, but I'm not sure whether that alone would be enough to handle grapheme clusters effectively and easily.

Two approaches I can think of are as follows:

* Come up with a new `string`-like / `string` wrapper type that treats a single grapheme cluster as the default unit of manipulation (basically, what Swift did with its strings)
  * Do we introduce a new type that represents individual grapheme clusters?
* Come up with multiple helper methods that operate on `string` / `ReadOnlySpan`s, treating those strings / spans as sequences of grapheme clusters
  * Basically providing `EnumerateRunes`-like methods on string as members / as extension methods

Here's a list of operations on strings, roughly taken from the string documentation:

* Substring
  * Substring that takes grapheme clusters into consideration
* Comparison
  * Comparison between normalised characters and non-normalised characters? Although it could be argued that invariant culture comparison is enough
* Enumeration
  * Sequential enumeration using enumerators - should be non-allocating if we use `ReadOnlySpan`, but is limited in capabilities
  * Returning a list of indices / `Range`s - might allocate, but not as much
  * Returning actual `string`s
* Indexing
  * Not exactly sure about this. It would not quite be random access unless we pre-calculate where the boundaries are...
* IsNormalized
* Normalize
* Insert
* IsNullOrEmpty / IsNullOrWhitespace
* Join
* Searching
  * Contains - might be enough with invariant culture comparison
  * IndexOf / LastIndexOf (Any) - Should we return the `char` index or the index of the grapheme cluster?
  * StartsWith / EndsWith
* Splitting
* PadLeft/Right
* Remove
* Replace
* ReplaceLineEndings
* ToUpper / ToLower
* Trim (Start/End)

Something I haven't considered is the UTF-8 encoded `ReadOnlySpan<byte>` strings and `ReadOnlySpan<char>`s. If we are going to provide those grapheme cluster APIs to `string`, it might be worthwhile to provide similar APIs for those span-based strings as well.

Any opinions are welcome.

Also, one open question: how should we refer to "grapheme clusters" in API names and such?

* `TextElement`
  * Existing APIs already call it that
  * However, aren't there other types also named `TextElement` in some UI frameworks?
  * "Text element" can really mean anything IMO, ranging from individual characters to words / sentences etc.
* `GraphemeCluster`
  * *The most correct* wording, as it is the term used in the Unicode documents
  * Kind of a long name?
* `Character`
  * Some other language libraries refer to it as `Character`
  * But we have `char`, which is very different from a grapheme cluster

I have mixed feelings about choosing between `TextElement` and `GraphemeCluster`, but I'm personally fine with going with `TextElement`.

Author: Gnbrkm41
Assignees: -
Labels: `area-System.Globalization`, `untriaged`
Milestone: -
tarekgh commented 1 year ago

CC @GrabYourPitchforks

Neme12 commented 10 months ago

Related issue: #91003