dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.94k stars 4.64k forks source link

[API Proposal]: Add modern, span-based alternative to `StringInfo.GetTextElementEnumerator` #91003

Open Neme12 opened 1 year ago

Neme12 commented 1 year ago

Background and motivation

There's an existing API, StringInfo.GetTextElementEnumerator, which allows us to enumerate textual elements (grapheme clusters, or in other words, the individual characters that actually get printed on screen, as people usually think of them, and which can potentially consist of multiple Runes), but this enumerator returns string instances as opposed to a span of the original string, adding unnecessary copying and allocations that could be avoided. It's also cumbersome because it's a non-generic enumerator, and doesn't even have a GetEnumerator() method, so it cannot be used in a foreach loop.

This is the existing API:

namespace System.Globalization;

public class StringInfo
{
    public static TextElementEnumerator GetTextElementEnumerator(string str);
    public static TextElementEnumerator GetTextElementEnumerator(string str, int index);
}

public partial class TextElementEnumerator : System.Collections.IEnumerator
{
    internal TextElementEnumerator();
    public object Current { get; }
    public int ElementIndex { get; }
    public string GetTextElement();
    public bool MoveNext();
    public void Reset();
}

API Proposal

I'm proposing adding a modern replacement for this API that uses a span. I also propose adding it next to the existing EnumerateRunes methods to make it more discoverable, and let people think about which to choose when they see these two methods next to each other.

 namespace System;

 public sealed partial class String
 {
     public StringRuneEnumerator EnumerateRunes();
+    public StringTextElementEnumerator EnumerateTextElements();
 }

 public static partial class MemoryExtensions
 {
     public static SpanRuneEnumerator EnumerateRunes(this ReadOnlySpan<char> span);
     public static SpanRuneEnumerator EnumerateRunes(this Span<char> span);
+    public static SpanTextElementEnumerator EnumerateTextElements(this ReadOnlySpan<char> span);
 }

 namespace System.Text;

 public struct StringRuneEnumerator : IEnumerable<Rune>, IEnumerator<Rune>
 {
     public Rune Current { get; }
     public StringRuneEnumerator GetEnumerator();
     public bool MoveNext();
 }

+public struct StringTextElementEnumerator
+{
+    public ReadOnlySpan<char> Current { get; }
+    public StringTextElementEnumerator GetEnumerator();
+    public bool MoveNext();
+}

 public ref struct SpanRuneEnumerator
 {
     public Rune Current { get; }
     public SpanRuneEnumerator GetEnumerator();
     public bool MoveNext();
 }

+public ref struct SpanTextElementEnumerator
+{
+    public ReadOnlySpan<char> Current { get; }
+    public SpanTextElementEnumerator GetEnumerator();
+    public bool MoveNext();
+}

API Usage

// e.g. properly reversing a string using this API:

void Reverse(string str)
{
    var stringBuilder = new StringBuilder();

    foreach (var character in str.EnumerateTextElements())
    {
        stringBuilder.Insert(0, character);
    }

    return stringBuilder.ToString();
}

Alternative Designs

No response

Risks

No response

GrabYourPitchforks commented 1 year ago

FYI there already is a span based API available here: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.getnexttextelementlength?view=net-7.0

And a strrev function available here: https://apisof.net/catalog/a7f49f7b-010e-73f5-6374-dfd483f88ab7

Neme12 commented 1 year ago

Well, I know about that one, but that's not really a span-based API, that requires doing it manually. I wish there was an enumerator. The string reverse was just an example of something that requires enumerating the text elements as opposed to Runes.

Neme12 commented 1 year ago

Oh, I see, you meant the source being a span. What I meant by span-based is the enumeration of ReadOnlySpan<char> items as opposed to string items.

Neme12 commented 1 year ago

This is related to #77200

tarekgh commented 1 year ago

Oh, I see, you meant the source being a span. What I meant by span-based is the enumeration of ReadOnlySpan items as opposed to string items.

The APIs that @GrabYourPitchforks pointed at https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.getnexttextelementlength?view=net-7.0 can give you the length and use it as a span. Something like:

            ReadOnlySpan<char> s = "Hello World!";

            int clusterLength;
            while ((clusterLength = StringInfo.GetNextTextElementLength(s)) > 0)
            {
                Console.WriteLine($"{s.Slice(0, clusterLength)}");
                s = s.Slice(clusterLength);
            }
Neme12 commented 1 year ago

Yes, I know about that API. I'm proposing having an enumerator. The old one is unusable for spans.

tarekgh commented 1 year ago

Why you need a new API then?

Neme12 commented 1 year ago

Because the old TextElementEnumerator is allocating and unusable with span. It would be good to have a span-based replacement. Having an enumerator is much more convenient than doing it manually. For the same reason that EnumerateRunes exists, and Rune.GetRuneAt wasn't enough, even though it's sufficient for enumeration, because it was recognized that having an enumerator is more convenient.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-globalization See info in area-owners.md if you want to be subscribed.

Issue Details
### Background and motivation There's an existing API, `StringInfo.GetTextElementEnumerator`, which allows us to enumerate textual elements (grapheme clusters, or in other words, the individual characters that actually get printed on screen, as people usually think of them, and which can potentially consist of multiple `Rune`s), but this enumerator returns `string` instances as opposed to a span of the original string, adding unnecessary copying and allocations that could be avoided. It's also cumbersome because it's a non-generic enumerator, and doesn't even have a `GetEnumerator()` method, so it cannot be used in a foreach loop. This is the existing API: ```c# namespace System.Globalization; public class StringInfo { public static TextElementEnumerator GetTextElementEnumerator(string str); public static TextElementEnumerator GetTextElementEnumerator(string str, int index); } public partial class TextElementEnumerator : System.Collections.IEnumerator { internal TextElementEnumerator(); public object Current { get; } public int ElementIndex { get; } public string GetTextElement(); public bool MoveNext(); public void Reset(); } ``` ### API Proposal I'm proposing adding a modern replacement for this API that uses a span. I also propose adding it next to the existing `EnumerateRunes` methods to make it more discoverable, and let people think about which to choose when they see these two methods next to each other. ```diff namespace System; public sealed partial class String { public StringRuneEnumerator EnumerateRunes(); + public StringTextElementEnumerator EnumerateTextElements(); } public static partial class MemoryExtensions { public static SpanRuneEnumerator EnumerateRunes(this ReadOnlySpan span); public static SpanRuneEnumerator EnumerateRunes(this Span span); + public static SpanTextElementEnumerator EnumerateTextElements(this ReadOnlySpan span); + public static SpanTextElementEnumerator EnumerateTextElements(this Span span); } namespace System.Text; public struct StringRuneEnumerator : IEnumerable, IEnumerator { public Rune Current { get; } public StringRuneEnumerator GetEnumerator(); public bool MoveNext(); } +public struct StringTextElementEnumerator +{ + public ReadOnlySpan Current { get; } + public StringTextElementEnumerator GetEnumerator(); + public bool MoveNext(); +} public ref struct SpanRuneEnumerator { public Rune Current { get; } public SpanRuneEnumerator GetEnumerator(); public bool MoveNext(); } +public ref struct SpanTextElementEnumerator +{ + public ReadOnlySpan Current { get; } + public SpanTextElementEnumerator GetEnumerator(); + public bool MoveNext(); +} ``` ### API Usage ```csharp // e.g. properly reversing a string using this API: void Reverse(string str) { var stringBuilder = new StringBuilder(); foreach (var character in str.EnumerateTextElements()) { stringBuilder.Insert(0, character); } return stringBuilder.ToString(); } ``` ### Alternative Designs _No response_ ### Risks _No response_
Author: Neme12
Assignees: -
Labels: `api-suggestion`, `area-System.Globalization`, `needs-area-label`
Milestone: Future
Neme12 commented 1 year ago

See also #77200. I agree with many of these points:

Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with StringInfo, such as StringInfo.GetTextElementEnumerator and StringInfo.GetNextTextElementLength.

However, given how those methods existed from .NET Framework 1.1, those two methods are less than ideal.

  • GetTextElementEnumerator

    • is a non-generic IEnumerator, so you can't use foreach with it / use LINQ etc

    • (It may be possible to implement IEnumerable on the enumerator itself, however: Proposal: Better API for StringInfo.GetTextElementEnumerator #19423 )

    • Has an ugly object Current property (which is just GetTextElement but in object)

    • The only way to access the individual cluster is to use GetTextElement, however this returns a new string, meaning that using this could result in a new string being allocated, in the worst case for every single char (if the original string is made out of only non-combining, non-surrogate characters.)

  • GetNextTextElementLength

    • You are able to obtain lengths of the individual grapheme clusters, so using this you could write your own non-allocating grapheme cluster enumerator, however involves extra steps & is less intuitive to use

And generally, those two methods to work with grapheme clusters are (somewhat) hidden away in System.Globalization namespace, when IMO it really belongs in the String class.

IMHO, there should be a more usable, more friendly and a first-class way of dealing with grapheme clusters in .NET than just some length check and an old, non-generic enumerator hidden away somewhere in System.Globalization in a class that isn't discoverable at all. I feel like this should be a first-class citizen in .NET as much as Runes are, and should be exposed as extension methods next to EnumerateRunes.

Neme12 commented 1 year ago

And IMHO there should be a lot more than just enumeration - common operations like Substring, IndexOf, etc. that are grapheme cluster bassed, as was mentioned in #77200. Most of the time when developers work with linguistic text they actually need to perform these operations on grapheme clusters and not on chars. Code that uses chars works most of the time, but not all the time (especially nowadays when emojis, which can even span multiple Runes, are really common). These are really common operations and there is no first-class support for them in .NET. I guess everyone just accepted the reality that most developers won't write the correct code anyway and we can't teach them better ways, char is good enough, or I don't know :/

There should be a lot more samples that use string operations on text elements than on chars etc, because that's the correct way to do string operations on actual human-readable text. Yes, developers get this wrong 99% of the time, but can't we at least show them how to do it the right way? And more importantly, provide easy to use APIs to do things the right way?

IMHO even basic samples like https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/ should really be showing Substring etc using grapheme clusters, and not chars. The fact that there are none and they all use char-based slices etc leads developers down the wrong path. And the fact that .NET doesn't even provide good APIs for being able to do this the right way is really sad.

tarekgh commented 1 year ago

It may be worth thinking to have a TextBreaker class which can do different operations (enumeration, searching, ...etc.).

GrabYourPitchforks commented 1 year ago

FWIW I don't have a strong opinion on how the team should triage this. If the area owners think it's useful, great!

My earlier comment was just to point out that lacking an official first class API, it would technically be feasible to implement it oneself if needed by relying on existing API surface. But I can definitely see the argument that doing that oneself would involve much boilerplate code and would be annoying.

Neme12 commented 1 year ago

it would technically be feasible to implement it oneself if needed by relying on existing API surface

I already have my own extension methods and utilities to work with grapheme clusters, but I feel like I shouldn't have to and this is something that should be built-in. Swift's string operations work with grapheme clusters by default - indexing, substring, etc, and they even have a type that represents a grapheme cluster, and simply call it Character. That's the proper way to do it, not char like we have :/

I'm thinking EnumerateCharacters might be a better name than EnumerateTextElements, because that's what these thing are - characters, human readable characters as they actually appear on screen. "Text element" is kind of vague. Swift calls it Character. Although that might be confused with char.

It may be worth thinking to have a TextBreaker class

Yes, there are a lot more operations that are commonly needed other than enumeration - e.g. getting the length in characters, getting a character at a specified index, IndexOf where the index is the index in the number of characters, not in chars, substring, etc... as was mentioned in #77200. These are common operations and there's no easy way to do them in .NET on actual characters as opposed to code points, or chars. You almost never want to work with individual chars, do substrings in chars, etc, unless you're working with ASCII... if you don't want your code to be broken and want your application to be correct with all possible languages and to work well with emojis etc. I'm jealous of Swift on this because they do all this correctly by default. (And .NET doesn't even have a way to do these things, let alone it being the default way of working with string 😔) .NET only has some basic helpers and an outdated .NET Framework 1.1 enumerator hidden somewhere in an unknown class in the Globalization namepace (and I feel like it really shouldn't be there - it has little to do with globalization because the result doesn't depend on the culture. it's just basic Unicode operations. It should be more like in the System.Text namespace).

GrabYourPitchforks commented 1 year ago

It may be worth thinking to have a TextBreaker class which can do different operations (enumeration, searching, ...etc.).

I've relied on the ICU UBRK_LINE iterator for another personal project I'm working on and ended up writing a C# wrapper around ICU's implementation. Having this functionality in an OOB package would be useful.

Neme12 commented 1 year ago

What's also missing is a UTF-8 overload of StringInfo.GetNextTextElementLength.

Neme12 commented 1 year ago

I'm thinking EnumerateCharacters might be a better name than EnumerateTextElements, because that's what these thing are - characters, human readable characters as they actually appear on screen. "Text element" is kind of vague. Swift calls it Character. Although that might be confused with char.

Another option might be EnumerateGraphemes. While Unicode calls them grapheme clusters, they are really just graphemes. This would be the most descriptive & precise name. I dislike "text elements" because it's really vague - even a whole word or a sentence could be considered text elements. And AFAIK, it's not a term that any other platforms use either, just an odd name invented by .NET.

Neme12 commented 1 year ago

There is a similar issue to this one: #19423

Tragetaschen commented 9 months ago

I just very much missed the UTF-8 GetNextTextElementLength overload, especially since all the necessary pieces are there, but oh so slightly out of reach between being internal and unusable with reflection due to ReadOnlySpan.

tats-u commented 5 months ago

The current TextElementEnumerator is useless. It does not accept ReadOnlySpan<char>, and yields object, not even string. It must have yielded ReadOnlySpan<char> to avoid unnecessary heap allocations instead ideally. GetNextTextElementLength is just a tool to create such a new iterator by yourself, and missing the overload (ReadOnlySpan<char>, int).

Neme12 commented 5 months ago

I just very much missed the UTF-8 GetNextTextElementLength overload, especially since all the necessary pieces are there, but oh so slightly out of reach between being internal and unusable with reflection due to ReadOnlySpan.

Yeah, I needed that too. The implementation is written generically over a Rune decoder so adding it would just be a one line implementation. But it's not exposed :( But I also needed a UTF-32/Rune version. If only there was an API that specifically takes a ReadOnlySpan<T> and a decoder delegate like the internal one.