Open Neme12 opened 1 year ago
FYI there already is a span based API available here: https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.getnexttextelementlength?view=net-7.0
And a strrev function available here: https://apisof.net/catalog/a7f49f7b-010e-73f5-6374-dfd483f88ab7
Well, I know about that one, but that's not really a span-based API; it requires doing the enumeration manually. I wish there was an enumerator. The string reverse was just an example of something that requires enumerating the text elements as opposed to `Rune`s.
Oh, I see, you meant the source being a span. What I meant by span-based is the enumeration of `ReadOnlySpan<char>` items as opposed to `string` items.
This is related to #77200
> Oh, I see, you meant the source being a span. What I meant by span-based is the enumeration of `ReadOnlySpan<char>` items as opposed to `string` items.

The API that @GrabYourPitchforks pointed at (https://learn.microsoft.com/en-us/dotnet/api/system.globalization.stringinfo.getnexttextelementlength?view=net-7.0) gives you the length of the next text element, which you can then use to slice the span. Something like:
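```csharp
using System;
using System.Globalization;

ReadOnlySpan<char> s = "Hello World!";
int clusterLength;
// GetNextTextElementLength returns 0 once the span is empty.
while ((clusterLength = StringInfo.GetNextTextElementLength(s)) > 0)
{
    Console.WriteLine($"{s.Slice(0, clusterLength)}");
    s = s.Slice(clusterLength);
}
```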
Yes, I know about that API. I'm proposing having an enumerator. The old one is unusable for spans.
Why do you need a new API then?
Because the old `TextElementEnumerator` allocates and is unusable with spans. It would be good to have a span-based replacement. Having an enumerator is much more convenient than doing it manually. It's the same reason `EnumerateRunes` exists: `Rune.GetRuneAt` is sufficient for enumeration, but it was recognized that having an enumerator is more convenient.
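The convenience argument can be seen with runes themselves. Below is a minimal sketch that walks a string once with `Rune.GetRuneAt` and once with `EnumerateRunes`. The class and method names are illustrative only and not part of any proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RuneEnumerationSketch
{
    // Manual enumeration: the caller tracks the index and advances it
    // by the UTF-16 length of each decoded rune.
    public static List<int> ManualRunes(string s)
    {
        var values = new List<int>();
        for (int i = 0; i < s.Length; )
        {
            Rune r = Rune.GetRuneAt(s, i); // throws on an unpaired surrogate
            values.Add(r.Value);
            i += r.Utf16SequenceLength;
        }
        return values;
    }

    // Enumerator-based: no index bookkeeping in user code.
    public static List<int> EnumeratedRunes(string s)
    {
        var values = new List<int>();
        foreach (Rune r in s.EnumerateRunes())
        {
            values.Add(r.Value);
        }
        return values;
    }
}
```

Both methods yield the same code points for well-formed input; the difference is only in how much bookkeeping the caller has to write.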
Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.
Author: | Neme12 |
---|---|
Assignees: | - |
Labels: | `api-suggestion`, `area-System.Globalization`, `needs-area-label` |
Milestone: | Future |
See also #77200. I agree with many of these points:
> Currently, it appears that the only way to enumerate over the grapheme clusters is by using methods associated with `StringInfo`, such as `StringInfo.GetTextElementEnumerator` and `StringInfo.GetNextTextElementLength`.
>
> However, given how those methods existed from .NET Framework 1.1, those two methods are less than ideal.
>
> `GetTextElementEnumerator`
>
> - is a non-generic `IEnumerator`, so you can't use `foreach` with it / use LINQ etc. (It may be possible to implement IEnumerable on the enumerator itself, however: Proposal: Better API for StringInfo.GetTextElementEnumerator #19423)
> - Has an ugly `object Current` property (which is just `GetTextElement` but in `object`)
> - The only way to access the individual cluster is to use `GetTextElement`, however this returns a new `string`, meaning that using this could result in a new `string` being allocated, in the worst case for every single `char` (if the original string is made out of only non-combining, non-surrogate characters.)
>
> `GetNextTextElementLength`
>
> - You are able to obtain lengths of the individual grapheme clusters, so using this you could write your own non-allocating grapheme cluster enumerator, however this involves extra steps & is less intuitive to use
>
> And generally, those two methods to work with grapheme clusters are (somewhat) hidden away in System.Globalization namespace, when IMO it really belongs in the String class.

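To make the points in the quote concrete, here is a minimal sketch of what consuming the existing `TextElementEnumerator` looks like. The wrapper class and method name are made up for illustration; only the `StringInfo` and `TextElementEnumerator` members are real API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LegacyEnumeratorSketch
{
    public static List<string> TextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);

        // There is no GetEnumerator(), so foreach is unavailable;
        // MoveNext() has to be driven by hand.
        while (e.MoveNext())
        {
            // e.Current is typed as object. GetTextElement() gives the same
            // value as a string, which is a new allocation per element.
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```

For an input such as `"e\u0301x"` (an `e` followed by a combining acute accent, then `x`) this yields two strings, one per text element.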
IMHO, there should be a more usable, friendlier and first-class way of dealing with grapheme clusters in .NET than just some length check and an old, non-generic enumerator hidden away somewhere in System.Globalization in a class that isn't discoverable at all. I feel like this should be a first-class citizen in .NET as much as `Rune`s are, and should be exposed as extension methods next to `EnumerateRunes`.
And IMHO there should be a lot more than just enumeration: common operations like Substring, IndexOf, etc. that are grapheme cluster based, as was mentioned in #77200. Most of the time when developers work with linguistic text they actually need to perform these operations on grapheme clusters and not on `char`s. Code that uses `char`s works most of the time, but not all the time (especially nowadays when emojis, which can even span multiple `Rune`s, are really common). These are really common operations and there is no first-class support for them in .NET. I guess everyone just accepted the reality that most developers won't write the correct code anyway and we can't teach them better ways, `char` is good enough, or I don't know :/
There should be a lot more samples that use string operations on text elements than on `char`s etc., because that's the correct way to do string operations on actual human-readable text. Yes, developers get this wrong 99% of the time, but can't we at least show them how to do it the right way? And more importantly, provide easy to use APIs to do things the right way?
IMHO even basic samples like https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/ should really be showing Substring etc. using grapheme clusters, and not `char`s. The fact that there are none and they all use `char`-based slices etc. leads developers down the wrong path. And the fact that .NET doesn't even provide good APIs for being able to do this the right way is really sad.
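The gap between `char`s, `Rune`s and text elements is easy to show with existing API. The following is a small sketch; `TextLengthSketch` and `Measure` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextLengthSketch
{
    // Returns the length of the same string measured in three different units.
    public static (int Chars, int Runes, int TextElements) Measure(string s)
    {
        int runes = 0;
        foreach (Rune r in s.EnumerateRunes())
        {
            runes++;
        }

        // LengthInTextElements counts grapheme clusters as defined by StringInfo.
        int textElements = new StringInfo(s).LengthInTextElements;
        return (s.Length, runes, textElements);
    }
}
```

`TextLengthSketch.Measure("e\u0301")` returns `(2, 2, 1)`: two `char`s and two runes, but a single character as the user sees it. On .NET 5 and later, which follow Unicode extended grapheme clusters, an emoji with a skin-tone modifier such as `"\U0001F44D\U0001F3FD"` should come out as 4 `char`s, 2 runes and 1 text element.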
It may be worth thinking to have a TextBreaker class which can do different operations (enumeration, searching, ...etc.).
FWIW I don't have a strong opinion on how the team should triage this. If the area owners think it's useful, great!
My earlier comment was just to point out that lacking an official first class API, it would technically be feasible to implement it oneself if needed by relying on existing API surface. But I can definitely see the argument that doing that oneself would involve much boilerplate code and would be annoying.
> it would technically be feasible to implement it oneself if needed by relying on existing API surface

I already have my own extension methods and utilities to work with grapheme clusters, but I feel like I shouldn't have to and this is something that should be built-in. Swift's string operations work with grapheme clusters by default (indexing, substring, etc.), and they even have a type that represents a grapheme cluster, and simply call it `Character`. That's the proper way to do it, not `char` like we have :/
I'm thinking `EnumerateCharacters` might be a better name than `EnumerateTextElements`, because that's what these things are: characters, human readable characters as they actually appear on screen. "Text element" is kind of vague. Swift calls it `Character`. Although that might be confused with `char`.
> It may be worth thinking to have a TextBreaker class

Yes, there are a lot more operations that are commonly needed other than enumeration, e.g. getting the length in characters, getting a character at a specified index, IndexOf where the index is counted in characters rather than in `char`s, substring, etc., as was mentioned in #77200. These are common operations and there's no easy way to do them in .NET on actual characters as opposed to code points or `char`s. You almost never want to work with individual `char`s, do substrings in `char`s, etc., unless you're working with ASCII, if you don't want your code to be broken and want your application to be correct with all possible languages and to work well with emojis etc. I'm jealous of Swift on this because they do all this correctly by default. (And .NET doesn't even have a way to do these things, let alone it being the default way of working with string 😔) .NET only has some basic helpers and an outdated .NET Framework 1.1 enumerator hidden somewhere in an unknown class in the Globalization namespace (and I feel like it really shouldn't be there: it has little to do with globalization because the result doesn't depend on the culture. It's just basic Unicode operations. It should be more like in the System.Text namespace).
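For `string` input, `StringInfo` does expose `LengthInTextElements` and `SubstringByTextElements`, but both need a `StringInfo` instance and work on strings only. A span-based equivalent currently has to be hand-written on top of `StringInfo.GetNextTextElementLength(ReadOnlySpan<char>)`. A minimal sketch follows; `TextElementOps` and its members are hypothetical helpers, and the choice to clamp out-of-range arguments instead of throwing is an assumption made to keep it short.

```csharp
using System;
using System.Globalization;

public static class TextElementOps
{
    // Counts text elements without allocating.
    public static int CountTextElements(ReadOnlySpan<char> text)
    {
        int count = 0;
        while (!text.IsEmpty)
        {
            // For a non-empty span the returned length is at least 1.
            text = text.Slice(StringInfo.GetNextTextElementLength(text));
            count++;
        }
        return count;
    }

    // Slices by text element index and count instead of char index and count.
    // Out-of-range values are clamped to the end of the span (sketch behavior).
    public static ReadOnlySpan<char> SliceByTextElements(
        ReadOnlySpan<char> text, int start, int count)
    {
        int begin = Advance(text, 0, start);
        int end = Advance(text, begin, count);
        return text.Slice(begin, end - begin);
    }

    // Returns the char offset reached after skipping `elements` text elements.
    private static int Advance(ReadOnlySpan<char> text, int offset, int elements)
    {
        while (elements-- > 0 && offset < text.Length)
        {
            offset += StringInfo.GetNextTextElementLength(text.Slice(offset));
        }
        return offset;
    }
}
```

With this, `TextElementOps.SliceByTextElements("e\u0301xy", 1, 1)` selects `x`, whereas a `char`-based `Substring(1, 1)` would return the lone combining accent.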
> It may be worth thinking to have a TextBreaker class which can do different operations (enumeration, searching, ...etc.).

I've relied on the ICU `UBRK_LINE` iterator for another personal project I'm working on and ended up writing a C# wrapper around ICU's implementation. Having this functionality in an OOB package would be useful.
What's also missing is a UTF-8 overload of `StringInfo.GetNextTextElementLength`.
> I'm thinking `EnumerateCharacters` might be a better name than `EnumerateTextElements`, because that's what these things are: characters, human readable characters as they actually appear on screen. "Text element" is kind of vague. Swift calls it `Character`. Although that might be confused with `char`.

Another option might be `EnumerateGraphemes`. While Unicode calls them grapheme clusters, they are really just graphemes. This would be the most descriptive & precise name. I dislike "text elements" because it's really vague: even a whole word or a sentence could be considered text elements. And AFAIK, it's not a term that any other platforms use either, just an odd name invented by .NET.
There is a similar issue to this one: #19423
I just very much missed the UTF-8 `GetNextTextElementLength` overload, especially since all the necessary pieces are there, but oh so slightly out of reach between being `internal` and unusable with reflection due to `ReadOnlySpan`.
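Until such an overload exists, one possible workaround is to transcode to UTF-16 and map each text element back to its UTF-8 byte length. This is only a sketch: `Utf8TextElementSketch` is a hypothetical helper, it allocates a transcoded string (which is exactly the cost a real UTF-8 overload would avoid), and it assumes well-formed UTF-8 input.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class Utf8TextElementSketch
{
    // Returns the UTF-8 byte length of each text element, in order.
    public static List<int> TextElementByteLengths(ReadOnlySpan<byte> utf8)
    {
        // Assumption: the input is well-formed UTF-8. Invalid bytes are
        // replaced with U+FFFD while decoding, so byte counts would drift.
        string utf16 = Encoding.UTF8.GetString(utf8);
        ReadOnlySpan<char> rest = utf16;

        var lengths = new List<int>();
        int charLength;
        while ((charLength = StringInfo.GetNextTextElementLength(rest)) > 0)
        {
            // Re-measure the element in UTF-8 bytes.
            lengths.Add(Encoding.UTF8.GetByteCount(rest.Slice(0, charLength)));
            rest = rest.Slice(charLength);
        }
        return lengths;
    }
}
```

For the UTF-8 encoding of `"e\u0301x"` this gives the byte lengths 3 and 1, since the combining accent takes two bytes in UTF-8.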
The current `TextElementEnumerator` is useless.
It does not accept `ReadOnlySpan<char>`, and yields `object`, not even `string`.
Ideally it should have yielded `ReadOnlySpan<char>` instead, to avoid unnecessary heap allocations.
`GetNextTextElementLength` is just a tool for creating such a new iterator yourself, and it is missing the overload `(ReadOnlySpan<char>, int)`.
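The missing `(ReadOnlySpan<char>, int)` shape can be approximated today by slicing before the call. A minimal sketch, with `StringInfoSpanSketch` as a hypothetical helper class and the bounds check as an assumption about how such an overload might validate its index:

```csharp
using System;
using System.Globalization;

public static class StringInfoSpanSketch
{
    // Length of the text element that starts at `index` within `text`.
    public static int GetNextTextElementLength(ReadOnlySpan<char> text, int index)
    {
        // index == text.Length is allowed and yields 0 (nothing left).
        if ((uint)index > (uint)text.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }
        return StringInfo.GetNextTextElementLength(text.Slice(index));
    }
}
```

Slicing a span is cheap, so this adds no allocation; it only saves the caller from writing `Slice(index)` at every call site.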
> I just very much missed the UTF-8 `GetNextTextElementLength` overload, especially since all the necessary pieces are there, but oh so slightly out of reach between being `internal` and unusable with reflection due to `ReadOnlySpan`.

Yeah, I needed that too. The implementation is written generically over a Rune decoder so adding it would just be a one line implementation. But it's not exposed :( But I also needed a UTF-32/Rune version. If only there was an API that specifically takes a `ReadOnlySpan<T>` and a decoder delegate like the internal one.
Background and motivation
There's an existing API, `StringInfo.GetTextElementEnumerator`, which allows us to enumerate textual elements (grapheme clusters, or in other words, the individual characters that actually get printed on screen, as people usually think of them, and which can potentially consist of multiple `Rune`s), but this enumerator returns `string` instances as opposed to a span of the original string, adding unnecessary copying and allocations that could be avoided. It's also cumbersome because it's a non-generic enumerator, and doesn't even have a `GetEnumerator()` method, so it cannot be used in a foreach loop.
This is the existing API:
API Proposal
I'm proposing adding a modern replacement for this API that uses a span. I also propose adding it next to the existing `EnumerateRunes` methods to make it more discoverable, and let people think about which to choose when they see these two methods next to each other.
API Usage
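As a rough illustration of the idea, here is a minimal sketch of a span-based enumerator built on the existing `StringInfo.GetNextTextElementLength(ReadOnlySpan<char>)`. The names `EnumerateTextElements` and `TextElementSpanEnumerator` are hypothetical and chosen only to mirror `EnumerateRunes`; this is not the API shape from the proposal.

```csharp
using System;
using System.Globalization;

public static class TextElementSpanExtensions
{
    // Hypothetical extension method, mirroring the shape of EnumerateRunes.
    public static TextElementSpanEnumerator EnumerateTextElements(this ReadOnlySpan<char> text)
        => new TextElementSpanEnumerator(text);
}

// A ref struct so it can hold a span and never ends up on the heap.
public ref struct TextElementSpanEnumerator
{
    private ReadOnlySpan<char> _remaining;

    public TextElementSpanEnumerator(ReadOnlySpan<char> text)
    {
        _remaining = text;
        Current = default;
    }

    // A slice of the original input; no string is allocated per element.
    public ReadOnlySpan<char> Current { get; private set; }

    // Lets foreach consume the struct directly (pattern-based enumeration).
    public TextElementSpanEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        int length = StringInfo.GetNextTextElementLength(_remaining);
        if (length == 0)
        {
            Current = default;
            return false;
        }

        Current = _remaining.Slice(0, length);
        _remaining = _remaining.Slice(length);
        return true;
    }
}
```

With such a helper, `foreach (ReadOnlySpan<char> element in text.AsSpan().EnumerateTextElements())` visits each grapheme cluster as a slice of the original text, which is the kind of loop a correct string reverse needs.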
Alternative Designs
No response
Risks
No response