Provide an implementation of `ReadOnlySpan<Rune>.ToString`

Neme12 commented 1 year ago

Currently, ReadOnlySpan<char>.ToString has a special-cased implementation that simply returns the string. It would be nice if this worked for spans of Rune as well.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details

Currently, `ReadOnlySpan.ToString` has a special-cased implementation that simply returns the string. It would be nice if this worked for spans of `Rune` as well.

Author:	Neme12
Assignees:	-
Labels:	`area-System.Memory`, `untriaged`
Milestone:	-

GrabYourPitchforks commented 1 year ago

I mentioned this in Discord chat the other day, but it would be neat if we had a feature like this:

public interface ISpanStringable<T> {
    static string ToString(ReadOnlySpan<T> buffer);
}

public readonly struct char : ISpanStringable<char> { /* ... */ }
public readonly struct Rune : ISpanStringable<Rune> { /* ... */ }

public ref struct Span<T> {
    public override string ToString() {
        if (T is ISpanStringable<T>) { return T.ToString(this); }
        else { /* fallback logic */ }
    }
}

Then we could even get rid of Span's existing char specialization within ToString, and any type can add its own specialization of this.

GrabYourPitchforks commented 1 year ago

That said, I do question the value of this particular specialization, since I expect it to be exceedingly rare to have a contiguous collection of Rune instances. I'm struggling to think of a use case for this.

Neme12 commented 1 year ago

I love that idea.

Neme12 commented 1 year ago

In my case, I'm representing a Grapheme as a sequence of Runes and therefore work with ReadOnlySpan<Rune> regularly.

Neme12 commented 1 year ago

I love that idea.

If only it could also work for UTF-8 though. 😔 It's such a shame that Utf8Char was never added and we have to forever live with such a bad debugging experience. 😔

neon-sunset commented 1 year ago

(first and foremost, apologies for the plug, really, but it's relevant work)

If only it could also work for UTF-8 though.

https://github.com/U8String/U8String/blob/main/src/U8Enumerators.cs#L203 + https://github.com/U8String/U8String/blob/main/src/Shared/U8Conversions.cs

Generally speaking, any improvements to the current state of .NET APIs for working with text are very welcome. In this particular case, I think the better option would be introducing Unicode scalars to UTF-16 conversion on something like

namespace System.Text.Unicode;

public static partial class Utf16
{
    OperationStatus FromUnicode(ReadOnlySpan<byte> source, Span<char> destination, out int bytesRead, out int charsWritten);

    OperationStatus FromUnicode(ReadOnlySpan<Rune> source, Span<char> destination, out int runesRead, out int charsWritten);
}

and then subsequent methods which augment this for a more convenient usage. Unlike ROS<char> -> string conversions, decoding Runes to strings is not a common operation so special-casing ROS<Rune>.ToString() seems rather counterintuitive.

While such API are being discussed, you can already convert Span<Rune> to string with the following snippet instead:

var text = "hello world";
var runes = text.EnumerateRunes().ToArray();
var decoded = Encoding.UTF32.GetString(MemoryMarshal.Cast<Rune, byte>(runes));

Assert.Equal(text, decoded);

Neme12 commented 1 year ago

@neon-sunset Sorry, I don't understand how what you're saying is related to this proposal.

While such API are being discussed, you can already convert Span to string with the following snippet instead:

I know I can manually convert it to string, but it helps with debugging to be able to see it immediately, just like it's really useful with ReadOnlySpan<char>.

Neme12 commented 1 year ago

I mentioned this in Discord chat the other day, but it would be neat if we had a feature like this:

It's almost tempting to call that interface ISpanFormattable 😄

GrabYourPitchforks commented 1 year ago

var decoded = Encoding.UTF32.GetString(MemoryMarshal.Cast<Rune, byte>(runes));

This will result in undefined behavior. The bit pattern which backs a Rune instance is not guaranteed to be a little-endian UTF-32 code point.

Neme12 commented 1 year ago

I just do a quick string.Concat(runes.ToArray()) when debugging. There are more efficient ways of course. Although it would be nice if there was an easier way to do this efficiently.

neon-sunset commented 1 year ago

This will result in undefined behavior. The bit pattern which backs a Rune instance is not guaranteed to be a little-endian UTF-32 code point.

Yes, I realize that this is pretty much a workaround, but the probability that such code will run on s390x is exceedingly small. For library code, it will get guarded behind a check for endianness.

alexrp commented 1 year ago

Yes, but the probability that such code will run on s390x is exceedingly small. For library code, it will get guarded behind a check for endianness.

The point is that Rune currently happening to have a uint field that happens to represent a UTF-32 code point is an implementation detail that is allowed to change.

GrabYourPitchforks commented 1 year ago

Yes, I realize that this is pretty much a workaround, but the probability that such code will run on s390x is exceedingly small. For library code, it will get guarded behind a check for endianness.

You're assuming that Rune will always consist of only one int-sized backing field, and that the backing field will always be a machine-endian representation of the scalar value with no reserved bits or other twiddling. This is not guaranteed behavior. By relying on this, you're relying on an unsupported implementation detail.

neon-sunset commented 1 year ago

You're assuming that Rune will always consist of only one int-sized backing field, and that the backing field will always be a machine-endian representation of the scalar value with no reserved bits or other twiddling. This is not guaranteed behavior. By relying on this, you're relying on an unsupported implementation detail.

My comments did not imply this is a good solution, just a one that works for now. If it sounds otherwise, I apologize. If Rune and other APIs for working with Unicode change and improve, I'm all for it :)

Neme12 commented 1 year ago

You're assuming that Rune will always consist of only one int-sized backing field, and that the backing field will always be a machine-endian representation of the scalar value with no reserved bits or other twiddling.

Couldn't it be changed though to guarantee this? That would be useful for people doing interop with UTF-32 as they could always use ReadOnlySpan<Rune>. After all, we do have such blittability guarantees for char, so why not for Rune.

GrabYourPitchforks commented 1 year ago

Couldn't it be changed though to guarantee this? That would be useful for people doing interop with UTF-32 as they could always use ReadOnlySpan<Rune>.

We can always make whatever guarantees we want. :) But this would stop us from making other optimizations, such as squirreling away the UTF-8 length in the top 2 bits of the struct value (that would certainly simplify getting the UTF-8 byte count!), having supplementary-plane runes represented behind the scenes as (first_char << 16) | last_char, which would make ToString much faster, etc.

After all, we do have such blittability guarantees for char, so why not for Rune.

char is a fundamental data type and interop was built in to its original design. That's not the case for arbitrary structs like Rune.

dotnet / runtime

Provide an implementation of `ReadOnlySpan<Rune>.ToString` #91426