Open Neme12 opened 1 year ago
| Author: | Neme12 |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Memory`, `untriaged` |
| Milestone: | - |
I mentioned this in Discord chat the other day, but it would be neat if we had a feature like this:
public interface ISpanStringable<T> {
    static string ToString(ReadOnlySpan<T> buffer);
}
public readonly struct char : ISpanStringable<char> { /* ... */ }
public readonly struct Rune : ISpanStringable<Rune> { /* ... */ }
public ref struct Span<T> {
    public override string ToString() {
        if (T is ISpanStringable<T>) { return T.ToString(this); }
        else { /* fallback logic */ }
    }
}
Then we could even get rid of Span's existing char specialization within ToString, and any type can add its own specialization of this.
That said, I do question the value of this particular specialization, since I expect it to be exceedingly rare to have a contiguous collection of Rune instances. I'm struggling to think of a use case for this.
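The interface idea above can be approximated today with static abstract interface members (C# 11 / .NET 7 or later). This is only a sketch under assumptions: `char` and `Rune` cannot be modified from user code, so it uses a hypothetical wrapper struct `Codepoint` and a hypothetical extension method `ToDisplayString`, neither of which exists in the BCL. It also uses a generic constraint rather than the runtime `T is ISpanStringable<T>` check from the proposal.

```csharp
using System;
using System.Text;

// Hypothetical interface mirroring the proposal; not a BCL type.
public interface ISpanStringable<TSelf> where TSelf : ISpanStringable<TSelf>
{
    static abstract string ToString(ReadOnlySpan<TSelf> buffer);
}

// Hypothetical wrapper, needed because Rune itself cannot implement the interface.
public readonly struct Codepoint : ISpanStringable<Codepoint>
{
    public Rune Value { get; }

    public Codepoint(Rune value) => Value = value;

    // The type decides how a span of its own instances is rendered.
    public static string ToString(ReadOnlySpan<Codepoint> buffer)
    {
        var sb = new StringBuilder(buffer.Length);
        foreach (Codepoint c in buffer)
        {
            sb.Append(c.Value.ToString());
        }
        return sb.ToString();
    }
}

public static class SpanStringableExtensions
{
    // Dispatches to the element type's static ToString via the constraint.
    public static string ToDisplayString<T>(this ReadOnlySpan<T> span)
        where T : ISpanStringable<T>
        => T.ToString(span);
}
```

Unlike the proposal, this does not change what `Span<T>.ToString()` itself returns, so it does not help with what the debugger shows by default.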
I love that idea.
In my case, I'm representing a Grapheme as a sequence of Runes and therefore work with ReadOnlySpan<Rune> regularly.
I love that idea.
If only it could also work for UTF-8 though. 😔 It's such a shame that Utf8Char was never added and we have to forever live with such a bad debugging experience. 😔
(first and foremost, apologies for the plug, really, but it's relevant work)
If only it could also work for UTF-8 though.
https://github.com/U8String/U8String/blob/main/src/U8Enumerators.cs#L203 + https://github.com/U8String/U8String/blob/main/src/Shared/U8Conversions.cs
Generally speaking, any improvements to the current state of .NET APIs for working with text are very welcome. In this particular case, I think the better option would be introducing a conversion from Unicode scalars to UTF-16 on something like
namespace System.Text.Unicode;
public static partial class Utf16
{
    public static OperationStatus FromUnicode(ReadOnlySpan<byte> source, Span<char> destination, out int bytesRead, out int charsWritten);
    public static OperationStatus FromUnicode(ReadOnlySpan<Rune> source, Span<char> destination, out int runesRead, out int charsWritten);
}
and then subsequent methods which augment this for a more convenient usage. Unlike ROS<char> -> string conversions, decoding Runes to strings is not a common operation so special-casing ROS<Rune>.ToString() seems rather counterintuitive.
While such APIs are being discussed, you can already convert Span<Rune> to string with the following snippet instead:
var text = "hello world";
var runes = text.EnumerateRunes().ToArray();
var decoded = Encoding.UTF32.GetString(MemoryMarshal.Cast<Rune, byte>(runes));
Assert.Equal(text, decoded);
@neon-sunset Sorry, I don't understand how what you're saying is related to this proposal.
While such APIs are being discussed, you can already convert Span<Rune> to string with the following snippet instead:
I know I can manually convert it to string, but it helps with debugging to be able to see it immediately, just like it's really useful with ReadOnlySpan<char>.
I mentioned this in Discord chat the other day, but it would be neat if we had a feature like this:
It's almost tempting to call that interface ISpanFormattable 😄
var decoded = Encoding.UTF32.GetString(MemoryMarshal.Cast<Rune, byte>(runes));
This will result in undefined behavior. The bit pattern which backs a Rune instance is not guaranteed to be a little-endian UTF-32 code point.
I just do a quick string.Concat(runes.ToArray()) when debugging. There are more efficient ways of course. Although it would be nice if there was an easier way to do this efficiently.
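One possible way to avoid the per-rune string allocations, without depending on how Rune is laid out in memory, is to go through Rune's public encoding API. A minimal sketch, assuming a hypothetical helper named `RuneSpanExtensions.RunesToString` (not a BCL API):

```csharp
using System;
using System.Text;

public static class RuneSpanExtensions
{
    // Hypothetical helper: builds a string from a span of runes using only
    // the public Rune API (Utf16SequenceLength and EncodeToUtf16).
    public static string RunesToString(ReadOnlySpan<Rune> runes)
    {
        // First pass: compute the exact number of UTF-16 code units needed.
        int length = 0;
        foreach (Rune rune in runes)
        {
            length += rune.Utf16SequenceLength;
        }

        // Second pass: encode each rune into the buffer (1 or 2 chars each).
        char[] buffer = new char[length];
        int position = 0;
        foreach (Rune rune in runes)
        {
            position += rune.EncodeToUtf16(buffer.AsSpan(position));
        }

        return new string(buffer, 0, position);
    }
}
```

This still allocates one intermediate array and copies it into the final string, so it is not the most efficient option available; it only removes the per-rune allocations and the dependence on Rune's internal representation.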
This will result in undefined behavior. The bit pattern which backs a Rune instance is not guaranteed to be a little-endian UTF-32 code point.
Yes, I realize that this is pretty much a workaround, but the probability that such code will run on s390x is exceedingly small. For library code, it will get guarded behind a check for endianness.
Yes, but the probability that such code will run on s390x is exceedingly small. For library code, it will get guarded behind a check for endianness.
The point is that Rune currently happening to have a uint field that happens to represent a UTF-32 code point is an implementation detail that is allowed to change.
Yes, I realize that this is pretty much a workaround, but the probability that such code will run on s390x is exceedingly small. For library code, it will get guarded behind a check for endianness.
You're assuming that Rune will always consist of only one int-sized backing field, and that the backing field will always be a machine-endian representation of the scalar value with no reserved bits or other twiddling. This is not guaranteed behavior. By relying on this, you're relying on an unsupported implementation detail.
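For UTF-32 interop that does not depend on Rune's backing field, one option is to copy the scalar values out through the public `Rune.Value` property instead of reinterpreting the span's memory. A minimal sketch, assuming a hypothetical helper named `RuneInterop.CopyToUtf32` (not a BCL API):

```csharp
using System;
using System.Text;

public static class RuneInterop
{
    // Hypothetical helper: writes each rune's scalar value as a UTF-32 code
    // unit. It reads the value through Rune.Value, so it makes no assumption
    // about the size or bit layout of the Rune struct itself.
    public static int CopyToUtf32(ReadOnlySpan<Rune> source, Span<uint> destination)
    {
        if (destination.Length < source.Length)
        {
            throw new ArgumentException("Destination is too short.", nameof(destination));
        }

        for (int i = 0; i < source.Length; i++)
        {
            destination[i] = (uint)source[i].Value;
        }

        return source.Length;
    }
}
```

The cost is an explicit copy, which is the trade-off for not relying on the unsupported layout assumption that `MemoryMarshal.Cast<Rune, byte>` makes.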
You're assuming that Rune will always consist of only one int-sized backing field, and that the backing field will always be a machine-endian representation of the scalar value with no reserved bits or other twiddling. This is not guaranteed behavior. By relying on this, you're relying on an unsupported implementation detail.
My comments did not imply this is a good solution, just one that works for now. If it sounds otherwise, I apologize. If Rune and other APIs for working with Unicode change and improve, I'm all for it :)
You're assuming that Rune will always consist of only one int-sized backing field, and that the backing field will always be a machine-endian representation of the scalar value with no reserved bits or other twiddling.
Couldn't it be changed though to guarantee this? That would be useful for people doing interop with UTF-32 as they could always use ReadOnlySpan<Rune>. After all, we do have such blittability guarantees for char, so why not for Rune.
Couldn't it be changed though to guarantee this? That would be useful for people doing interop with UTF-32 as they could always use ReadOnlySpan<Rune>.
We can always make whatever guarantees we want. :) But this would stop us from making other optimizations, such as squirreling away the UTF-8 length in the top 2 bits of the struct value (that would certainly simplify getting the UTF-8 byte count!), having supplementary-plane runes represented behind the scenes as (first_char << 16) | last_char, which would make ToString much faster, etc.
After all, we do have such blittability guarantees for char, so why not for Rune.
char is a fundamental data type and interop was built in to its original design. That's not the case for arbitrary structs like Rune.
Currently, ReadOnlySpan<char>.ToString has a special-cased implementation that simply returns the string. It would be nice if this worked for spans of Rune as well.