dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

IUtf8SpanFormattable and IUtf8SpanParsable #81500

Open tannergooding opened 1 year ago

tannergooding commented 1 year ago

Background and Motivation

We currently support UTF-16 based formatting and parsing, and we expose common interfaces through which any developer can declare that their types support the same. However, we have no equivalent support for UTF-8.

With UTF-8 being ever more prevalent for various scenarios, it would be ideal if similar interfaces could be exposed so users can express that their own types support the functionality.

As such, I propose we expose two new interfaces that support parsing and formatting types using UTF-8. For now these interfaces would only support spans, because we do not have a corresponding Utf8String type that would make exposing IUtf8Formattable or IUtf8Parsable viable. We could express those over byte[], but that is "less ideal" and would block us from supporting any future UTF-8 string type.

Proposed API

namespace System;

public interface IUtf8SpanFormattable
{
    bool TryFormat(Span<byte> destination, out int bytesWritten, ReadOnlySpan<byte> format, IFormatProvider? provider);
}

public interface IUtf8SpanParsable<TSelf>
    where TSelf : IUtf8SpanParsable<TSelf>?
{
    static abstract TSelf Parse(ReadOnlySpan<byte> s, IFormatProvider? provider);

    static abstract bool TryParse(ReadOnlySpan<byte> s, IFormatProvider? provider, [MaybeNullWhen(returnValue: false)] out TSelf result);
}

Initial types that will implement the interface

namespace System
{
    public partial struct Byte : IUtf8SpanFormattable, IUtf8SpanParsable<byte>;
    public partial struct Char : IUtf8SpanFormattable, IUtf8SpanParsable<char>;
    public partial struct Decimal : IUtf8SpanFormattable, IUtf8SpanParsable<decimal>;
    public partial struct Double : IUtf8SpanFormattable, IUtf8SpanParsable<double>;
    public partial struct Half : IUtf8SpanFormattable, IUtf8SpanParsable<Half>;
    public partial struct Int16 : IUtf8SpanFormattable, IUtf8SpanParsable<short>;
    public partial struct Int32 : IUtf8SpanFormattable, IUtf8SpanParsable<int>;
    public partial struct Int64 : IUtf8SpanFormattable, IUtf8SpanParsable<long>;
    public partial struct Int128 : IUtf8SpanFormattable, IUtf8SpanParsable<Int128>;
    public partial struct IntPtr : IUtf8SpanFormattable, IUtf8SpanParsable<nint>;
    public partial struct SByte : IUtf8SpanFormattable, IUtf8SpanParsable<sbyte>;
    public partial struct Single : IUtf8SpanFormattable, IUtf8SpanParsable<float>;
    public partial struct UInt16 : IUtf8SpanFormattable, IUtf8SpanParsable<ushort>;
    public partial struct UInt32 : IUtf8SpanFormattable, IUtf8SpanParsable<uint>;
    public partial struct UInt64 : IUtf8SpanFormattable, IUtf8SpanParsable<ulong>;
    public partial struct UInt128 : IUtf8SpanFormattable, IUtf8SpanParsable<UInt128>;
    public partial struct UIntPtr : IUtf8SpanFormattable, IUtf8SpanParsable<nuint>;

    public partial struct DateOnly : IUtf8SpanFormattable, IUtf8SpanParsable<DateOnly>;
    public partial struct DateTime : IUtf8SpanFormattable, IUtf8SpanParsable<DateTime>;
    public partial struct DateTimeOffset : IUtf8SpanFormattable, IUtf8SpanParsable<DateTimeOffset>;
    public partial struct Guid : IUtf8SpanFormattable, IUtf8SpanParsable<Guid>;
    public partial struct TimeOnly : IUtf8SpanFormattable, IUtf8SpanParsable<TimeOnly>;
    public partial struct TimeSpan : IUtf8SpanFormattable, IUtf8SpanParsable<TimeSpan>;
}

namespace System.Numerics
{
    public partial struct Complex : IUtf8SpanFormattable, IUtf8SpanParsable<Complex>;
    public partial struct BigInteger : IUtf8SpanFormattable, IUtf8SpanParsable<BigInteger>;
}

namespace System.Runtime.InteropServices
{
    public partial struct NFloat : IUtf8SpanFormattable, IUtf8SpanParsable<NFloat>;
}

System.Enum, System.Text.Rune, and System.Version all implement ISpanFormattable today. They could optionally implement IUtf8SpanFormattable as well.

We should ideally have System.Numerics.INumberBase<TSelf> implement both IUtf8SpanFormattable and IUtf8SpanParsable<TSelf>. Doing this would require a default interface method (DIM) that defers to the UTF-16 variant.

Additional Considerations

It may be desirable to provide some API that lets users know the longest potential format string so they can have a "fail safe" way of formatting their value. For many types this is a well-defined upper bound or can be trivially computed.
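
Without such an upper-bound API, a caller has to guess a buffer size and retry when TryFormat returns false. The following is only a sketch of that fallback, assuming the char-based format parameter from the later reviewed shape; FormatToUtf8, the 16-byte starting size, and the 1 MiB cap are all hypothetical choices, not part of the proposal.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: formats any IUtf8SpanFormattable into a right-sized byte[],
// doubling the buffer whenever TryFormat reports that the destination is too small.
static byte[] FormatToUtf8<T>(T value, ReadOnlySpan<char> format = default)
    where T : IUtf8SpanFormattable
{
    byte[] buffer = new byte[16]; // arbitrary starting size (assumption)
    while (true)
    {
        if (value.TryFormat(buffer, out int written, format, CultureInfo.InvariantCulture))
        {
            return buffer.AsSpan(0, written).ToArray();
        }

        if (buffer.Length >= 1 << 20)
        {
            // Arbitrary safety cap (assumption) so the loop cannot grow forever.
            throw new InvalidOperationException("Formatted value is unexpectedly large.");
        }

        buffer = new byte[buffer.Length * 2];
    }
}

// Guid's default format needs 36 bytes, so the first 16-byte attempt fails and is retried.
byte[] utf8 = FormatToUtf8(Guid.Empty);
Console.WriteLine(utf8.Length); // 36
```

A "maximum formatted length" API would let callers size the buffer once and skip the retry loop.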

These APIs operate like ISpanFormattable and ISpanParsable and not like Utf8Formatter or Utf8Parser. That is, they fail if they encounter unrecognized or unsupported data, whereas the latter instead treat it as effectively "end of data to parse". There are both pros and cons to this approach, but I believe that the latter's functionality is better expressed via a different API, one that could also apply to UTF-16.

This doesn't account for number parsing, which would likely entail extending INumberBase<TSelf> with new UTF-8 APIs as well. If we expose such APIs, we'd also extend INumberBase<TSelf> with the following methods (which would be DIMs that defer to the UTF-16 variants):

static virtual TSelf Parse(ReadOnlySpan<byte> s, NumberStyles style, IFormatProvider? provider);
static virtual bool TryParse(ReadOnlySpan<byte> s, NumberStyles style, IFormatProvider? provider, [MaybeNullWhen(false)] out TSelf result);
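
To illustrate what "defer to the UTF-16 variants" could look like, here is a rough sketch written as a free-standing helper instead of an actual default interface method. ParseUtf8ViaUtf16 is a hypothetical name, and the real default implementation may well avoid the temporary array; the only point is the transcode-then-delegate idea.

```csharp
using System;
using System.Globalization;
using System.Numerics;
using System.Text;

// Hypothetical stand-in for the default interface method: transcode UTF-8 to UTF-16,
// then defer to the existing char-based INumberBase<TSelf>.Parse.
static T ParseUtf8ViaUtf16<T>(ReadOnlySpan<byte> utf8Text, NumberStyles style, IFormatProvider? provider)
    where T : INumberBase<T>
{
    // UTF-8 never decodes to more UTF-16 code units than it has bytes.
    char[] utf16 = new char[utf8Text.Length];
    int charCount = Encoding.UTF8.GetChars(utf8Text, utf16);

    ReadOnlySpan<char> chars = utf16.AsSpan(0, charCount);
    return T.Parse(chars, style, provider);
}

int value = ParseUtf8ViaUtf16<int>("1,234"u8, NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
Console.WriteLine(value); // 1234
```

Types that care about performance would override such a default with a direct UTF-8 implementation.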

Should we take ReadOnlySpan<byte> format or string format? There are pros and cons to each approach.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Author: tannergooding
Assignees: -
Labels: `area-System.Memory`, `untriaged`
Milestone: -

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Author: tannergooding
Assignees: -
Labels: `api-suggestion`, `area-System.Runtime`
Milestone: -

KTSnowy commented 1 year ago

Hi @tannergooding, would the new Decimal128 type from #81376 be able to support this API?

You mentioned that this API proposal doesn't account for number parsing, so I'm assuming that more work is needed for this to be compatible with the new decimal types, right?

Is there anything I can do to help with this?

tannergooding commented 1 year ago

It specifically doesn't account for the overloads that take NumberStyles; those would be a separate consideration that we make either as part of this API review or in a separate one.

terrajobst commented 1 year ago


namespace System;

public interface IUtf8SpanFormattable
{
    bool TryFormat(Span<byte> utf8Destination, out int bytesWritten, ReadOnlySpan<char> format, IFormatProvider? provider);
}

public interface IUtf8SpanParsable<TSelf>
    where TSelf : IUtf8SpanParsable<TSelf>?
{
    static abstract TSelf Parse(ReadOnlySpan<byte> utf8, IFormatProvider? provider);
    static abstract bool TryParse(ReadOnlySpan<byte> utf8, IFormatProvider? provider, [MaybeNullWhen(returnValue: false)] out TSelf result);
}

namespace System.Numerics;

public interface INumberBase<TSelf>
{
    static virtual TSelf Parse(ReadOnlySpan<byte> utf8Text, NumberStyles style, IFormatProvider? provider);
    static virtual bool TryParse(ReadOnlySpan<byte> utf8Text, NumberStyles style, IFormatProvider? provider, [MaybeNullWhen(false)] out TSelf result);
}
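
For readers wondering what implementing this shape on a user type might look like, here is a minimal sketch. Celsius and its "C" suffix are hypothetical; only the two interfaces and the Int32 UTF-8 TryFormat/TryParse members are taken from the API as it shipped in .NET 8 (where the parse parameter is named utf8Text).

```csharp
using System;

// Hypothetical user type; formats as the integer followed by 'C', e.g. "21C".
public readonly record struct Celsius(int Degrees)
    : IUtf8SpanFormattable, IUtf8SpanParsable<Celsius>
{
    public bool TryFormat(Span<byte> utf8Destination, out int bytesWritten,
        ReadOnlySpan<char> format, IFormatProvider? provider)
    {
        // Write the number, then append the 'C' suffix if there is room for it.
        if (Degrees.TryFormat(utf8Destination, out int written, format, provider)
            && written < utf8Destination.Length)
        {
            utf8Destination[written] = (byte)'C';
            bytesWritten = written + 1;
            return true;
        }

        bytesWritten = 0;
        return false;
    }

    public static bool TryParse(ReadOnlySpan<byte> utf8Text, IFormatProvider? provider, out Celsius result)
    {
        // Strict parsing: require the 'C' suffix and a fully valid integer before it.
        if (utf8Text.Length > 1 && utf8Text[^1] == (byte)'C'
            && int.TryParse(utf8Text[..^1], provider, out int degrees))
        {
            result = new Celsius(degrees);
            return true;
        }

        result = default;
        return false;
    }

    public static Celsius Parse(ReadOnlySpan<byte> utf8Text, IFormatProvider? provider)
        => TryParse(utf8Text, provider, out Celsius result)
            ? result
            : throw new FormatException("Input is not a valid Celsius value.");
}
```

With this in place, `Celsius.TryParse("21C"u8, null, out var c)` succeeds, while `"21"u8` is rejected because the suffix is missing.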
Sergio0694 commented 1 year ago

Is the parameter name here meant to be just "utf8", or should it be "utf8Text" as in INumberBase<TSelf>? 🤔

public interface IUtf8SpanParsable<TSelf>
    where TSelf : IUtf8SpanParsable<TSelf>?
{
    static abstract TSelf Parse(ReadOnlySpan<byte> utf8, IFormatProvider? provider);
    static abstract bool TryParse(ReadOnlySpan<byte> utf8, IFormatProvider? provider, [MaybeNullWhen(returnValue: false)] out TSelf result);
}

davidfowl commented 1 year ago

Does this make Utf8Formatter and Utf8Parser obsolete?

cc @DamianEdwards

stephentoub commented 1 year ago

I expect there will be little need for Utf8Formatter.

Utf8Parser diverged from the standard number parsing behavior. When it encounters something that's not part of the number and stops parsing, it returns what it has so far rather than failing. Analogous to StartsWith rather than Equals. That behavior is sometimes what you want, so it still has use, and I expect at some point we'll want to actually add the char equivalent, though I think it more likely we'd do so via NumberStyles so that it's integrated with generic math... at which point Utf8Parser would also no longer have much value.
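
The difference can be seen with a small sketch, assuming the UTF-8 int.TryParse overload that shipped with these interfaces in .NET 8; the input string is just an illustrative choice.

```csharp
using System;
using System.Buffers.Text;
using System.Globalization;

ReadOnlySpan<byte> input = "123abc"u8;

// Interface-style parsing: the whole input must be a valid number,
// so the trailing "abc" makes the call fail.
bool strictOk = int.TryParse(input, CultureInfo.InvariantCulture, out int strictValue);

// Utf8Parser stops at the first byte that is not part of the number
// and reports how many bytes it consumed.
bool lenientOk = Utf8Parser.TryParse(input, out int lenientValue, out int bytesConsumed);

Console.WriteLine($"{strictOk} {strictValue}");                    // False 0
Console.WriteLine($"{lenientOk} {lenientValue} {bytesConsumed}"); // True 123 3
```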

mellinoe commented 1 year ago

Suggestion: include the System.Numerics vector, matrix, and quaternion types as well. From a gamedev perspective it would be very nice to format these types directly to a UTF-8 buffer without allocating.

tannergooding commented 1 year ago

We'll plan on expanding the list of types as necessary. We're just looking at covering the most core types in the first pass.

Please feel free to open API proposals for other types as appropriate.

stephentoub commented 1 year ago

Implementation progress...

Interfaces:

IUtf8SpanFormattable implementations:

IUtf8SpanParsable implementations:

stephentoub commented 1 year ago

This issue covers adding UTF-8 support to types that already implement ISpanFormattable. Please open separate issues for other types. https://github.com/dotnet/runtime/issues/83201 exists for PhysicalAddress.

bartonjs commented 1 year ago


This came up in review to discuss whether the types implementing this interface should implement the methods implicitly (public on the type) or explicitly (requiring the interface cast/coercion).

The answer was "match what we did for ISpanParsable/ISpanFormattable", which seems to be implicit everywhere except System.Char (explicit there).
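
In practice the distinction looks roughly like the sketch below, assuming the .NET 8 surface area. FormatUtf8 is a hypothetical helper; a constrained generic call reaches an explicitly implemented member without boxing the value to the interface.

```csharp
using System;
using System.Text;

Span<byte> buffer = stackalloc byte[16];

// Hypothetical helper: calls TryFormat through the interface constraint.
static bool FormatUtf8<T>(T value, Span<byte> destination, out int bytesWritten)
    where T : IUtf8SpanFormattable
    => value.TryFormat(destination, out bytesWritten, default, null);

// Int32 implements the method implicitly, so it is callable directly on the value.
int number = 42;
bool intOk = number.TryFormat(buffer, out int intWritten);
Console.WriteLine($"{intOk} {intWritten}"); // True 2

// Char implements it explicitly, so there is no public UTF-8 TryFormat on char itself;
// the call has to go through the interface (here via the generic constraint).
bool charOk = FormatUtf8('é', buffer, out int charWritten);
Console.WriteLine($"{charOk} {charWritten} {Encoding.UTF8.GetString(buffer[..charWritten])}"); // True 2 é
```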

tannergooding commented 1 year ago

We landed IUtf8SpanFormattable and IUtf8SpanParsable, and implemented both on the primitive numeric types.

Some types didn't get support for both interfaces; we'll hopefully land those early in .NET 9.

tannergooding commented 3 months ago

Pending work here

IUtf8SpanFormattable implementations:

IUtf8SpanParsable implementations: