dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.66k stars 4.58k forks source link

Proposal: Create OOB package for Rune, System.Text.Unicode.Utf8, and related APIs #52947

Open GrabYourPitchforks opened 3 years ago

GrabYourPitchforks commented 3 years ago

Problem summary

While working on the System.Text.Encodings.Web refactoring I noticed that we have several duplicates of the System.Text.Rune type (or its backing logic) throughout our projects.

The official implementation exposed by System.Runtime:

A copy used by the Utf8String OOB:

Multiple copies used by System.Text.Encodings.Web:

Duplicated logic in System.Text.Json:

System.Text.Json also has some duplicated logic from what turned into the System.Text.Unicode.Utf8 helper APIs:

Proposal

I propose to create a standalone Microsoft.Bcl.Unicode package which includes the functionality of the Rune type plus select helper methods. The exposed API surface would be as follows.

/*
 * These APIs are type-forwarded into System.Runtime or System.Memory on .NET Core 3.1+.
 */

namespace System.Text
{
    //
    // This is the public API surface of Rune as it existed in .NET Core 3.1.
    //
    public readonly partial struct Rune : System.IComparable<System.Text.Rune>, System.IEquatable<System.Text.Rune>
    {
        private readonly int _dummyPrimitive;
        public Rune(char ch) { throw null; }
        public Rune(char highSurrogate, char lowSurrogate) { throw null; }
        public Rune(int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public Rune(uint value) { throw null; }
        public bool IsAscii { get { throw null; } }
        public bool IsBmp { get { throw null; } }
        public int Plane { get { throw null; } }
        public static System.Text.Rune ReplacementChar { get { throw null; } }
        public int Utf16SequenceLength { get { throw null; } }
        public int Utf8SequenceLength { get { throw null; } }
        public int Value { get { throw null; } }
        public int CompareTo(System.Text.Rune other) { throw null; }
        public static System.Buffers.OperationStatus DecodeFromUtf16(System.ReadOnlySpan<char> source, out System.Text.Rune result, out int charsConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeFromUtf8(System.ReadOnlySpan<byte> source, out System.Text.Rune result, out int bytesConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeLastFromUtf16(System.ReadOnlySpan<char> source, out System.Text.Rune result, out int charsConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeLastFromUtf8(System.ReadOnlySpan<byte> source, out System.Text.Rune value, out int bytesConsumed) { throw null; }
        public int EncodeToUtf16(System.Span<char> destination) { throw null; }
        public int EncodeToUtf8(System.Span<byte> destination) { throw null; }
        public override bool Equals(object? obj) { throw null; }
        public bool Equals(System.Text.Rune other) { throw null; }
        public override int GetHashCode() { throw null; }
        public static double GetNumericValue(System.Text.Rune value) { throw null; }
        public static System.Text.Rune GetRuneAt(string input, int index) { throw null; }
        public static System.Globalization.UnicodeCategory GetUnicodeCategory(System.Text.Rune value) { throw null; }
        public static bool IsControl(System.Text.Rune value) { throw null; }
        public static bool IsDigit(System.Text.Rune value) { throw null; }
        public static bool IsLetter(System.Text.Rune value) { throw null; }
        public static bool IsLetterOrDigit(System.Text.Rune value) { throw null; }
        public static bool IsLower(System.Text.Rune value) { throw null; }
        public static bool IsNumber(System.Text.Rune value) { throw null; }
        public static bool IsPunctuation(System.Text.Rune value) { throw null; }
        public static bool IsSeparator(System.Text.Rune value) { throw null; }
        public static bool IsSymbol(System.Text.Rune value) { throw null; }
        public static bool IsUpper(System.Text.Rune value) { throw null; }
        public static bool IsValid(int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static bool IsValid(uint value) { throw null; }
        public static bool IsWhiteSpace(System.Text.Rune value) { throw null; }
        public static bool operator ==(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static explicit operator System.Text.Rune (char ch) { throw null; }
        public static explicit operator System.Text.Rune (int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static explicit operator System.Text.Rune (uint value) { throw null; }
        public static bool operator >(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator >=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator !=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator <(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator <=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static System.Text.Rune ToLower(System.Text.Rune value, System.Globalization.CultureInfo culture) { throw null; }
        public static System.Text.Rune ToLowerInvariant(System.Text.Rune value) { throw null; }
        public override string ToString() { throw null; }
        public static System.Text.Rune ToUpper(System.Text.Rune value, System.Globalization.CultureInfo culture) { throw null; }
        public static System.Text.Rune ToUpperInvariant(System.Text.Rune value) { throw null; }
        public static bool TryCreate(char highSurrogate, char lowSurrogate, out System.Text.Rune result) { throw null; }
        public static bool TryCreate(char ch, out System.Text.Rune result) { throw null; }
        public static bool TryCreate(int value, out System.Text.Rune result) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static bool TryCreate(uint value, out System.Text.Rune result) { throw null; }
        public bool TryEncodeToUtf16(System.Span<char> destination, out int charsWritten) { throw null; }
        public bool TryEncodeToUtf8(System.Span<byte> destination, out int bytesWritten) { throw null; }
        public static bool TryGetRuneAt(string input, int index, out System.Text.Rune value) { throw null; }
    }

    public ref partial struct SpanRuneEnumerator
    {
        private object _dummy;
        private int _dummyPrimitive;
        public System.Text.Rune Current { get { throw null; } }
        public System.Text.SpanRuneEnumerator GetEnumerator() { throw null; }
        public bool MoveNext() { throw null; }
    }

    public partial struct StringRuneEnumerator : System.Collections.Generic.IEnumerable<System.Text.Rune>, System.Collections.Generic.IEnumerator<System.Text.Rune>, System.Collections.IEnumerable, System.Collections.IEnumerator, System.IDisposable
    {
        private object _dummy;
        private int _dummyPrimitive;
        public System.Text.Rune Current { get { throw null; } }
        object? System.Collections.IEnumerator.Current { get { throw null; } }
        public System.Text.StringRuneEnumerator GetEnumerator() { throw null; }
        public bool MoveNext() { throw null; }
        System.Collections.Generic.IEnumerator<System.Text.Rune> System.Collections.Generic.IEnumerable<System.Text.Rune>.GetEnumerator() { throw null; }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { throw null; }
        void System.Collections.IEnumerator.Reset() { }
        void System.IDisposable.Dispose() { }
    }
}

namespace System.Text.Unicode
{
    public static partial class Utf8
    {
        public static System.Buffers.OperationStatus FromUtf16(System.ReadOnlySpan<char> source, System.Span<byte> destination, out int charsRead, out int bytesWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
        public static System.Buffers.OperationStatus ToUtf16(System.ReadOnlySpan<byte> source, System.Span<char> destination, out int bytesRead, out int charsWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
    }
}

/*
 * These APIs are extension methods to mimic instance APIs added in .NET Core 3.1.
 * The extension methods cannot be type-forwarded, but their 3.1+ implementations can act as
 * thin wrappers around the real framework APIs. Downlevel implementations will have the
 * logic fully contained within these methods.
 */

namespace System
{
    public partial static class MemoryExtensions
    {
        public static System.Text.SpanRuneEnumerator EnumerateRunes(this System.ReadOnlySpan<char> span) { throw null; }
        public static System.Text.SpanRuneEnumerator EnumerateRunes(this System.Span<char> span) { throw null; }
    }

    public partial static class StringExtensions
    {
        public static System.Text.StringRuneEnumerator EnumerateRunes(this string text) { throw null; }
    }
}

The package should target the frameworks net461, netstandard2.0, and netcoreapp3.1. When running on .NET Core 3.1+, the implementations are largely removed and the package type-forwards into the SDK.

Consumers like System.Text.Json and System.Text.Encodings.Web do not need to include references to this package as part of their ref set. However, when compiling for targets before .NET Core 3.1, the implementation assemblies would have a reference to this pacakge.

Q\&A

Can we use an existing package instead of creating a new Microsoft.Bcl.Unicode package? Probably, if that's desired. If we wanted to use an existing package, System.Runtime.Extensions and System.Text.Encoding.Extensions would probably be the obvious candidates. But these assemblies have existing sometimes-OOB-sometimes-inbox behavior, which also affects things like versioning. As part of the SDK, there may be implicit references to the APIs contained within these packages, which could cause problems for MemoryExtensions and StringExtensions (see below). In my ideal world, Microsoft.Bcl.Unicode would be a fully-OOB package that would never be included in the inbox SDK, as it's quite literally a polyfill / point-in-time release.

If we did add these APIs to System.Text.Encoding.Extensions or System.Runtime.Extensions, we would need to add a reference from those packages to System.Memory, which contains OperationStatus and [ReadOnly]Span<T>, since these are exposed via the public API surface.

Are MemoryExtensions and StringExtensions a good idea? The proposal above introduces them into the System namespace so that they look like the shipped .NET Core 3.1+ APIs. The type name MemoryExtensions is clearly already taken, but since most callers should invoke the APIs via extension method syntax rather than typical static method invocation syntax, I don't believe this will cause conflicts in practice. The only time it could cause conflicts is if a caller is targeting .NET Core 3.1+ and also has referenced this package explicitly.

If desired, we can also [Obsolete] the APIs in the .NET Core 3.1+ ref, which signals to library authors that they should only be pulling in this package when compiling for downlevel targets. If they're cross-compiling for multiple targets, they should configure their environment not to include this package reference in later APIs. We can even delete the APIs from the ref but leave them in the implementation, which forbids compiling against them but won't block existing compiled code from binding against it during load.

What about new APIs being introduced? For example, there may be desire for downlevel customers to call APIs introduced in https://github.com/dotnet/runtime/issues/28230, which is slated for introduction in the 6.0 timeframe. We can take these on a case-by-case basis; but generally speaking, there's nothing stopping us from utilizing the same techniques here. Implement the APIs to the best of our ability downlevel - which may involve leaving off some of the API surface - and type-forward on compatible runtimes.

What about the APIs added under System.Globalization? APIs like CompareInfo.IndexOf(ReadOnlySpan<char>, Rune, ...) cannot be implemented out-of-band without taking a significant performance hit. As it is, the OOB implementation (not the inbox implementation) of System.Text.Rune may already allocate for scenarios like Rune.GetUnicodeCategory(new Rune(0x10000)), but we would try to avoid allocations in the common case. For Rune-enlightened APIs on CompareInfo, we would not be able to avoid this allocation under even the most basic of scenarios. I don't think we should attempt to implement such APIs out-of-band.

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

jkotas commented 3 years ago

create a standalone System.Text.Unicode package

The naming pattern we have established for netstandard2.0 polyfills like this is Microsoft.Bcl.*, e.g. Microsoft.Bcl.AsyncInterfaces or Microsoft.Bcl.HashCode.

GrabYourPitchforks commented 3 years ago

Thanks for the tip! I've updated the proposal to read Microsoft.Bcl.Unicode.

joperezr commented 3 years ago

@GrabYourPitchforks is your idea to ship this new package in the 6.0 wave? I ask just to set the milestone.

@terrajobst what do you think? This is not really an API Proposal but just a new package proposal so not sure what the right process is for approving something like this.

GrabYourPitchforks commented 3 years ago

@joperezr 6.0 would be ideal, yes, so that our existing packages could remove their copies of this code and take the dependency appropriately.

krwq commented 3 years ago

@GrabYourPitchforks have made any progress? Is this still planned for 6.0?

GrabYourPitchforks commented 3 years ago

@krwq I made significant progress on this (see here), but since we're so late in the process I don't think we can hit 6.0. Spoke with Jeff about this as well, and will spend a little time looking through other code bases to see if they can benefit from these types or whether we should change which APIs are available in these packages.

krwq commented 3 years ago

@GrabYourPitchforks I'll move this to 7.0.0, please let us know if there are any updates here.

jeffhandley commented 1 year ago

There hasn't been much demand for this posted on this issue, but we're not quite ready to close it out. We'd like to invite input and requests on this issue. If you have scenarios where these APIs would be valuable in downlevel scenarios, please share your needs here. If we don't hear of substantial needs over the next couple months, we will close this out.

/cc @geeknoid @KrzysztofCwalina

KrzysztofCwalina commented 1 year ago

Azure SDK does not need the Rune APIs, but we do need the UTF8 buffer-based APIs that Levi worked on. (including but not only the Utf8 static class above) It really would be good to have them in the platform.

eiriktsarpalis commented 11 months ago

I came across this issue while working on https://github.com/dotnet/runtime/issues/90352 -- briefly contemplated writing yet another Rune polyfill in STJ.

Neme12 commented 10 months ago

This would be helpful for me too as I have my own polyfill for Rune as well.