dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

System.Text.Utf8Char data type to represent UTF-8 text data #28204

Open GrabYourPitchforks opened 5 years ago

GrabYourPitchforks commented 5 years ago

(Related: https://github.com/dotnet/corefx/issues/30503)

Motivations and driving principles behind the Utf8Char proposal

Utf8Char is analogous to Char: they represent a single UTF-8 code unit and a single UTF-16 code unit, respectively. They are distinct from the integral types Byte and UInt16 in that sequences of the UTF-* code unit types are meant to represent textual data, while sequences of the integral types are meant to represent binary data.

Drawing this distinction is important. With UTF-16 data (String, Char[]), this distinction historically hasn't been a source of confusion. Developers are generally cognizant of the fact that aside from RPC, most i/o involves some kind of transcoding mechanism. Binary data doesn't come in from disk or the network in a format that can be trivially projected as a textual string; it must go through validation, recombining, and substitution. Similarly, when writing a string to disk or the network, a trivial projection is again impossible. The transcoding step must run in reverse to get the text data into the correct binary format expected by i/o.

A brief interlude on conformance and security

There is a key aspect here that is often lost in nuance. The purpose of the transcoding step isn't simply to "shrink" a string of UTF-16 code units into a string of UTF-8 code units (conveniently the same size as octets!) so that it can be blasted across the wire, or vice versa. It is to do so in such a manner that the receiver can reconstruct the original string with full fidelity.

With UTF-8, it is tempting to perform a trivial projection between the binary i/o layer (bytes) and the textual layer (UTF-8 code units). The elemental data types are the same shape, after all, so a reinterpret cast seems legal at first glance. The problem with this design is that at a certain point, one or more components will need to operate on this text. If the text is ill-formed, the components may produce undefined behavior, or they may attempt to fix up the text on-the-fly but may disagree on the final shape of the fixed-up text. This violates the "with full fidelity" aspect mentioned in the previous paragraph.

As a concrete example, consider a web application that blindly treats all incoming form data as ReadOnlySpan<byte> and attempts to interpret it as UTF-8. Within the context of this single web application, there may not be a problem with this design. If the buffer contains ill-formed UTF-8, all of the APIs in the web application process might have undefined behavior as they're working with it, but they likely have consistent undefined behavior.

Web applications almost never exist as a single isolated process, however. There is undoubtedly a persistent data store - a database or other backend service. If the web application forwards the ReadOnlySpan<byte> (containing ill-formed UTF-8) through to these layers, the backend layers could look at the same sequence of bytes and process them differently. Perhaps Component A is using varchar(UTF8) for its backend storage but Component B is using nvarchar for its backend storage. There is now a mismatch - a loss of fidelity - between these two systems.
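The loss of fidelity can be observed with APIs that exist today. The following is a minimal sketch, not part of the proposal: the class and method names are made up for illustration, and it only uses the existing Encoding.UTF8 replacement behavior. Two different ill-formed payloads decode to the same string, and neither survives a decode/encode round trip.

```csharp
using System;
using System.Text;

static class FidelityDemo
{
    // Decodes with the default replacement behavior, then re-encodes.
    public static byte[] RoundTrip(byte[] input) =>
        Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(input));

    static void Main()
    {
        byte[] first = { 0x41, 0xC0, 0x42 };  // 'A', invalid byte C0, 'B'
        byte[] second = { 0x41, 0xFF, 0x42 }; // 'A', invalid byte FF, 'B'

        // Each invalid byte is replaced with U+FFFD, so the original bytes are lost.
        Console.WriteLine(BitConverter.ToString(RoundTrip(first)));  // 41-EF-BF-BD-42
        Console.WriteLine(BitConverter.ToString(RoundTrip(second))); // 41-EF-BF-BD-42

        // A component comparing decoded text sees these payloads as equal,
        // while a component comparing raw bytes sees them as different.
        Console.WriteLine(Encoding.UTF8.GetString(first) == Encoding.UTF8.GetString(second)); // True
    }
}
```

A component that stores the raw bytes and another that stores the decoded text would disagree about whether these two inputs are the same value.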

This places us into a somewhat peculiar position with respect to security. We generally think of CVEs as affecting individual frameworks or individual applications, but this underrepresents a class of issues best described as "the API surface leads developers to writing applications which appear secure in isolation but which are in fact dangerous when used in conjunction with other applications."

Some examples of where vulnerabilities arise due to the interplay of components which handle ill-formed UTF-8 sequences differently:

These issues tend to go underreported in the public sphere because the attack often must be tailored to a specific deployment or configuration of an application.

Back to Utf8Char

The proposal ultimately is to have ReadOnlySpan<Utf8Char> represent well-formed UTF-8 text data as much as possible. This mirrors ReadOnlySpan<Char>, which generally represents well-formed UTF-16 text data. In both cases it's possible for a developer to intentionally create ill-formed payloads by creating and populating a Utf8Char[] or Char[] with garbage and then producing a span over that buffer. But since developers tend not to take such actions intentionally, this shouldn't be a problem in practice. The standard way of getting a ReadOnlySpan<Utf8Char> from a ReadOnlySpan<byte> would be to use a factory that validates (and massages if necessary) the input data. This matches the behavior developers already expect when going from a byte sequence to a UTF-16 char sequence.

Generally speaking, Framework APIs which operate on ReadOnlySpan<byte> as UTF-8 input must not assume the input is well-formed and must have a well-defined behavior if ill-formed UTF-8 is encountered. The API may choose to take any number of actions - throw, return a failure code, perform replacement - as long as the behavior is part of the API contract and the caller understands this contract.
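To illustrate the three contracts (throw, failure code, replacement), here is a rough sketch built only on APIs that already ship: UTF8Encoding with throwOnInvalidBytes, System.Text.Unicode.Utf8.ToUtf16, and Encoding.UTF8. The wrapper class and method names are hypothetical and exist only for this example.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

static class ContractDemo
{
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Contract 1: throw when the input is ill-formed.
    public static string DecodeOrThrow(ReadOnlySpan<byte> utf8) => Strict.GetString(utf8);

    // Contract 2: report a failure code and leave the decision to the caller.
    public static OperationStatus TryDecode(ReadOnlySpan<byte> utf8, Span<char> destination, out int charsWritten) =>
        Utf8.ToUtf16(utf8, destination, out _, out charsWritten, replaceInvalidSequences: false);

    // Contract 3: replace ill-formed subsequences with U+FFFD.
    public static string DecodeWithReplacement(ReadOnlySpan<byte> utf8) => Encoding.UTF8.GetString(utf8);

    static void Main()
    {
        byte[] illFormed = { 0x41, 0xC0, 0x42 }; // 'A', invalid byte C0, 'B'

        try { DecodeOrThrow(illFormed); }
        catch (DecoderFallbackException) { Console.WriteLine("threw"); }

        Span<char> buffer = stackalloc char[8];
        Console.WriteLine(TryDecode(illFormed, buffer, out _));            // InvalidData
        Console.WriteLine(DecodeWithReplacement(illFormed) == "A\uFFFDB"); // True
    }
}
```

Whichever behavior an API picks, the point is that the caller can tell from the signature and documentation which one it is getting.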

Framework APIs which operate on ReadOnlySpan<Utf8Char> should validate the input for well-formedness if such checks do not impose a hardship on the method implementation. There may be certain performance-sensitive routines which cannot incur that cost; such routines may assume the input is well-formed and may have undefined behavior if this invariant is violated, short of that behavior causing an access violation or other runtime corruption. For example, if a routine is given the single-element input [ C2 ], it mustn't attempt to read off the end of the source buffer. Routines which require well-formed input must be contracted as such. Chunking APIs (discussed later) must at the very least continue to check for boundary conditions, even if they don't check for other ill-formedness in the sequence.
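The [ C2 ] boundary case can be seen with the existing Rune.DecodeFromUtf8 API, which reports a truncated sequence instead of reading past the end of the buffer. This is only a sketch of the expected boundary behavior using a shipped API; the DecodeFirst helper is a made-up name.

```csharp
using System;
using System.Buffers;
using System.Text;

static class BoundaryDemo
{
    // Decodes the first scalar value and reports how the decoder classified the input.
    public static OperationStatus DecodeFirst(ReadOnlySpan<byte> utf8, out int bytesConsumed) =>
        Rune.DecodeFromUtf8(utf8, out _, out bytesConsumed);

    static void Main()
    {
        // C2 is a lead byte that needs one continuation byte; none is present.
        Console.WriteLine(DecodeFirst(new byte[] { 0xC2 }, out _));             // NeedMoreData

        // With the continuation byte present the sequence decodes to U+0080.
        Console.WriteLine(DecodeFirst(new byte[] { 0xC2, 0x80 }, out int used)); // Done
        Console.WriteLine(used);                                                 // 2
    }
}
```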

For more information on conformance, validation, and the distinction between binary data and textual data, see:

Projections between Utf8Char and Byte

The UTF-8 code unit type Utf8Char does not attempt to validate its input.

Utf8Char c = (Utf8Char)(byte)0xC0; // creates a Utf8Char with the value C0
Rune r = new Rune(0xD800); // throws at runtime

In the above example, this creates a Utf8Char instance with the value C0, even though the Unicode Specification expressly states that C0 is never a valid value for a UTF-8 code unit. Contrast this with the Rune type, whose constructor prohibits creating instances from values outside the valid Unicode scalar range.
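For comparison, the validating behavior of Rune can be exercised with the existing Rune.IsValid and Rune.TryCreate members. A small sketch follows; the demo class and the ConstructorThrows helper are illustrative only.

```csharp
using System;
using System.Text;

static class RuneValidationDemo
{
    // Returns true if constructing a Rune from the value throws.
    public static bool ConstructorThrows(int value)
    {
        try { _ = new Rune(value); return false; }
        catch (ArgumentOutOfRangeException) { return true; }
    }

    static void Main()
    {
        // 0xD800 is a surrogate code point, which is not a Unicode scalar value.
        Console.WriteLine(Rune.IsValid(0xD800));          // False
        Console.WriteLine(Rune.TryCreate(0xD800, out _)); // False
        Console.WriteLine(ConstructorThrows(0xD800));     // True

        // U+1F600 is a valid scalar value.
        Console.WriteLine(Rune.IsValid(0x1F600));         // True
    }
}
```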

It is possible to project (reinterpret cast) a {ReadOnly}Span<Utf8Char> to a ReadOnlySpan<byte>. This is useful for operations like writing UTF-8 text directly to an i/o pipe.

ReadOnlySpan<Utf8Char> utf8 = ...;
ReadOnlySpan<byte> bytes = utf8.AsBytes();
stream.Write(bytes);

The projections Span<Utf8Char> -> Span<byte> and {ReadOnly}Span<byte> -> {ReadOnly}Span<Utf8Char> should also be possible. We do not want to prevent developers from removing any safety rails we provide within the Framework, but we also don't want developers to remove those rails inadvertently. Projections which blur the lines between textual representation and binary representation in a "dangerous" manner should require an affirmative action from the developer. One possible way to get this affirmation is to require use of the existing reinterpret_cast-like API.

ReadOnlySpan<Utf8Char> a = MemoryMarshal.Cast<byte, Utf8Char>(ReadOnlySpan<byte>);
Span<Utf8Char> b = MemoryMarshal.Cast<byte, Utf8Char>(Span<byte>);
Span<byte> c = MemoryMarshal.Cast<Utf8Char, byte>(Span<Utf8Char>);
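Since Utf8Char does not exist yet, the following sketch uses a hypothetical one-byte struct named Utf8CodeUnit as a stand-in to show what the affirmative projection looks like with the real MemoryMarshal.Cast API. The struct and the helper names are assumptions for illustration only.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the proposed Utf8Char: a struct wrapping a single byte.
readonly struct Utf8CodeUnit
{
    private readonly byte _value;
    public Utf8CodeUnit(byte value) => _value = value;
    public byte Value => _value;
}

static class ProjectionDemo
{
    // Binary -> text: no validation happens here, so the caller takes responsibility.
    public static ReadOnlySpan<Utf8CodeUnit> AsCodeUnits(ReadOnlySpan<byte> bytes) =>
        MemoryMarshal.Cast<byte, Utf8CodeUnit>(bytes);

    // Text -> binary: always safe, since any code unit is a valid byte.
    public static ReadOnlySpan<byte> AsBytes(ReadOnlySpan<Utf8CodeUnit> text) =>
        MemoryMarshal.Cast<Utf8CodeUnit, byte>(text);

    static void Main()
    {
        byte[] buffer = { 0x68, 0x69 }; // "hi"
        ReadOnlySpan<Utf8CodeUnit> text = AsCodeUnits(buffer);

        Console.WriteLine(text.Length);           // 2
        Console.WriteLine(text[1].Value == 0x69); // True
        Console.WriteLine(AsBytes(text)[0]);      // 104
    }
}
```

Both casts are reinterpretations over the same memory, so no copy takes place in either direction.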

The methods Span<T>.ToString and Memory<T>.ToString (and their read-only equivalents) will be enlightened for T = Utf8Char, just as they're enlightened for T = char today. The behavior of the method will be to transcode the data to UTF-16 (with invalid sequence replacement if necessary) and to return the expected String instance. This enlightenment will not extend to the case where T = byte.
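The existing enlightenment for T = char, and the lack of it for T = byte, can be checked today. A short sketch (the demo class is illustrative):

```csharp
using System;

static class ToStringDemo
{
    public static string CharSpanText() => "hello".AsSpan().ToString();
    public static string ByteSpanText() => new byte[5].AsSpan().ToString();

    static void Main()
    {
        // Spans of char return their contents as a string.
        Console.WriteLine(CharSpanText()); // hello

        // Spans of any other element type return the type name and length.
        Console.WriteLine(ByteSpanText()); // System.Span<Byte>[5]
    }
}
```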

Unlike Span<T>, Memory<T> instances cannot be projected to a different type Memory<U>. This means that there is no way to cast between Memory<Utf8Char> and Memory<byte> (or their read-only equivalents) on-the-fly.

Utf8String utf8 = new Utf8String("hello"); // initialized from literal
ReadOnlySpan<Utf8Char> asUtf8Chars = utf8; // implicit operator

// the below line could also be ROS<byte> asBytes = utf8.AsBytes();
ReadOnlySpan<byte> asBytes = asUtf8Chars.AsBytes();

fixed (byte* pA = asBytes)
fixed (Utf8Char* pB = asUtf8Chars)
{
   Debug.Assert((void*)pA == (void*)pB); // same reference
}

Debug.Assert(asBytes.ToString() == "System.ReadOnlySpan<Byte>[5]");
Debug.Assert(asUtf8Chars.ToString() == "hello");

Comparing to other languages

In Go 1.x, string and []byte are distinct sliceable types. Developers generally use strings to store textual data and byte slices to store binary data. This distinction is sometimes a bit blurry and developers may require external information (documentation, context, method names) to determine exactly what kind of information is represented by the slice, akin to using traditional char* pointers in C.

The biggest difference between the two types is that string represents truly immutable data (not just an immutable view into mutable data), where []byte represents mutable data. Thus there's no trivial projection possible between the two, and any conversion must necessarily be implemented as a copy. There are proposals for readonly slices in a future version of Go, though to the best of my knowledge these proposals have not been approved. If such a feature comes to fruition it seems like a non-copying projection string -> <readonly> []byte would be allowed implicitly, but the reverse projection <readonly> []byte -> string would still require a copy. (See https://github.com/golang/go/issues/20443 and https://groups.google.com/forum/#!topic/golang-dev/Y7j4B2r_eDw/discussion for further information.)

In Swift, it is possible to create a UTF-* view over any String instance. The corresponding types are String.UTF8View, String.UTF16View, and String.UnicodeScalarView. These types are specialized text sequence types distinct from normal binary data sequence types, even though the elemental types of the first two are the plain integers UInt8 and UInt16 (the third enumerates Unicode.Scalar values). This means that it is not possible to project String.UTF8View and [UInt8] between each other trivially; a copy must take place. (See https://developer.apple.com/documentation/swift/string/utf8view for further information.)

Alternative proposals

Utf8Slice

Instead of introducing a Utf8Char type and allowing ReadOnlySpan<Utf8Char> to represent a slice of UTF-8 textual data, one could imagine introducing a Utf8Slice type which is a thin wrapper around ReadOnlySpan<Byte>. Inspection or manipulation methods would operate on this type rather than exist as specialized extensions on MemoryExtensions. Utf8Slice would be indexable (with Byte as the elemental type).

There is some prior art here in that it's similar to how the Go language operates. But this leads to a problem in that Utf8Slice instances would be limited in functionality. They'd be immutable, requiring manipulation APIs to bounce through a separate byte sequence and wrap a new Utf8Slice around it. We'd have to determine if we'd want a heapable (ReadOnlyMemory<Byte>-based) sibling type. There would be confusion as to why there's asymmetry between this and the UTF-16 types. After these and other considerations we're basically reinventing the Utf8String proposal, so there's minimal benefit to Utf8Slice as proposed here.

Use ReadOnlySpan<byte> for everything

This is tempting from the perspective of a system that wants to treat everything as pass-through as much as possible, but I don't believe it's appropriate from the perspective of a framework. There are two main issues I have with this approach.

The first is that it interferes with the general concept of a type system and makes it more difficult to reason about code. If a developer has a Byte[] in their code, they shouldn't need the additional bookkeeping overhead of asking themselves "does this represent binary data like a JPG, or does this represent UTF-8 text?" Text-based extension methods (Contains, ToUpper, etc.) also shouldn't begin appearing for arbitrary binary data sequences.

The second is that this blurs the line between binary data and textual data, leading to the validation and conformance problems mentioned earlier. I don't want the framework to encourage developers to play fast and loose with this, potentially leading to undefined behavior in their applications. This is still subject to the earlier caveats: power developers should absolutely be able to project the data with minimal fuss, but this should be an affirmative action.

namespace System.Text
{
    // Represents the fundamental elemental type of UTF-8 textual data and is distinct
    // from System.Byte, similar to how System.Char is the fundamental elemental type
    // of UTF-16 textual data and is distinct from System.UInt16.
    //
    // Ideally the compiler would support various syntaxes for this, like:
    // Utf8Char theChar = 63; // Implicit assignment of const to local of type Utf8Char
    public readonly struct Utf8Char : IComparable<Utf8Char>, IEquatable<Utf8Char>
    {
        private readonly int _dummy;

        // Construction is performed via a cast. All casts are checked for overflow
        // but not for correctness. For example, casting -1 to Utf8Char will fail
        // with an OverflowException, but casting 0xFF to Utf8Char will succeed even
        // though 0xFF is never a valid UTF-8 code unit. Additionally, even though
        // the cast from Byte to Utf8Char can never overflow, it's still an explicit
        // cast because we don't want devs to fall into the habit of treating arbitrary
        // integral types as equivalent to textual data types. As an existing example of
        // this in the current compiler, there's no implicit cast from Byte to Char even
        // though it's a widening operation, but there is an explicit cast.

        public static explicit operator Utf8Char(byte value) => throw null;
        public static explicit operator Utf8Char(sbyte value) => throw null;
        public static explicit operator Utf8Char(char value) => throw null;
        public static explicit operator Utf8Char(short value) => throw null;
        public static explicit operator Utf8Char(ushort value) => throw null;
        public static explicit operator Utf8Char(int value) => throw null;
        public static explicit operator Utf8Char(uint value) => throw null;
        public static explicit operator Utf8Char(long value) => throw null;
        public static explicit operator Utf8Char(ulong value) => throw null;

        // Casts to the various primitive integral types. All casts are implicit
        // with two exceptions, which are explicit:
        // - Cast to SByte, because it could result in an OverflowException.
        // - Cast to Char, for the same reason as the Byte-to-Utf8Char cast.

        public static implicit operator byte(Utf8Char value) => throw null;
        public static explicit operator sbyte(Utf8Char value) => throw null;
        public static explicit operator char(Utf8Char value) => throw null;
        public static implicit operator short(Utf8Char value) => throw null;
        public static implicit operator ushort(Utf8Char value) => throw null;
        public static implicit operator int(Utf8Char value) => throw null;
        public static implicit operator uint(Utf8Char value) => throw null;
        public static implicit operator long(Utf8Char value) => throw null;
        public static implicit operator ulong(Utf8Char value) => throw null;

        public static bool operator ==(Utf8Char a, Utf8Char b) => throw null;
        public static bool operator !=(Utf8Char a, Utf8Char b) => throw null;
        public static bool operator <(Utf8Char a, Utf8Char b) => throw null;
        public static bool operator <=(Utf8Char a, Utf8Char b) => throw null;
        public static bool operator >(Utf8Char a, Utf8Char b) => throw null;
        public static bool operator >=(Utf8Char a, Utf8Char b) => throw null;
        public int CompareTo(Utf8Char other) => throw null;
        public override bool Equals(object obj) => throw null;
        public bool Equals(Utf8Char other) => throw null;
        public override int GetHashCode() => throw null;
        public override string ToString() => throw null;
    }
}

Due to the way this type is defined and the presence of implicit conversion operators, mathematical and comparison operators behave as expected. Some examples are given below.

Utf8Char a = (Utf8Char)42;
var b = a + 10; // = int 52
bool c = (a == b); // int to int comparison, returns false
bool d = (a == (b - 10)); // int to int comparison, returns true
long e = a; // = long 42
byte f = (byte)a; // = byte 42
Utf8Char g = (Utf8Char)(-b); // = OverflowException
Utf8Char h = (Utf8Char)(byte)(-b); // = 0xCC
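To see why arithmetic and comparisons fall out of the conversion operators, here is a pared-down sketch. CodeUnit is a hypothetical stand-in that defines only a checked explicit cast from int and an implicit cast to int; it is not the proposed type and omits the other operators.

```csharp
using System;

// Hypothetical stand-in with just two conversions, for illustration.
readonly struct CodeUnit
{
    private readonly byte _value;
    private CodeUnit(byte value) => _value = value;

    // Checked for range only, not for UTF-8 validity.
    public static explicit operator CodeUnit(int value) => new CodeUnit(checked((byte)value));

    // Widening, so it can be implicit.
    public static implicit operator int(CodeUnit value) => value._value;
}

static class OperatorDemo
{
    public static bool ThrowsOnCast(int value)
    {
        try { _ = (CodeUnit)value; return false; }
        catch (OverflowException) { return true; }
    }

    static void Main()
    {
        CodeUnit a = (CodeUnit)42;
        int b = a + 10;                        // 'a' converts to int, so this is int addition
        Console.WriteLine(b);                  // 52
        Console.WriteLine(a == b - 10);        // True (int to int comparison)
        Console.WriteLine(ThrowsOnCast(-52));  // True (out of range for a byte)
        Console.WriteLine(ThrowsOnCast(0xFF)); // False (in range, though never valid UTF-8)
    }
}
```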

Sample APIs for operating with Utf8Char sequences follow.

namespace System.IO
{
   // EXISTING class - NEW methods
   public abstract class TextWriter
   {
      /*
       * When the writer's encoding is UTF-8 (the default for StreamWriter), calling
       * Write on these instances with input which is already UTF-8 is a simple
       * projection without the overhead of transcoding.
       */

      public virtual void Write(ReadOnlySpan<Utf8Char> buffer);
      public virtual Task WriteAsync(ReadOnlyMemory<Utf8Char> buffer, CancellationToken cancellationToken = default);
      public virtual void WriteLine(ReadOnlySpan<Utf8Char> buffer);
      public virtual Task WriteLineAsync(ReadOnlyMemory<Utf8Char> buffer, CancellationToken cancellationToken = default);
   }

   // EXISTING class - NEW methods
   public class BinaryWriter
   {
      public virtual void Write(ReadOnlySpan<Utf8Char> utf8Chars);
   }
}

namespace System.Text
{
   // EXISTING class - NEW methods
   public struct Rune
   {
      public bool TryEncodeUtf8(Span<Utf8Char> destination, out int utf8CharsWritten);
   }
}

namespace System
{
   public static class MemoryExtensions
   {
      // project UTF-8 text buffer as binary
      public static ReadOnlySpan<byte> AsBytes(this ReadOnlySpan<Utf8Char> utf8Chars);

      public static int CompareTo(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> other, StringComparison comparisonType);

      public static bool Contains(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> value, StringComparison comparisonType);

      // convenience overloads that are equivalent to 'Ordinal'
      public static bool Contains(this ReadOnlySpan<Utf8Char> span, char value);
      public static bool Contains(this ReadOnlySpan<Utf8Char> span, Rune value);

      public static bool EndsWith(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> value, StringComparison comparisonType);

      public static Utf8SpanRuneEnumerator EnumerateRunes(this ReadOnlySpan<Utf8Char> span);
      public static Utf8SpanRuneEnumerator EnumerateRunes(this Span<Utf8Char> span);

      public static bool Equals(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> other, StringComparison comparisonType);

      public static int IndexOf(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> value, StringComparison comparisonType);

      public static bool IsWhiteSpace(this ReadOnlySpan<Utf8Char> span);

      public static int LastIndexOf(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> value, StringComparison comparisonType);

      public static bool StartsWith(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> value, StringComparison comparisonType);

      public static ReadOnlySpan<Utf8Char> Trim(this ReadOnlySpan<Utf8Char> span);
      public static ReadOnlySpan<Utf8Char> TrimEnd(this ReadOnlySpan<Utf8Char> span);
      public static ReadOnlySpan<Utf8Char> TrimStart(this ReadOnlySpan<Utf8Char> span);

      /*
       * Differences in behavior from UTF-16 ToLower/ToUpper methods:
       *
       * - These methods may begin writing to the destination buffer before they realize that the destination
       *   buffer is too small, so a return value of -1 means that the output buffer may contain nonsense.
       *
       * - If the source and destination buffers overlap at all, the result is undefined. (The UTF-16 behavior
       *   is that if source and destination are identical, it's an in-place transformation; but if source and
       *   destination have a partial overlap, the result is undefined.)
       *
       * These differences are due to the fact that UTF-16 text does not change length (code unit count) when
       * undergoing simple case conversion, but UTF-8 text may change length (code unit count) when undergoing
       * simple case conversion.
       *
       * Like the UTF-16 equivalents, the UTF-8 methods result in undefined output given ill-formed source data.
       */
      public static int ToLower(this ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, CultureInfo culture);
      public static int ToLowerInvariant(this ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination);
      public static int ToUpper(this ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, CultureInfo culture);
      public static int ToUpperInvariant(this ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination);
   }
}

namespace System.Text.Unicode
{
   /*
    * NEW class to contain static utility & validation methods. Under System.Text.Unicode namespace so as not to
    * interfere with 'Utf8' type names that might already exist in application code. If we instead want to put
    * this under the default namespace System.Text, suggest Utf8Utility (existing naming convention).
    *
    * And yes, "Utf8" is the proper capitalization per our design guidelines since the acronym is greater than 2 letters
    * and doesn't represent a trademarked term, prior art be damned. :)
    */
   public static class Utf8
   {
      public static bool IsWellFormedSequence(ReadOnlySpan<byte> sequence);
      public static bool IsWellFormedSequence(ReadOnlySpan<Utf8Char> sequence);
   }

   public static class Utf16
   {
      public static bool IsWellFormedSequence(ReadOnlySpan<char> sequence);
   }
}

Not shown in the above APIs are UTF-8 overloads for common networking APIs like IPAddress.Parse(ReadOnlySpan<Utf8Char>), etc. It will require additional effort to enumerate the entire list of APIs which should be UTF-8 enlightened as part of this effort. For now this issue focuses mainly on text processing and manipulation APIs.

Chunking APIs

While ReadOnlySpan<char> and ReadOnlySpan<Utf8Char> should represent standalone well-formed UTF-* sequences as much as possible, we must recognize that applications which work on discontiguous buffers cannot always guarantee this property. Often an application will be required to chunk a large piece of text into several smaller buffers due to performance considerations. This behavior is seen in existing Framework types like StringBuilder and ReadOnlySequence<T>.

This chunking could occur such that slice boundaries occur in the middle of a multi-code unit sequence. In these cases the individual chunks may be ill-formed, but the logical concatenation of these chunks represents a well-formed supersequence. A UTF-16 example is the well-formed sequence [ D808 DF45 ] chunked into the ill-formed subsequences [ D808 ] and [ DF45 ]. A UTF-8 example is the well-formed sequence [ F0 92 8D 85 ] chunked into the ill-formed sequences [ F0 ] and [ 92 8D 85 ]. The Framework should provide OperationStatus-based APIs as much as possible to enable this scenario.
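As a rough sketch of how a caller might drive an OperationStatus-based API across chunk boundaries, the following uses the existing System.Text.Unicode.Utf8.ToUtf16 method (byte-based, since Utf8Char does not exist). The DecodeChunks helper and its carry-over strategy are assumptions made for this example, not a proposed API.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

static class ChunkDemo
{
    // Transcodes UTF-8 chunks to a string, carrying any incomplete trailing
    // bytes from one chunk over to the start of the next.
    public static string DecodeChunks(params byte[][] chunks)
    {
        var result = new StringBuilder();
        byte[] carry = Array.Empty<byte>();

        for (int i = 0; i < chunks.Length; i++)
        {
            // Prepend the bytes left over from the previous chunk.
            byte[] input = new byte[carry.Length + chunks[i].Length];
            carry.CopyTo(input, 0);
            chunks[i].CopyTo(input, carry.Length);

            // UTF-16 output never needs more code units than the UTF-8 input has bytes.
            char[] output = new char[input.Length];
            bool isFinal = i == chunks.Length - 1;

            OperationStatus status = Utf8.ToUtf16(input, output, out int bytesRead, out int charsWritten,
                replaceInvalidSequences: true, isFinalBlock: isFinal);
            if (status == OperationStatus.DestinationTooSmall)
                throw new InvalidOperationException("Unexpected: the output buffer was sized to fit.");

            result.Append(output, 0, charsWritten);
            carry = input.AsSpan(bytesRead).ToArray(); // non-empty only after NeedMoreData
        }

        return result.ToString();
    }

    static void Main()
    {
        // [ F0 92 8D 85 ] split after the first byte, as in the example above.
        string text = DecodeChunks(new byte[] { 0xF0 }, new byte[] { 0x92, 0x8D, 0x85 });
        Console.WriteLine(text == "\uD808\uDF45"); // True
    }
}
```

The first call sees only [ F0 ] with isFinalBlock set to false, so it consumes nothing and the byte is retried together with the next chunk.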

Of special note is that some text operations cannot be performed in a chunked fashion. APIs like case conversion (ToUpper, ToLower) and transcoding can be created to allow for chunking, but comparison APIs (CompareTo, StartsWith) do not allow for chunking. A concrete example of this follows. In this example, chunking will cause StartsWith to return a false positive result.

// Assumes current culture is en-US
static void Main(string[] args)
{
   string theString = "e\u0301"; // [ 'e', '\u0301' ]
   Console.WriteLine(theString.StartsWith("e")); // prints "False"

   theString = theString.Substring(0, 1); // chunk to [ 'e' ]
   Console.WriteLine(theString.StartsWith("e")); // prints "True"
}

Since chunking is not unique to UTF-8, the Framework should provide chunking APIs for both UTF-8 and UTF-16 data. The Framework should also provide chunking APIs for transcoding routines. The existing OperationStatus type can be utilized for this purpose.

namespace System.Text.Unicode
{
   public static class Utf8
   {
      /*
       * OperationStatus-based APIs for case conversion of chunked UTF-8 data.
       * This method *may* return OperationStatus.InvalidData when given ill-formed
       * input, though the underlying localization library may opt to handle this
       * scenario itself during case conversion. The behavior may differ when the
       * invariant globalization mode is active. (One possible behavior is that the
       * library may choose to propagate ill-formed subsequences from the source
       * buffer to the destination buffer unmodified.)
       *
       * If the source and destination buffers overlap at all, the destination buffer
       * contents will be undefined.
       *
       * 'numCharsRead' may or may not equal 'numCharsWritten' on method return.
       */

      public static OperationStatus ToUpper(ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, CultureInfo culture, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus ToUpperInvariant(ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus ToLower(ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, CultureInfo culture, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus ToLowerInvariant(ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);

      /*
       * OperationStatus-based APIs for transcoding of chunked data.
       * This method is similar to Encoding.UTF8.GetBytes / GetChars but has a
       * different calling convention, different error handling mechanisms, and
       * different performance characteristics.
       *
       * If 'replaceInvalidSequences' is true, the method will replace any ill-formed
       * subsequence in the source with U+FFFD when transcoding to the destination,
       * then it will continue processing the remainder of the buffers. Otherwise
       * the method will return OperationStatus.InvalidData.
       *
       * If the method does return an error code, the out parameters will represent
       * how much of the data was successfully transcoded, and the location of the
       * ill-formed subsequence can be deduced from these values.
       *
       * If 'replaceInvalidSequences' is true, the method is guaranteed never to return
       * OperationStatus.InvalidData. If 'isFinalChunk' is true, the method is
       * guaranteed never to return OperationStatus.NeedMoreData.
       *
       * Byte-based overloads are provided alongside Utf8Char-based overloads so that
       * transcoding can be performed directly from or directly to binary network buffers.
       */

      public static OperationStatus FromChars(ReadOnlySpan<char> source, Span<Utf8Char> destination, bool replaceInvalidSequences, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus FromChars(ReadOnlySpan<char> source, Span<byte> utf8Destination, bool replaceInvalidSequences, bool isFinalChunk, out int numCharsRead, out int numBytesWritten);

      public static OperationStatus ToChars(ReadOnlySpan<byte> utf8Source, Span<char> destination, bool replaceInvalidSequences, bool isFinalChunk, out int numBytesRead, out int numCharsWritten);
      public static OperationStatus ToChars(ReadOnlySpan<Utf8Char> source, Span<char> destination, bool replaceInvalidSequences, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);

      /*
       * OperationStatus-based API that copies data from a binary data buffer into a
       * text buffer, validating well-formedness during the copy and optionally
       * patching ill-formed subsequences in the destination.
       *
       * If 'replaceInvalidSequences' is true, the method will replace any ill-formed
       * subsequence in the source with U+FFFD when copying to the destination,
       * then it will continue processing the remainder of the buffers. Otherwise
       * the method will return OperationStatus.InvalidData.
       *
       * If 'replaceInvalidSequences' is true, the method is guaranteed never to return
       * OperationStatus.InvalidData. If 'isFinalChunk' is true, the method is
       * guaranteed never to return OperationStatus.NeedMoreData.
       */

      public static OperationStatus ToValidUtf8Chars(ReadOnlySpan<byte> utf8Source, Span<Utf8Char> destination, bool replaceInvalidSequences, bool isFinalChunk, out int numBytesRead, out int numCharsWritten);
   }

   public static class Utf16
   {
      /*
       * OperationStatus-based APIs for case conversion of chunked UTF-16 data.
       * This method *may* return OperationStatus.InvalidData when given ill-formed
       * input, though the underlying localization library may opt to handle this
       * scenario itself during case conversion. The behavior may differ when the
       * invariant globalization mode is active. (One possible behavior is that the
       * library may choose to propagate ill-formed subsequences from the source
       * buffer to the destination buffer unmodified.)
       *
       * The same buffer can be provided as both source and destination to perform
       * an in-place case conversion. If the buffers overlap only partially, the
       * buffer contents will be undefined.
       *
       * 'numCharsRead' will always equal 'numCharsWritten'.
       */

      public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, CultureInfo culture, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus ToUpperInvariant(ReadOnlySpan<char> source, Span<char> destination, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, CultureInfo culture, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
      public static OperationStatus ToLowerInvariant(ReadOnlySpan<char> source, Span<char> destination, bool isFinalChunk, out int numCharsRead, out int numCharsWritten);
   }
}

ericsampson commented 5 years ago

Levi, how exactly does this mesh with Rune?

ericsampson commented 5 years ago

Re the sections quoted below, this gives me the heebie-jeebies. From my one very small point of view, with an emphasis on MSRC-proofing, I would be much happier if Utf8Char and ReadOnlySpan<Utf8Char> were guaranteed to be valid always. Otherwise much of the value seems to go away, for me.

For the chunking use case, I'd propose some sort of intermediate representation that would convey the semantic intent of "chunk of a well-formed ReadOnlySpan<Utf8Char> supersequence".

Am I making sense? In my mind, all the concepts in this discussion are related to the concept of 'input taints', but encoded in the type system rather than naming convention.
@terrajobst

The UTF-8 code unit type Utf8Char does not attempt to validate its input (snip) In the above example, this creates a Utf8Char instance with the value C0, even though the Unicode Specification expressly states that C0 is never a valid value for a UTF-8 code unit.

Framework APIs which operate on ReadOnlySpan<Utf8Char> should validate the input for well-formedness if such checks do not impose a hardship on the method implementation. There may be certain performance-sensitive routines which cannot incur that cost; such routines may assume the input is well-formed and may have undefined behavior if this invariant is violated, short of that behavior causing an access violation or other runtime corruption. For example, if a routine is given the single-element input [ C2 ], it mustn't attempt to read off the end of the source buffer. Routines which require well-formed input must be contracted as such. Chunking APIs (discussed later) must at the very least continue to check for boundary conditions, even if they don't check for other ill-formedness in the sequence.

GrabYourPitchforks commented 5 years ago

I would be much happier if Utf8Char and ReadOnlySpan<Utf8Char> were guaranteed to be valid always.

Would be nice, but unfortunately it's impractical. It's analogous to Char: individual elements are never guaranteed to have any semantic meaning in isolation. It's only when you have a string of them that they mean something.

If we ignore language semantics for now and look only at well-formedness, the sequence [ D800 ] is not well-formed, but the sequence [ D800 DC00 ] is well-formed. We don't want to prevent construction of the standalone char 0xD800, because even though it's ill-formed in isolation it can be well-formed when it exists as part of a larger sequence. Similarly, the Utf8Char 0xC2 is ill-formed in isolation, but it can be well-formed when part of the larger sequence [ C2 80 ]. We don't want to prevent construction of the building block itself.
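A minimal sketch of the building-block point, using APIs that ship today (`Rune.DecodeFromUtf16` and `Rune.DecodeFromUtf8` in `System.Text`, .NET Core 3.0 and later). Utf8Char is only proposed, so `byte` stands in for it here; that substitution, and the `BuildingBlocks` helper class, are assumptions made for illustration.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper class, not part of the proposal.
public static class BuildingBlocks
{
    // Reports whether the start of a UTF-16 buffer decodes to a scalar value.
    public static OperationStatus DecodeUtf16(ReadOnlySpan<char> source) =>
        Rune.DecodeFromUtf16(source, out _, out _);

    // Same check for UTF-8; byte stands in for the proposed Utf8Char.
    public static OperationStatus DecodeUtf8(ReadOnlySpan<byte> source) =>
        Rune.DecodeFromUtf8(source, out _, out _);
}
```

With these helpers, `[ D800 ]` and `[ C2 ]` each report `OperationStatus.NeedMoreData` (incomplete in isolation), while `[ D800 DC00 ]` and `[ C2 80 ]` report `OperationStatus.Done`. The lone code units can still be constructed and stored; only the decode step judges the sequence.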

We also can't stop anybody from allocating a new Utf8Char[], populating it with garbage, and projecting a ReadOnlySpan<Utf8Char> over it. Though to be honest this doesn't really concern me that much. Nothing technically stops people from doing the same thing with char[] and ReadOnlySpan<char> today, but it's just not something developers really feel compelled to do in practice. char[] and string factory methods are so much more convenient that developers are more inclined to use those. I see this same thing happening in the UTF-8 world: use the factories and you don't have to worry about any of this.

Levi, how exactly does this mesh with Rune?

Rune, on the other hand, corresponds precisely to a Unicode scalar value. The well-formed UTF-8 sequence [ C2 80 ] will correspond to the Rune U+0080; the well-formed UTF-16 sequence [ D800 DC00 ] will correspond to the Rune U+10000. The transformation is reversible without loss of fidelity.

Consider the following transforms.

  1. Start with a well-formed UTF-8 sequence a.
  2. Decode that into a Rune sequence b.
  3. Encode b into a (UTF-16) char sequence c.
  4. Decode c into a Rune sequence d.
  5. Encode d into a UTF-8 sequence e.

At the end of this process, a and e will be bit-for-bit identical, and b and d will be element-for-element identical. Additionally, even though a and c aren't bit-for-bit identical, they can be said to represent the same data (because as demonstrated you can translate back and forth between the two losslessly).
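The five steps above can be sketched with the existing `Rune` decode and encode methods. This is an illustration under stated assumptions, not the proposed API: `byte` stands in for Utf8Char, and the `RoundTrip` class and its method names are hypothetical.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class illustrating steps 1 through 5.
public static class RoundTrip
{
    // Steps 1-2: decode a well-formed UTF-8 sequence into Runes.
    public static List<Rune> DecodeUtf8(ReadOnlySpan<byte> utf8)
    {
        var runes = new List<Rune>();
        while (!utf8.IsEmpty)
        {
            if (Rune.DecodeFromUtf8(utf8, out Rune rune, out int consumed) != OperationStatus.Done)
                throw new ArgumentException("Input is not well-formed UTF-8.");
            runes.Add(rune);
            utf8 = utf8.Slice(consumed);
        }
        return runes;
    }

    // Step 3: encode Runes into UTF-16 code units.
    public static char[] EncodeUtf16(List<Rune> runes)
    {
        var chars = new List<char>();
        Span<char> scratch = stackalloc char[2]; // a Rune needs at most 2 chars
        foreach (Rune rune in runes)
        {
            int written = rune.EncodeToUtf16(scratch);
            for (int i = 0; i < written; i++) chars.Add(scratch[i]);
        }
        return chars.ToArray();
    }

    // Step 4: decode UTF-16 code units back into Runes.
    public static List<Rune> DecodeUtf16(ReadOnlySpan<char> utf16)
    {
        var runes = new List<Rune>();
        while (!utf16.IsEmpty)
        {
            if (Rune.DecodeFromUtf16(utf16, out Rune rune, out int consumed) != OperationStatus.Done)
                throw new ArgumentException("Input is not well-formed UTF-16.");
            runes.Add(rune);
            utf16 = utf16.Slice(consumed);
        }
        return runes;
    }

    // Step 5: encode Runes into UTF-8 code units.
    public static byte[] EncodeUtf8(List<Rune> runes)
    {
        var bytes = new List<byte>();
        Span<byte> scratch = stackalloc byte[4]; // a Rune needs at most 4 bytes
        foreach (Rune rune in runes)
        {
            int written = rune.EncodeToUtf8(scratch);
            for (int i = 0; i < written; i++) bytes.Add(scratch[i]);
        }
        return bytes.ToArray();
    }

    // Runs a -> b -> c -> d -> e and returns e.
    public static byte[] Run(byte[] a)
    {
        List<Rune> b = DecodeUtf8(a);
        char[] c = EncodeUtf16(b);
        List<Rune> d = DecodeUtf16(c);
        return EncodeUtf8(d);
    }
}
```

For a well-formed input such as `[ 41 C2 80 F0 90 80 80 ]`, `RoundTrip.Run` returns the same bytes it was given.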

If a did not begin as a well-formed sequence, then something would get lost in translation and a and e would not be bit-for-bit identical at the end.
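As an illustration of that loss, the sketch below round-trips bytes through UTF-16 using the existing `Encoding.UTF8` instance, which substitutes U+FFFD for ill-formed input rather than throwing. This swaps in `Encoding` for the Rune-based steps above, and the `LossyRoundTrip` class is a hypothetical name for this example.

```csharp
using System;
using System.Text;

// Hypothetical helper class showing what gets "lost in translation".
public static class LossyRoundTrip
{
    // UTF-8 -> UTF-16 -> UTF-8 via the replacing (non-throwing) Encoding.UTF8.
    public static byte[] ThroughUtf16(byte[] utf8)
    {
        string utf16 = Encoding.UTF8.GetString(utf8); // ill-formed bytes become U+FFFD
        return Encoding.UTF8.GetBytes(utf16);
    }
}
```

Feeding it the ill-formed sequence `[ C0 41 ]` yields `[ EF BF BD 41 ]`: the C0 byte has been replaced by the UTF-8 encoding of U+FFFD, so the output no longer matches the input. A well-formed input such as `[ C2 80 ]` comes back unchanged.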

Because the Rune constructor is validating, it is impossible for somebody to construct a bogus Rune instance without dropping down to unsafe-equivalent code. That we can assume all instances are well-formed also gives us some decent perf benefits in various methods by allowing us to elide correctness checks.
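A short sketch of what "validating" means in practice, using the shipped `Rune(int)` constructor and `Rune.TryCreate`. The `RuneValidation` wrapper class is hypothetical and exists only to make the behavior easy to call.

```csharp
using System;
using System.Text;

// Hypothetical wrapper around the real Rune construction APIs.
public static class RuneValidation
{
    // Non-throwing check: true only for Unicode scalar values.
    public static bool CanCreate(int value) => Rune.TryCreate(value, out _);

    // The constructor rejects surrogates and out-of-range values.
    public static bool ConstructorThrows(int value)
    {
        try
        {
            _ = new Rune(value);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }
}
```

Surrogate code points such as 0xD800 and values above 0x10FFFF are rejected, while 0x80 and 0x10000 are accepted. Contrast this with `char`, where `(char)0xD800` is always constructible.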

Am I making sense? In my mind, all the concepts in this discussion are related to the concept of 'input taints', but encoded in the type system rather than naming convention.

That's pretty much accurate. Barring what I mentioned earlier about somebody fabricating a bogus char[], a sequence of char means "somebody somewhere said that this buffer represents well-formed UTF-16 data, or at least it's a slice of a larger piece of well-formed UTF-16 data." A sequence of Utf8Char ostensibly means the same thing, but for UTF-8.

ericsampson commented 5 years ago

Thanks Levi, that helps a lot in framing things correctly in my mind. Would it make sense to have an API design convention that the surface area of APIs should generally prefer to expose Runes to consumers and reserve Utf8Char/ReadOnlySpan<Utf8Char> for internal use (other than code that directly interacts with the wire)? Just picturing trying to keep the unsafeness to a minimum surface area instead of it spreading all over random APIs/user-code/SO examples etc. that don't really need the absolute last % of perf but would be better off safe. I guess this is one of those performance vs. safety tradeoffs (trusting the general dev population to do the right thing vs. making the wrong thing impossible and accepting a certain performance penalty). Cheers

terrajobst commented 5 years ago

Video

It seems as a next step we should have a meeting with the language folks.

Neme12 commented 1 year ago

It's a shame this didn't get done :(

Neme12 commented 1 year ago

Are there currently any methods in .NET that would encapsulate this functionality (i.e. stuff like ToUpper, Contains, equality, etc.) for UTF-8 ReadOnlySpan<byte>? I can't find anything.