dotnet / corefxlab

This repo is for experimentation and exploring new ideas that may or may not make it into the main corefx repo.

Utf8String design proposal #2350

Closed GrabYourPitchforks closed 3 years ago

GrabYourPitchforks commented 6 years ago

Utf8String design discussion - last edited 14-Sep-19

Utf8String design overview

Audience and scenarios

Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or I/O in general, really). Today such applications spend CPU cycles and memory transcoding incoming UTF-8 data into UTF-16 representations they don't otherwise need, then transcoding back again on the way out.

A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular byte[] instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant();. This defeats the purpose of using a typed language.

We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a full drop-in replacement for its sibling type string. For example, we might add Utf8String-related overloads to existing APIs in the System.IO namespace, but we wouldn't add an overload Assembly.LoadFrom(Utf8String assemblyName).

In addition to networking and I/O scenarios, it's expected that there will be an audience who will want to use Utf8String for interop scenarios, especially when interoperating with components written in Rust or Go. Both of these languages use UTF-8 as their native string representation, and providing a type which can be used as a data exchange type for that audience will make their scenarios a bit easier.

Finally, we should afford power developers the opportunity to improve their throughput and memory utilization by limiting data copying where feasible. This doesn't imply that we must be allocation-free or zero-copy for every scenario. But it does imply that we should investigate common operations and consider alternative ways of performing these tasks as long as it doesn't compromise the usability of the mainline scenarios.

It's important to call out that Utf8String is not intended to be a replacement for string. The standard UTF-16 string will remain the core primitive type used throughout the .NET ecosystem and will enjoy the largest supported API surface area. We expect that developers who use Utf8String in their code bases will do so deliberately, either because they're working in one of the aforementioned scenarios or because they find other aspects of Utf8String (such as its API surface or behavior guarantees) desirable.

Design decisions and type API

To make internal Utf8String implementation details easier, and to allow consumers to better reason about the type's behavior, the Utf8String type maintains the following invariants:

  1. Instances contain only well-formed UTF-8 data. Ill-formed input is rejected at creation time (or, for the "loose" factories, fixed up via U+FFFD substitution).

  2. Instances are immutable once created.

  3. The underlying data is null-terminated, which allows cheap pinning and passing across p/invoke boundaries.

These invariants help shape the proposed API and usage examples as described throughout this document.

[Serializable]
public sealed class Utf8String : IComparable<Utf8String>, IEquatable<Utf8String>, ISerializable
{
    public static readonly Utf8String Empty; // matches String.Empty

    /*
     * CTORS AND FACTORIES
     *
     * These ctors all have "throw on invalid data" behavior since it's intended that data should
     * be faithfully retained and should be round-trippable back to its original encoding.
     */

    public Utf8String(byte[]? value, int startIndex, int length);
    public Utf8String(char[]? value, int startIndex, int length);
    public Utf8String(ReadOnlySpan<byte> value);
    public Utf8String(ReadOnlySpan<char> value);
    public Utf8String(string value);

    // These ctors expect null-terminated UTF-8 or UTF-16 input.
    // They'll compute strlen / wcslen on the caller's behalf.

    public unsafe Utf8String(byte* value);
    public unsafe Utf8String(char* value);

    public static Utf8String Create<TState>(int length, TState state, SpanAction<byte, TState> action);

    // "Try" factories are non-throwing equivalents of the above methods. They use a try pattern instead
    // of throwing if invalid input is detected.

    public static bool TryCreateFrom(ReadOnlySpan<byte> buffer, out Utf8String? value);
    public static bool TryCreateFrom(ReadOnlySpan<char> buffer, out Utf8String? value);

    // "Loose" factories also perform validation, but if an invalid sequence is detected they'll
    // silently fix it up by performing U+FFFD substitution in the returned Utf8String instance
    // instead of throwing.

    public static Utf8String CreateFromLoose(ReadOnlySpan<byte> buffer);
    public static Utf8String CreateFromLoose(ReadOnlySpan<char> buffer);
    public static Utf8String CreateLoose<TState>(int length, TState state, SpanAction<byte, TState> action);

    // "Unsafe" factories skip validation entirely. It's up to the caller to uphold the invariant
    // that Utf8String instances only ever contain well-formed UTF-8 data.

    [RequiresUnsafe]
    public static Utf8String UnsafeCreateWithoutValidation(ReadOnlySpan<byte> utf8Contents);
    [RequiresUnsafe]
    public static Utf8String UnsafeCreateWithoutValidation<TState>(int length, TState state, SpanAction<byte, TState> action);

    /*
     * ENUMERATION
     *
     * Since there's no this[int] indexer on Utf8String, these properties allow enumeration
     * of the contents as UTF-8 code units (Bytes), as UTF-16 code units (Chars), or as
     * Unicode scalar values (Runes). The enumerable struct types are defined at the bottom
     * of this type.
     */

    public ByteEnumerable Bytes { get; }
    public CharEnumerable Chars { get; }
    public RuneEnumerable Runes { get; }

    // Also allow iterating over extended grapheme clusters (not yet ready).
    // public GraphemeClusterEnumerable GraphemeClusters { get; }

    /*
     * COMPARISON
     *
     * All comparisons are Ordinal unless the API takes a parameter such
     * as a StringComparison or CultureInfo.
     */

    // The "AreEquivalent" APIs compare UTF-8 data against UTF-16 data for equivalence, where
    // equivalence is defined as "the texts would transcode as each other".
    // (Shouldn't these methods really be on a separate type?)

    public static bool AreEquivalent(Utf8String? utf8Text, string? utf16Text);
    public static bool AreEquivalent(Utf8Span utf8Text, ReadOnlySpan<char> utf16Text);
    public static bool AreEquivalent(ReadOnlySpan<byte> utf8Text, ReadOnlySpan<char> utf16Text);

    public int CompareTo(Utf8String? other);
    public int CompareTo(Utf8String? other, StringComparison comparisonType);

    public override bool Equals(object? obj); // 'obj' must be Utf8String, not string
    public static bool Equals(Utf8String? left, Utf8String? right);
    public static bool Equals(Utf8String? left, Utf8String? right, StringComparison comparisonType);
    public bool Equals(Utf8String? value);
    public bool Equals(Utf8String? value, StringComparison comparisonType);

    public static bool operator !=(Utf8String? left, Utf8String? right);
    public static bool operator ==(Utf8String? left, Utf8String? right);

    /*
     * SEARCHING
     *
     * Like comparisons, all searches are Ordinal unless the API takes a
     * parameter dictating otherwise.
     */

    public bool Contains(char value);
    public bool Contains(char value, StringComparison comparisonType);
    public bool Contains(Rune value);
    public bool Contains(Rune value, StringComparison comparisonType);
    public bool Contains(Utf8String value);
    public bool Contains(Utf8String value, StringComparison comparisonType);

    public bool EndsWith(char value);
    public bool EndsWith(char value, StringComparison comparisonType);
    public bool EndsWith(Rune value);
    public bool EndsWith(Rune value, StringComparison comparisonType);
    public bool EndsWith(Utf8String value);
    public bool EndsWith(Utf8String value, StringComparison comparisonType);

    public bool StartsWith(char value);
    public bool StartsWith(char value, StringComparison comparisonType);
    public bool StartsWith(Rune value);
    public bool StartsWith(Rune value, StringComparison comparisonType);
    public bool StartsWith(Utf8String value);
    public bool StartsWith(Utf8String value, StringComparison comparisonType);

    // TryFind is the equivalent of IndexOf. It returns a Range instead of an integer
    // index because there's no this[int] indexer on the Utf8String type, and encouraging
    // developers to slice by integer indices will almost certainly lead to bugs.
    // More on this later.

    public bool TryFind(char value, out Range range);
    public bool TryFind(char value, StringComparison comparisonType, out Range range);
    public bool TryFind(Rune value, out Range range);
    public bool TryFind(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFind(Utf8String value, out Range range);
    public bool TryFind(Utf8String value, StringComparison comparisonType, out Range range);

    public bool TryFindLast(char value, out Range range);
    public bool TryFindLast(char value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Rune value, out Range range);
    public bool TryFindLast(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Utf8String value, out Range range);
    public bool TryFindLast(Utf8String value, StringComparison comparisonType, out Range range);

    /*
     * SLICING
     *
     * All slicing operations uphold the "well-formed data" invariant and
     * validate that creating the new substring instance will not split a
     * multi-byte UTF-8 subsequence. This check is O(1).
     */

    public Utf8String this[Range range] { get; }

    public (Utf8String Before, Utf8String? After) SplitOn(char separator);
    public (Utf8String Before, Utf8String? After) SplitOn(char separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOn(Rune separator);
    public (Utf8String Before, Utf8String? After) SplitOn(Rune separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOn(Utf8String separator);
    public (Utf8String Before, Utf8String? After) SplitOn(Utf8String separator, StringComparison comparisonType);

    public (Utf8String Before, Utf8String? After) SplitOnLast(char separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(char separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Rune separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Rune separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Utf8String separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Utf8String separator, StringComparison comparisonType);

    /*
     * INSPECTION & MANIPULATION
     */

    // some number of overloads to help avoid allocation in the common case
    public static Utf8String Concat<T>(params IEnumerable<T> values);
    public static Utf8String Concat<T0, T1>(T0 value0, T1 value1);
    public static Utf8String Concat<T0, T1, T2>(T0 value0, T1 value1, T2 value2);

    public bool IsAscii();

    public bool IsNormalized(NormalizationForm normalizationForm = NormalizationForm.FormC);

    public static Utf8String Join<T>(char separator, params IEnumerable<T> values);
    public static Utf8String Join<T>(Rune separator, params IEnumerable<T> values);
    public static Utf8String Join<T>(Utf8String? separator, params IEnumerable<T> values);

    public Utf8String Normalize(NormalizationForm normalizationForm = NormalizationForm.FormC);

    // Do we also need Insert, Remove, etc.?

    public Utf8String Replace(char oldChar, char newChar); // Ordinal
    public Utf8String Replace(char oldChar, char newChar, StringComparison comparison);
    public Utf8String Replace(char oldChar, char newChar, bool ignoreCase, CultureInfo culture);
    public Utf8String Replace(Rune oldRune, Rune newRune); // Ordinal
    public Utf8String Replace(Rune oldRune, Rune newRune, StringComparison comparison);
    public Utf8String Replace(Rune oldRune, Rune newRune, bool ignoreCase, CultureInfo culture);
    public Utf8String Replace(Utf8String oldText, Utf8String newText); // Ordinal
    public Utf8String Replace(Utf8String oldText, Utf8String newText, StringComparison comparison);
    public Utf8String Replace(Utf8String oldText, Utf8String newText, bool ignoreCase, CultureInfo culture);

    public Utf8String ToLower(CultureInfo culture);
    public Utf8String ToLowerInvariant();

    public Utf8String ToUpper(CultureInfo culture);
    public Utf8String ToUpperInvariant();

    // The Trim* APIs only trim whitespace for now. When we figure out how to trim
    // additional data we can add the appropriate overloads.

    public Utf8String Trim();
    public Utf8String TrimStart();
    public Utf8String TrimEnd();

    /*
     * PROJECTING
     */

    public ReadOnlySpan<byte> AsBytes(); // perhaps an extension method instead?
    public static explicit operator ReadOnlySpan<byte>(Utf8String? value);
    public static implicit operator Utf8Span(Utf8String? value);

    /*
     * MISCELLANEOUS
     */

    public override int GetHashCode(); // Ordinal
    public int GetHashCode(StringComparison comparisonType);

    // Used for pinning and passing to p/invoke. If the input Utf8String
    // instance is empty, returns a reference to the null terminator.

    [EditorBrowsable(EditorBrowsableState.Never)]
    public ref readonly byte GetPinnableReference();

    public static bool IsNullOrEmpty(Utf8String? value);
    public static bool IsNullOrWhiteSpace(Utf8String? value);

    public override string ToString(); // transcode to UTF-16

    /*
     * SERIALIZATION
     * (Throws an exception on deserialization if data is invalid.)
     */

    // Could also use an IObjectReference if we didn't want to implement the deserialization ctor.
    private Utf8String(SerializationInfo info, StreamingContext context);
    void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context);

    /*
     * HELPER NESTED STRUCTS
     */

    public readonly struct ByteEnumerable : IEnumerable<byte> { /* ... */ }
    public readonly struct CharEnumerable : IEnumerable<char> { /* ... */ }
    public readonly struct RuneEnumerable : IEnumerable<Rune> { /* ... */ }
}

public static class MemoryExtensions
{
    public static ReadOnlyMemory<byte> AsMemory(this Utf8String value);
    public static ReadOnlyMemory<byte> AsMemory(this Utf8String value, int offset);
    public static ReadOnlyMemory<byte> AsMemory(this Utf8String value, int offset, int count);
}
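
As a quick illustration of the enumeration properties proposed above (a sketch only; the u"..." literal syntax is described later in this document):

Utf8String text = u"naïve";

foreach (byte b in text.Bytes) { /* visits 6 UTF-8 code units */ }
foreach (char c in text.Chars) { /* visits 5 UTF-16 code units */ }
foreach (Rune r in text.Runes) { /* visits 5 Unicode scalar values */ }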

Non-allocating types

While Utf8String is an allocating, heap-based, null-terminated type, there are scenarios where a developer may want to represent a segment (or "slice") of UTF-8 data from an existing buffer without incurring an allocation.

The Utf8Segment (alternative name: Utf8Memory) and Utf8Span types can be used for this purpose. They represent a view into UTF-8 data, with the following guarantees:

  1. The viewed data is well-formed UTF-8, just as with Utf8String.

  2. A view never begins or ends in the middle of a multi-byte UTF-8 subsequence.

  3. The underlying buffer is assumed to remain alive and immutable for the lifetime of the view.

These types have Utf8String-like methods hanging off of them as instance methods where appropriate. Additionally, they can be projected as ROM<byte> and ROS<byte> for developers who want to deal with the data at the raw binary level or who want to call existing extension methods on the ROM and ROS types.

Since Utf8Segment and Utf8Span are standalone types distinct from ROM and ROS, they can have behaviors that developers have come to expect from string-like types. For example, Utf8Segment (unlike ROM<char> or ROM<byte>) can be used as a key in a dictionary without jumping through hoops:

Dictionary<Utf8Segment, int> dict = ...;

Utf8String theString = u"hello world";
Utf8Segment segment = theString.AsMemory(0, 5); // u"hello"

if (dict.TryGetValue(segment, out int value))
{
    Console.WriteLine(value);
}

Utf8Span instances can be compared against each other:

Utf8Span data1 = ...;
Utf8Span data2 = ...;

int hashCode = data1.GetHashCode(); // Marvin32 hash

if (data1 == data2) { /* ordinal comparison of contents */ }

An alternative design that was considered was to introduce a type Char8 that would represent an 8-bit code unit - it would serve as the elemental type of Utf8String and its slices. However, ReadOnlyMemory<Char8> and ReadOnlySpan<Char8> were a bit unwieldy for a few reasons.

First, there was confusion as to what ROS<Char8> actually meant when the developer could use ROS<byte> for everything. Was ROS<Char8> actually providing guarantees that ROS<byte> couldn't? (No.) When would I ever want to use a lone Char8 by itself rather than as part of a larger sequence? (You probably wouldn't.)

Second, it introduced a complication that if you had a ROM<Char8>, it couldn't be converted to a ROM<byte>. This impacted the ability to perform text manipulation and then act on the data in a binary fashion, such as sending it across the network.

Creating segment types

Segment types can be created safely from Utf8String backing objects. As mentioned earlier, we enforce that data in the UTF-8 segment types is well-formed. This implies that an instance of a segment type cannot represent data that has been sliced in the middle of a multibyte boundary. Calls to slicing APIs will throw an exception if the caller tries to slice the data in such a manner.

The Utf8Segment type introduces additional complexity in that it could be torn in a multi-threaded application, and that tearing may invalidate the well-formedness assumption by causing the torn segment to begin or end in the middle of a multi-byte UTF-8 subsequence. To resolve this issue, any instance method on Utf8Segment (including its projection to ROM<byte>) must first validate that the instance has not been torn. If the instance has been torn, an exception is thrown. This check is O(1) algorithmic complexity.

It is possible that the developer will want to create a Utf8Segment or Utf8Span instance from an existing buffer (such as a pooled buffer). There are zero-cost APIs to allow this to be done; however, they are unsafe because they easily allow the developer to violate invariants held by these types.

If the developer wishes to call the unsafe factories, they must ensure that the following three invariants hold. (An illustrative sketch of such a call follows the list below.)

  1. The provided buffer (ROM<byte> or ROS<byte>) remains "alive" and immutable for the duration of the Utf8Segment or Utf8Span's existence. Whichever component receives a Utf8Segment or Utf8Span - however the instance has been created - must never observe the underlying contents changing, and dereferencing the contents must never result in an access violation or other undefined behavior.

  2. The provided buffer contains only well-formed UTF-8 data, and the boundaries of the buffer do not split a multibyte UTF-8 sequence.

  3. For Utf8Segment in particular, the caller must not create a Utf8Segment instance wrapped around a ROM<byte> in circumstances where the component which receives the newly created Utf8Segment might tear it. The reason for this is that the "check that the Utf8Segment instance was not torn across a multi-byte subsequence" protection is only reliable when the Utf8Segment instance is backed by a Utf8String. The Utf8Segment type makes a best effort to offer protection for other backing buffers, but this protection is not ironclad in those scenarios. This could lead to a violation of invariant (2) immediately above.
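
As a concrete illustration, here is a hedged sketch of wrapping a pooled buffer with the unsafe span factory proposed below. The ReceiveUtf8Payload, IsWellFormedUtf8, and Process helpers are hypothetical; the point is that upholding the invariants falls entirely on the caller, not on the factory.

byte[] rented = ArrayPool<byte>.Shared.Rent(4096);
try
{
    int bytesWritten = ReceiveUtf8Payload(rented); // hypothetical fill routine
    ReadOnlySpan<byte> payload = rented.AsSpan(0, bytesWritten);

    // Invariant (2): the caller validates well-formedness up front, because the factory won't.
    if (!IsWellFormedUtf8(payload)) { throw new InvalidDataException(); } // hypothetical validator

    // Invariant (1): 'rented' must stay alive and unmodified (and must not be returned to the
    // pool) while 'span' or anything derived from it is still in use.
    Utf8Span span = Utf8Span.UnsafeCreateWithoutValidation(payload);
    Process(span); // hypothetical consumer; must not retain the span beyond this scope
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}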

The type design here - including the constraints placed on segment types and the elimination of the Char8 type - also draws inspiration from the Go, Swift, and Rust communities.

public readonly ref struct Utf8Span
{
    public Utf8Span(Utf8String? value);

    // This "Unsafe" ctor wraps a Utf8Span around an arbitrary span. It is non-copying.
    // The caller must uphold Utf8Span's invariants: that it's immutable and well-formed
    // for the lifetime that any component might be consuming the Utf8Span instance.
    // Consumers (and Utf8Span's own internal APIs) rely on this invariant, and
    // violating it could lead to undefined behavior at runtime.

    [RequiresUnsafe]
    public static Utf8Span UnsafeCreateWithoutValidation(ReadOnlySpan<byte> buffer);

    // The equality operators and GetHashCode() operate on the underlying buffers.
    // Two Utf8Span instances containing the same data will return equal and have
    // the same hash code, even if they're referencing different memory addresses.

    [EditorBrowsable(EditorBrowsableState.Never)]
    [Obsolete("Equals(object) on Utf8Span will always throw an exception. Use Equals(Utf8Span) or == instead.")]
    public override bool Equals(object? obj);
    public bool Equals(Utf8Span other);
    public bool Equals(Utf8Span other, StringComparison comparison);
    public static bool Equals(Utf8Span left, Utf8Span right);
    public static bool Equals(Utf8Span left, Utf8Span right, StringComparison comparison);
    public override int GetHashCode();
    public int GetHashCode(StringComparison comparison);
    public static bool operator !=(Utf8Span left, Utf8Span right);
    public static bool operator ==(Utf8Span left, Utf8Span right);

    // Unlike Utf8String.GetPinnableReference, Utf8Span.GetPinnableReference returns
    // null if the span is zero-length. This is because we're not guaranteed that the
    // backing data has a null terminator at the end, so we don't know whether it's
    // safe to dereference the element just past the end of the span.

    [EditorBrowsable(EditorBrowsableState.Never)]
    public ref readonly byte GetPinnableReference();

    public static Utf8Span Empty { get; }
    public bool IsEmpty { get; }

    public ReadOnlySpan<byte> Bytes { get; } // returns ROS<byte>, not custom enumerable
    public CharEnumerable Chars { get; }
    public RuneEnumerable Runes { get; }

    // Also allow iterating over extended grapheme clusters (not yet ready).
    // public GraphemeClusterEnumerable GraphemeClusters { get; }

    // For the most part, Utf8Span's remaining APIs mirror APIs already on Utf8String.
    // There are some exceptions: methods like ToUpperInvariant have a non-allocating
    // equivalent that allows the caller to specify the buffer which should
    // contain the result of the operation. Like Utf8String, all APIs are assumed
    // Ordinal unless the API takes a parameter which provides otherwise.

    public int CompareTo(Utf8Span other);
    public int CompareTo(Utf8Span other, StringComparison comparison);

    public bool Contains(char value);
    public bool Contains(char value, StringComparison comparison);
    public bool Contains(Rune value);
    public bool Contains(Rune value, StringComparison comparison);
    public bool Contains(Utf8Span value);
    public bool Contains(Utf8Span value, StringComparison comparison);

    public bool EndsWith(char value);
    public bool EndsWith(char value, StringComparison comparison);
    public bool EndsWith(Rune value);
    public bool EndsWith(Rune value, StringComparison comparison);
    public bool EndsWith(Utf8Span value);
    public bool EndsWith(Utf8Span value, StringComparison comparison);

    public bool IsAscii();

    public bool IsEmptyOrWhiteSpace();

    public bool IsNormalized(NormalizationForm normalizationForm = NormalizationForm.FormC);

    public Utf8String Normalize(NormalizationForm normalizationForm = NormalizationForm.FormC);
    public int Normalize(Span<byte> destination, NormalizationForm normalizationForm = NormalizationForm.FormC);

    public Utf8Span this[Range range] { get; }

    public SplitResult SplitOn(char separator);
    public SplitResult SplitOn(char separator, StringComparison comparisonType);
    public SplitResult SplitOn(Rune separator);
    public SplitResult SplitOn(Rune separator, StringComparison comparisonType);
    public SplitResult SplitOn(Utf8String separator);
    public SplitResult SplitOn(Utf8String separator, StringComparison comparisonType);

    public SplitResult SplitOnLast(char separator);
    public SplitResult SplitOnLast(char separator, StringComparison comparisonType);
    public SplitResult SplitOnLast(Rune separator);
    public SplitResult SplitOnLast(Rune separator, StringComparison comparisonType);
    public SplitResult SplitOnLast(Utf8String separator);
    public SplitResult SplitOnLast(Utf8String separator, StringComparison comparisonType);

    public bool StartsWith(char value);
    public bool StartsWith(char value, StringComparison comparison);
    public bool StartsWith(Rune value);
    public bool StartsWith(Rune value, StringComparison comparison);
    public bool StartsWith(Utf8Span value);
    public bool StartsWith(Utf8Span value, StringComparison comparison);

    public int ToChars(Span<char> destination);

    public Utf8String ToLower(CultureInfo culture);
    public int ToLower(Span<byte> destination, CultureInfo culture);

    public Utf8String ToLowerInvariant();
    public int ToLowerInvariant(Span<byte> destination);

    public override string ToString();

    public Utf8String ToUpper(CultureInfo culture);
    public int ToUpper(Span<byte> destination, CultureInfo culture);

    public Utf8String ToUpperInvariant();
    public int ToUpperInvariant(Span<byte> destination);

    public Utf8String ToUtf8String();

    // Should we also have Trim* overloads that return a range instead
    // of the span directly? Does this actually enable any new scenarios?

    public Utf8Span Trim();
    public Utf8Span TrimStart();
    public Utf8Span TrimEnd();

    public bool TryFind(char value, out Range range);
    public bool TryFind(char value, StringComparison comparisonType, out Range range);
    public bool TryFind(Rune value, out Range range);
    public bool TryFind(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFind(Utf8Span value, out Range range);
    public bool TryFind(Utf8Span value, StringComparison comparisonType, out Range range);

    public bool TryFindLast(char value, out Range range);
    public bool TryFindLast(char value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Rune value, out Range range);
    public bool TryFindLast(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Utf8Span value, out Range range);
    public bool TryFindLast(Utf8Span value, StringComparison comparisonType, out Range range);

    /*
     * HELPER NESTED STRUCTS
     */

    public readonly ref struct CharEnumerable { /* pattern match for 'foreach' */ }
    public readonly ref struct RuneEnumerable { /* pattern match for 'foreach' */ }

    public readonly ref struct SplitResult
    {
        private SplitResult();

        [EditorBrowsable(EditorBrowsableState.Never)]
        public void Deconstruct(out Utf8Span before, out Utf8Span after);
    }
}

public readonly struct Utf8Segment : IComparable<Utf8Segment>, IEquatable<Utf8Segment>
{
    private readonly ReadOnlyMemory<byte> _data;

    public Utf8Span Span { get; }

    // Not all span-based APIs are present. APIs on Utf8Span that would
    // return a new Utf8Span (such as Trim) should be present here, but
    // other APIs that return bool / int (like Contains, StartsWith)
    // should only be present on the Span type to discourage heavy use
    // of APIs hanging directly off of this type.

    public override bool Equals(object? other); // ok to call
    public bool Equals(Utf8Segment other); // defaults to Ordinal
    public bool Equals(Utf8Segment other, StringComparison comparison);

    public override int GetHashCode(); // Ordinal
    public int GetHashCode(StringComparison comparison);

    // Caller is responsible for ensuring:
    // - Input buffer contains well-formed UTF-8 data.
    // - Input buffer is immutable and accessible for the lifetime of this Utf8Segment instance.
    public static Utf8Segment UnsafeCreateWithoutValidation(ReadOnlyMemory<byte> data);
}

Supporting types

Like StringComparer, there's also a Utf8StringComparer which can be passed into the Dictionary<,> and HashSet<> constructors. This Utf8StringComparer also implements IEqualityComparer<Utf8Segment>, which allows using Utf8Segment instances directly as the keys inside dictionaries and other collection types.

The Dictionary<,> class is also being enlightened to understand that these types have both non-randomized and randomized hash code calculation routines. This allows dictionaries instantiated with TKey = Utf8String or TKey = Utf8Segment to enjoy the same performance optimizations as dictionaries instantiated with TKey = string.

Finally, the Utf8StringComparer type has convenience methods to compare Utf8Span instances against one another. This will make it easier to compare texts using specific cultures, even if that specific culture is not the current thread's active culture.

public abstract class Utf8StringComparer : IComparer<Utf8Segment>, IComparer<Utf8String?>, IEqualityComparer<Utf8Segment>, IEqualityComparer<Utf8String?>
{
    private Utf8StringComparer(); // all implementations are internal

    public static Utf8StringComparer CurrentCulture { get; }
    public static Utf8StringComparer CurrentCultureIgnoreCase { get; }
    public static Utf8StringComparer InvariantCulture { get; }
    public static Utf8StringComparer InvariantCultureIgnoreCase { get; }
    public static Utf8StringComparer Ordinal { get; }
    public static Utf8StringComparer OrdinalIgnoreCase { get; }

    public static Utf8StringComparer Create(CultureInfo culture, bool ignoreCase);
    public static Utf8StringComparer Create(CultureInfo culture, CompareOptions options);
    public static Utf8StringComparer FromComparison(StringComparison comparisonType);

    public abstract int Compare(Utf8Segment x, Utf8Segment y);
    public abstract int Compare(Utf8String? x, Utf8String? y);
    public abstract int Compare(Utf8Span x, Utf8Span y);
    public abstract bool Equals(Utf8Segment x, Utf8Segment y);
    public abstract bool Equals(Utf8String? x, Utf8String? y);
    public abstract bool Equals(Utf8Span x, Utf8Span y);
    public abstract int GetHashCode(Utf8Segment obj);
    public abstract int GetHashCode(Utf8String obj);
    public abstract int GetHashCode(Utf8Span obj);
}
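
As a usage sketch (assuming the comparer and segment types above), a dictionary keyed by Utf8Segment can select a comparer explicitly, and Utf8Span instances can be compared under a specific culture:

// Case-insensitive lookups over UTF-8 segments without allocating strings.
var headers = new Dictionary<Utf8Segment, int>(Utf8StringComparer.OrdinalIgnoreCase);

Utf8Segment name = ...; // e.g., a slice of a larger request buffer
headers[name] = 42;

// Comparing spans under a culture other than the current thread's active culture.
Utf8Span left = ...;
Utf8Span right = ...;
bool equal = Utf8StringComparer.Create(new CultureInfo("tr-TR"), ignoreCase: true).Equals(left, right);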

Manipulating UTF-8 data

CoreFX and Azure scenarios

Sample operations on arbitrary buffers

(Devs may want to perform these operations on arbitrary byte buffers, even if those buffers aren't guaranteed to contain valid UTF-8 data.)

These operations could be on the newly-introduced System.Text.Unicode.Utf8 static class. They would take ROS<byte> and Span<byte> as input parameters because they can operate on arbitrary byte buffers. Their runtime performance would be subpar compared to similar methods on Utf8String, Utf8Span, or other types where we can guarantee that no invalid data will be seen, as the APIs which operate on raw byte buffers would need to be defensive and would probably operate over the input in an iterative fashion rather than in bulk. One potential behavior could be skipping over invalid data and leaving it unchanged as part of the operation.
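
To make the "iterative rather than bulk" shape concrete, below is a hedged sketch (not a proposed signature) of an uppercase transform over an arbitrary byte buffer that passes ill-formed sequences through unchanged. It assumes the destination has already been sized to hold the result; Rune lives in System.Text and OperationStatus in System.Buffers.

// Sketch: walk arbitrary bytes rune-by-rune. Well-formed sequences are uppercased
// (invariant culture); ill-formed sequences are copied through verbatim instead of throwing.
int ToUpperInvariantLenient(ReadOnlySpan<byte> source, Span<byte> destination)
{
    int written = 0;
    while (!source.IsEmpty)
    {
        OperationStatus status = Rune.DecodeFromUtf8(source, out Rune rune, out int consumed);
        if (status == OperationStatus.Done)
        {
            written += Rune.ToUpperInvariant(rune).EncodeToUtf8(destination.Slice(written));
        }
        else
        {
            // Invalid or truncated sequence: leave the offending bytes unchanged.
            source.Slice(0, consumed).CopyTo(destination.Slice(written));
            written += consumed;
        }
        source = source.Slice(consumed);
    }
    return written;
}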

Sample Utf8StringBuilder implementation for private use

internal ref struct Utf8StringBuilder
{
    public void Append<T>(T value) where T : IUtf8Formattable;
    public void Append<T>(T value, string format, CultureInfo culture) where T : IUtf8Formattable;

    public void Append(Utf8String value);
    public void Append(Utf8Segment value);
    public void Append(Utf8Span value);

    // Some other Append methods, resize methods, etc.
    // Methods to query the length.

    public Utf8String ToUtf8String();

    public void Dispose(); // when done with the instance
}

// Would be implemented by numeric types (int, etc.),
// DateTime, String, Utf8String, Guid, other primitives,
// Uri, and anything else we might want to throw into
// interpolated data.
internal interface IUtf8Formattable
{
    void Append(ref Utf8StringBuilder builder);
    void Append(ref Utf8StringBuilder builder, string format, CultureInfo culture);
}
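
A hedged sketch of how the builder might be consumed, assuming the numeric and date types implement IUtf8Formattable as suggested above (construction details are elided; a default-constructed builder is assumed usable here):

Utf8StringBuilder builder = new Utf8StringBuilder();
try
{
    builder.Append(u"Processed ");
    builder.Append(42);                                                 // assumes int : IUtf8Formattable
    builder.Append(u" records at ");
    builder.Append(DateTime.UtcNow, "O", CultureInfo.InvariantCulture); // assumes DateTime : IUtf8Formattable
    Utf8String message = builder.ToUtf8String();
}
finally
{
    builder.Dispose();
}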

Code samples and metadata representation

The C# compiler could detect support for UTF-8 strings by looking for the existence of the System.Utf8String type and the appropriate helper APIs on RuntimeHelpers as called out in the samples below. If these APIs don't exist, then the target framework does not support the concept of UTF-8 strings.

Literals

Literal UTF-8 strings would appear as regular strings in source code, but would be prefixed by a u as demonstrated below. The u prefix would denote that the return type of this literal string expression should be Utf8String instead of string.

Utf8String myUtf8String = u"A literal string!";
// Normal ldstr to literal UTF-16 string in PE string table, followed by
// call to helper method which translates this to a UTF-8 string literal.
// The end result of these calls is that a Utf8String instance sits atop
// the stack.

ldstr "A literal string!"
call class System.Utf8String System.Runtime.CompilerServices.RuntimeHelpers.InitializeUtf8StringLiteral(string)

The u prefix would also be combinable with the @ prefix and the $ prefix (more on this below).

Additionally, literal UTF-8 strings must be well-formed Unicode strings.

// Below line would be a compile-time error since it contains ill-formed Unicode data.
Utf8String myUtf8String = u"A malformed \ud800 literal string!";

Three alternative designs were considered. One was to use RVA statics (through ldsflda) instead of literal UTF-16 strings (through ldstr) before calling a "load from RVA" method on RuntimeHelpers. The overhead of using RVA statics is somewhat greater than the overhead of using the normal UTF-16 string table, so the normal UTF-16 string literal table should still be the more optimized case for small-ish strings, which we believe to be the common case.

Another alternative considered was to introduce a new opcode ldstr.utf8, which would act as a UTF-8 equivalent to the normal ldstr opcode. This would be a breaking change to the .NET tooling ecosystem, and the ultimate decision was that there would be too much pain to the ecosystem to justify the benefit.

The third alternative considered was to smuggle UTF-8 data in through a normal UTF-16 string in the string table, then call a RuntimeHelpers method to reinterpret the contents. This would result in a "garbled" string for anybody looking at the raw IL. While that in itself isn't terrible, there is the possibility that smuggling UTF-8 data in this manner could result in a literal string which has ill-formed UTF-16 data. Not all .NET tooling is resilient to this. For example, xunit's test runner produces failures if it sees attributes initialized from literal strings containing ill-formed UTF-16 data. There is a risk that other tooling would behave similarly, potentially modifying the DLL in such a manner that errors only manifest themselves at runtime. This could result in difficult-to-diagnose bugs.

We may wish to reconsider this decision in the future. For example, if we see that it is common for developers to use large UTF-8 literal strings, maybe we'd want to dynamically switch to using RVA statics for such strings. This would lower the resulting DLL size. However, this would add extra complexity to the compilation process, so we'd want to tread lightly here.

Constant handling

class MyClass
{
    public const Utf8String MyConst = u"A const string!";
}
// Literal field initialized to literal UTF-16 value. The runtime doesn't care about
// this (modulo FieldInfo.GetRawConstantValue, which perhaps we could fix up), so
// only the C# compiler would need to know that this is a UTF-8 constant and that
// references to it should get the same (ldstr, call) treatment as stated above.

.field public static literal class System.Utf8String MyConst = "A const string!";

String concatenation

There would be APIs on Utf8String which mirror the string.Concat APIs. The compiler should special-case the + operator to call the appropriate n-ary overload of Concat.

Utf8String a = ...;
Utf8String b = ...;

Utf8String c = a + u", " + b; // calls Utf8String.Concat(...)

Since we expect use of Utf8String to be "deliberate" when compared to string (see the beginning of this document), we should consider that a developer who is using UTF-8 wants to stay in UTF-8 during concatenation operations. This means that if there's a line which involves the concatenation of both a Utf8String and a string, the final type post-concatenation should be Utf8String.

Utf8String a = ...;
string b = ...;

Utf8String concatFoo = a + b;
string concatBar = (object)a + b; // compiler can't statically determine that any argument is Utf8String

This is still open for discussion, as the behavior may be surprising to people. Another alternative is to produce a build warning if somebody tries to mix-and-match UTF-8 strings and UTF-16 strings in a single concatenation expression.

If string interpolation is added in the future, this shouldn't result in ambiguity. The $ interpolation operator will be applied to a literal Utf8String or a literal string, and that would dictate the overall return type of the operation.

Equality comparisons

There are standard == and != operators defined on the Utf8String class.

public static bool operator ==(Utf8String a, Utf8String b);
public static bool operator !=(Utf8String a, Utf8String b);

The C# compiler should special-case when either side of an equality expression is known to be a literal null object, and if so the compiler should emit a referential check against the null object instead of calling the operator method. This matches the if (myString == null) behavior that the string type enjoys today.

Additionally, equality / inequality comparisons between Utf8String and string should produce compiler warnings, as they devolve to a reference comparison that can succeed only when both operands are null.

Utf8String a = ...;
string b = ...;

// Below line should produce a warning since it will end up being the equivalent
// of Object.ReferenceEquals, which will only succeed if both arguments are null.
// This probably wasn't what the developer intended to check.

if (a == b) { /* ... */ }

I attempted to define operator ==(Utf8String a, string b) so that I could slap [Obsolete] on it and generate the appropriate warning, but this had the side effect of disallowing the user to write the code if (myUtf8String == null) since the compiler couldn't figure out which overload of operator == to call. This was also one of the reasons I had opened https://github.com/dotnet/csharplang/issues/2340.

Marshaling behaviors

Like the string type, the Utf8String type shall be marshalable across p/invoke boundaries. The corresponding unmanaged type shall be LPCUTF8 (equivalent to a BYTE* pointing to null-terminated UTF-8 data) unless a different unmanaged type is specified in the p/invoke signature.

If a different [MarshalAs] representation is specified, the stub routine creates a temporary copy in the desired representation, performs the p/invoke, then destroys the temporary copy or allows the GC to reclaim the temporary copy.

class NativeMethods
{
    [DllImport]
    public static extern int MyPInvokeMethod(
        [In] Utf8String marshaledAsLPCUTF8,
        [In, MarshalAs(UnmanagedType.LPUTF8Str)] Utf8String alsoMarshaledAsLPCUTF8,
        [In, MarshalAs(UnmanagedType.LPWStr)] Utf8String marshaledAsLPCWSTR,
        [In, MarshalAs(UnmanagedType.BStr)] Utf8String marshaledAsBSTR);
}

If a Utf8String must be marshaled from native-to-managed (e.g., a reverse p/invoke takes place on a delegate which has a Utf8String parameter), the stub routine is responsible for fixing up invalid UTF-8 data before creating the Utf8String instance (or it may let the Utf8String constructor perform the fixup automatically).

Unmanaged routines must not modify the contents of any Utf8String instance marshaled across the p/invoke boundary. Utf8String instances are assumed to be immutable once created, and violating this assumption could cause undefined behaviors within the runtime.

There is no default marshaling behavior for Utf8Segment or Utf8Span since they are not guaranteed to be null-terminated. If in the future the runtime allows marshaling {ReadOnly}Span<T> across a p/invoke boundary (presumably as a non-null-terminated array equivalent), library authors may fetch the underlying ReadOnlySpan<byte> from the Utf8Segment or Utf8Span instance and directly marshal that span across the p/invoke boundary.
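
Until then, a caller can pin the underlying bytes and pass an explicit pointer/length pair; a hedged sketch follows (the native ProcessUtf8 export is hypothetical):

[DllImport("nativelib")]
private static extern unsafe int ProcessUtf8(byte* data, int length); // hypothetical native export

public static unsafe int Process(Utf8Span span)
{
    ReadOnlySpan<byte> bytes = span.Bytes;
    fixed (byte* pBytes = bytes)
    {
        // 'pBytes' will be null for an empty span, so the length is passed explicitly
        // rather than relying on a null terminator being present.
        return ProcessUtf8(pBytes, bytes.Length);
    }
}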

Automatic coercion of UTF-16 literals to UTF-8 literals

If possible, it would be nice if UTF-16 literals (not arbitrary string instances) could be automatically coerced to UTF-8 literals (via the ldstr / call routines mentioned earlier). This coercion would only be considered if attempting to leave the data as a string would have caused a compilation error. This could help eliminate some errors resulting from developers forgetting to put the u prefix in front of the string literal, and it could make the code cleaner. Some examples follow.

// String literal being assigned to a member / local of type Utf8String.
public const Utf8String MyConst = "A literal!";

public void Foo(string s);
public void Foo(Utf8String s);

public void FooCaller()
{
    // Calls Foo(string) since it's an exact match.
    Foo("A literal!");
}

public void Bar(object o);
public void Bar(Utf8String s);

public void BarCaller()
{
    // Calls Bar(object), passing in the string literal,
    // since it's the closest match.
    Bar("A literal!");
}

public void Baz(int i);
public void Baz(Utf8String s);

public void BazCaller1()
{
    // Calls Baz(Utf8String), passing in the UTF-8 literal,
    // since there's no closer match.
    Baz("A literal!");
}

public void BazCaller2(string someInput)
{
    // Compiler error. The input isn't a literal, so no auto-coercion
    // takes place. Dev should call Baz(new Utf8String(someInput)).
    Baz(someInput);
}

public void Quux<T>(ReadOnlySpan<T> value);
public void Quux(Utf8String s);

public void QuuxCaller()
{
    // Calls Quux<char>(ReadOnlySpan<char>), passing in the string literal,
    // since string satisfies the constraints.
    Quux("A literal!");
}

public void Glomp(Utf8Span value);

public void GlompCaller()
{
    // Calls Glomp(Utf8Span), passing in the UTF-8 literal, since there's
    // no closer match and Utf8String can be implicitly cast to Utf8Span.
    Glomp("A literal!");
}

UTF-8 String interpolation

The string interpolation feature is undergoing significant churn (see https://github.com/dotnet/csharplang/issues/2302). I envision that when a final design is chosen, there would be a UTF-8 counterpart for symmetry. The internal IUtf8Formattable interface as proposed above is being designed partly with this feature in mind in order to allow single-allocation Utf8String interpolation.

ustring contextual language keyword

For simplicity, we may want to consider a contextual language keyword which corresponds to the System.Utf8String type. The exact name is still up for debate, as is whether we'd want it at all, but we could consider something like the below.

Utf8String a = u"Some UTF-8 string.";

// 'ustring' and 'System.Utf8String' are aliases, as shown below.

ustring b = a;
Utf8String c = b;

The name ustring is intended to evoke "Unicode string". Another leading candidate was utf8. We may wish not to ship with this keyword support in v1 of the Utf8String feature. If we opt not to do so we should be mindful of how we might be able to add it in the future without introducing breaking changes.

An alternative design would be to use a u suffix instead of a u prefix. I'm mostly impartial to this, but there is a nice symmetry to having the characters u, $, and @ all available as prefixes on literal strings.

We could also drop the u prefix entirely and rely solely on type targeting:

ustring a = "Literal string type-targeted to UTF-8.";
object b = (ustring)"Another literal string type-targeted to UTF-8.";

This has implications for string interpolation, as it wouldn't be possible to prepend both the (ustring) coercion hint and the $ interpolation operator simultaneously.

Switching and pattern matching

If a value whose type is statically known to be Utf8String is passed to a switch statement, the corresponding case statements should allow the use of literal Utf8String values.

Utf8String value = ...;

switch (value)
{
    case u"Some literal": /* ... */
    case u"Some other literal": /* ... */
    case "Yet another literal": /* target typing also works */
}

Since pattern matching operates on input values of arbitrary types, I'm pessimistic that pattern matching will be able to take advantage of target typing. This may instead require that developers specify the u prefix on Utf8String literals if they wish such values to participate in pattern matching.

A brief interlude on indexers and IndexOf

Utf8String and related types do not expose an elemental indexer (this[int]) or a typical IndexOf method because they're trying to rid the developer of the notion that bytewise indices into UTF-8 buffers can be treated equivalently as charwise indices into UTF-16 buffers. Consider the naïve implementation of a typical "string split" routine as presented below.

void SplitString(string source, string target, StringComparison comparisonType, out string beforeTarget, out string afterTarget)
{
    // Locates 'target' within 'source', splits on it, then populates the two out parameters.
    // ** NOTE ** This code has a bug, as will be explained in detail below.

    int index = source.IndexOf(target, comparisonType);
    if (index < 0) { throw new Exception("Target string not found!"); }

    beforeTarget = source.Substring(0, index);
    afterTarget = source.Substring(index + target.Length, source.Length - index - target.Length);
}

One subtlety of the above code is that when culture-sensitive or case-insensitive comparers are used (such as OrdinalIgnoreCase in the above example), the target string doesn't have to be an exact char-for-char match of a sequence present in the source string. For example, consider the UTF-16 string "GREEN" ([ 0047 0052 0045 0045 004E ]). Performing an OrdinalIgnoreCase search for the substring "e" ([ 0065 ]) will result in a match, as 'e' (U+0065) and 'E' (U+0045) compare as equal under an OrdinalIgnoreCase comparer.

As another example, consider the UTF-16 string "preſs" ([ 0070 0072 0065 017F 0073 ]), whose fourth character is the Latin long s 'ſ' (U+017F). Performing an OrdinalIgnoreCase search for the substring "S" ([ 0053 ]) will result in a match, as 'ſ' (U+017F) and 'S' (U+0053) compare as equal under an OrdinalIgnoreCase comparer.

There are also scenarios where the length of the match within the search string might not be equal to the length of the target string. Consider the UTF-16 string "encyclopædia" ([ 0065 006E 0063 0079 0063 006C 006F 0070 00E6 0064 0069 0061 ]), whose ninth character is the ligature 'æ' (U+00E6). Performing an InvariantCultureIgnoreCase search for the substring "ae" ([ 0061 0065 ]) will result in a match at index 8, as "æ" ([ 00E6 ]) and "ae" ([ 0061 0065 ]) compare as equal under an InvariantCultureIgnoreCase comparer.

This result is interesting and should give us pause. Since "æ".Length == 1 and "ae".Length == 2, the arithmetic at the end of the method will actually result in the wrong substrings being returned to the caller.

beforeTarget = source.Substring(0, 8 /* index */); // = "encyclop"
afterTarget = source.Substring(
    10 /* index + target.Length */,
    2 /* source.Length - index - target.Length */); // = "ia" (expected "dia"!)

Due to the nature of UTF-16 (used by string), when performing an Ordinal or an OrdinalIgnoreCase comparison, the length of the matched substring within the source will always have a char count equal to target.Length. The length mismatch as demonstrated by "encyclopædia" above can only happen with a culture-sensitive comparer or any of the InvariantCulture comparers.

However, in UTF-8, these same guarantees do not hold. Under UTF-8, only when performing an Ordinal comparison is there a guarantee that the length of the matched substring within the source will have a byte count equal to the target. All other comparers - including OrdinalIgnoreCase - have the behavior that the byte length of the matched substring can change (either shrink or grow) when compared to the byte length of the target string.

As an example of this, consider the string "preſs" from earlier, but this time in its UTF-8 representation ([ 70 72 65 C5 BF 73 ]). Performing an OrdinalIgnoreCase search for the target UTF-8 string "S" ([ 53 ]) will match on the ([ C5 BF ]) portion of the source string. (This is the UTF-8 representation of the letter 'ſ'.) To properly split the source string along this search target, the caller needs to know not only where the match was, but also how long the match was within the original source string.

This fundamental problem is why Utf8String and related types don't expose a standard IndexOf function or a standard this[int] indexer. It's still possible to index directly into the underlying byte buffer by using an API which projects the data as a ROS<byte>. But for splitting operations, these types instead offer a simpler API that performs the split on the caller's behalf, handling the length adjustments appropriately. For callers who want the equivalent of IndexOf, the types instead provide TryFind APIs that return a Range instead of a typical integral index value. This Range represents the matching substring within the original source string, and new C# language features make it easy to take this result and use it to create slices of the original source input string.
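
For contrast with the buggy SplitString shown earlier, here is a hedged sketch of the same routine written against the proposed TryFind API; because the returned Range covers the actual matched bytes, the length arithmetic that caused the bug disappears:

void SplitUtf8String(Utf8String source, Utf8String target, StringComparison comparisonType,
                     out Utf8String beforeTarget, out Utf8String afterTarget)
{
    // Locates 'target' within 'source' and splits on it. The matched range reflects the
    // actual match within 'source', even when its byte length differs from 'target'.
    if (!source.TryFind(target, comparisonType, out Range matchedRange))
    {
        throw new Exception("Target string not found!");
    }

    beforeTarget = source[..matchedRange.Start];
    afterTarget = source[matchedRange.End..];
}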

This also addresses feedback that was given in a previous prototype: users weren't sure how to interpret the result of the IndexOf method. (Is it a byte count? Is it a char count? Is it something else?) Similarly, there was confusion as to what parameters should be passed to a this[int] indexer or a Substring(int, int) method. By having the APIs promote use of Range and related C# language features, this confusion should subside. Power developers can inspect the Range instance directly to extract raw byte offsets if needed, but most devs shouldn't need to query such information.
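
For completeness, a sketch of how a power developer might recover raw byte offsets from the returned Range, using the AsBytes projection proposed earlier:

Utf8String text = ...;

if (text.TryFind(u", ", out Range match))
{
    // 'byteOffset' / 'byteLength' are offsets into the underlying UTF-8 buffer.
    (int byteOffset, int byteLength) = match.GetOffsetAndLength(text.AsBytes().Length);
}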

API usage samples

Scenario: Split an incoming string of the form "LastName, FirstName" into individual FirstName and LastName components.

// Using Utf8String input and producing Utf8String instances
void SplitSample(ustring input)
{
    // Method 1: Use the SplitOn API to find the ',' char, then trim manually.

    (ustring lastName, ustring firstName) = input.SplitOn(',');
    if (firstName is null) { /* ERROR: no ',' detected in input */ }

    lastName = lastName.Trim();
    firstName = firstName.Trim();

    // Method 2: Use the SplitOn API to find the ", " target string, assuming no trim needed.

    (ustring lastName, ustring firstName) = input.Split(u", ");
    if (firstName is null) { /* ERROR: no ", " detected in input */ }
}

// Using Utf8Span input and producing Utf8Span instances
void SplitSample(Utf8Span input)
{
    // Method 1: Use the SplitOn API to find the ',' char, then trim manually.

    (Utf8Span lastName, Utf8Span firstName) = input.SplitOn(',');
    lastName = lastName.Trim();
    firstName = firstName.Trim();
    if (firstName.IsEmpty) { /* ERROR: trailing ',', or no ',' detected in input */ }

    // Method 2: Use the SplitOn API to find the ", " target string, assuming no trim needed.

    (Utf8Span lastName, Utf8Span firstName) = input.Split(", ");
    if (firstName.IsEmpty) { /* ERROR: trailing ", ", or no ", " detected in input */ }
}

Additionally, the SplitResult struct returned by Utf8Span.Split implements both a standard IEnumerable<T> pattern and the C# deconstruct pattern, which allows it to be used separately from enumeration for simple cases where only a small handful of values are returned.

Utf8Span str = ...;

// The result of Utf8Span.Split can be used in an enumerator

foreach (Utf8Span substr in str.Split(','))
{
    /* operate on substr */
}

// Or it can be used in tuple deconstruction
// (See docs for description of behavior for each arity.)

(Utf8Span before, Utf8Span after) = str.Split(',');
(Utf8Span part1, Utf8Span part2, Utf8Span part3, ...) = str.Split(',');

Scenario: Split a comma-delimited input into substrings, then perform an operation with each substring.

// Using Utf8String input and producing Utf8String instances
// The Utf8Span code would look identical (substitute 'Utf8Span' for 'ustring')

void SplitSample(ustring input)
{
    while (input.Length > 0)
    {
        // 'TryFind' is the 'IndexOf' equivalent. It returns a Range instead
        // of an integer index because there's no this[int] indexer on Utf8String.

        if (!input.TryFind(',', out Range matchedRange))
        {
            // No comma was found in the remaining (non-empty) portion of the
            // input string. Process that remainder, then finish.

            ProcessValue(input);
            break;
        }

        // We found a comma! Substring and process.
        // The 'matchedRange' local contains the range for the ',' that we found.

        ProcessValue(input[..matchedRange.Start]); // fetch segment to the left of the comma, then process it
        input = input[matchedRange.End..]; // set 'input' to the remainder of the input string and loop
    }

    // Could also have an IEnumerable<ustring>-returning version if we wanted, I suppose.
}

Miscellaneous topics and open questions

What about comparing UTF-16 and UTF-8 data?

Currently there is a set of APIs Utf8String.AreEquivalent which will decode sequences of UTF-16 and UTF-8 data and compare them for ordinal equality. The general code pattern is below.

ustring a = ...;
string b = ...;

// The below line fails to compile because there's no operator==(Utf8String, string) defined.

bool result = (a == b);

// The below line is probably what the developer intended to write.

bool result = ustring.AreEquivalent(a, b);

// The below line should compile since literal strings can be type targeted to Utf8String.

bool result = (a == "Hello!");

Do we want to add an operator==(Utf8String, string) overload which would allow easy == comparison of UTF-8 and UTF-16 data? There are three main downsides to this which caused me to vote no, but I'm open to reconsideration.

  1. The compiler would need to special-case if (myUtf8String == null), which would now be ambiguous between the two overloads. (If the compiler is already special-casing null checks, this is a non-issue.)

  2. The performance of UTF-16 to UTF-8 comparison is much worse than the performance of UTF-16 to UTF-16 (or UTF-8 to UTF-8) comparison. When the representation is the same on both sides, certain shortcuts can be implemented to avoid the O(n) comparison, and even the O(n) comparison itself can be implemented as a simple memcmp operation. When the representations are heterogeneous, the opportunity for taking shortcuts is much more restricted, and the O(n) comparison itself has a higher constant factor. Developers might not expect such a performance characteristic from an equality operator.

  3. Comparing a Utf8String against a literal string would no longer go through the fast path, as target typing would cause the compiler to emit a call to operator==(Utf8String, string) instead of operator==(Utf8String, Utf8String). The comparison itself would then have the lower performance described by bullet (2) above.

One potential upside to having such a comparison is that it would prevent developers from using the antipattern if (myUtf8String.ToString() == someString), which would result in unnecessary allocations. If we are concerned about this antipattern one way to address it would be through a Code Analyzer.

What if somebody passes invalid data to the "skip validation" factories?

When calling the "unsafe" APIs, callers are fully responsible for ensuring that the invariants are maintained. Our debug builds could double-check some of these invariants (such as the initial Utf8String creation consisting only of well-formed data). We could also consider allowing applications to opt-in to these checks at runtime by enabling an MDA or other diagnostic facility. But as a guiding principle, when "unsafe" APIs are called the Framework should trust the developer and should have as little overhead as possible.

Consider consolidating the unsafe factory methods under a single unsafe type.

This would prevent pollution of the type's normal API surface and could help write tools which audit use of a single "unsafe" type.

Some of the methods may need to be extension methods instead of normal static factories. (Example: Unsafe slicing routines, should we choose to expose them.)
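A rough sketch of what such a consolidated type could look like, following the MemoryMarshal precedent (the Utf8StringMarshal name and exact member list here are hypothetical, not part of the proposal):

public static class Utf8StringMarshal
{
    // Creation without validation; mirrors the factory currently exposed on Utf8String itself.
    public static Utf8String CreateWithoutValidation(ReadOnlySpan<byte> value);

    // Unsafe slicing as an extension method, since it operates on an existing instance.
    public static Utf8String SubstringWithoutValidation(this Utf8String value, int startIndex, int length);
}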

Potential APIs to enlighten

System namespace

Include Utf8String / Utf8Span overloads on Console.WriteLine. Additionally, perhaps introduce an API Console.ReadLineUtf8.

System.Data.* namespace

Include generalized support for serializing Utf8String properties as a primitive with appropriate mapping to nchar or nvarchar.

System.Diagnostics.* namespace

Enlighten EventSource so that a caller can write Utf8String / Utf8Span instances cheaply. Additionally, some types like ActivitySpanId already have ROS<byte> ctors; overloads can be introduced here.

System.Globalization.* namespace

The CompareInfo type has many members which operate on string instances. These should be spanified foremost, and Utf8String / Utf8Span overloads should be added. Good candidates are Compare, GetHashCode, IndexOf, IsPrefix, and IsSuffix.

The TextInfo type has members which should be treated similarly. ToLower and ToUpper are good candidates. Can we get away without enlightening ToTitleCase?

System.IO.* namespace

BinaryReader and BinaryWriter should have overloads which operate on Utf8String and Utf8Span. These overloads could potentially be cheaper than the normal string / ROS<char> based overloads, since the reader / writer instances may in fact be backed by UTF-8 under the covers. If this is the case then writing is simple projection, and reading is validation (faster than transcoding).

File: WriteAllLines, WriteAllText, AppendAllText, etc. are good candidates for overloads to be added. On the read side, there's ReadAllTextUtf8 and ReadAllLinesUtf8.

TextReader.ReadLine and TextWriter.Write are also good candidates to overload. This follows the same general premise as BinaryReader and BinaryWriter as mentioned above.
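A hedged sketch of how this could look from the caller's perspective once such overloads exist (ReadAllTextUtf8 and the Utf8String-accepting Write below are the proposed overloads named above, not shipping APIs):

// Read: validation only, no transcoding, assuming the file is UTF-8 on disk.
Utf8String contents = File.ReadAllTextUtf8("settings.json");

// Write: if the writer is UTF-8-backed, this is a straight copy rather than a transcode.
using (TextWriter writer = File.CreateText("copy.json"))
{
    writer.Write(contents); // proposed Utf8String overload
}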

Should we also enlighten SerialPort or GPIO APIs? I'm not sure if UTF-8 is a bottleneck here.

System.Net.Http.* namespace

Introduce Utf8StringContent, which automatically sets the charset header. This type already exists in the System.Utf8String.Experimental package.
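A minimal usage sketch, assuming Utf8StringContent exposes a constructor taking the payload and an optional media type, similar in shape to StringContent (check the experimental package for the exact signature); httpClient is assumed to be an existing HttpClient instance:

Utf8String json = ...; // UTF-8 payload obtained elsewhere
var content = new Utf8StringContent(json, "application/json"); // sets charset=utf-8 automatically
HttpResponseMessage response = await httpClient.PostAsync("https://example.com/api", content);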

System.Text.* namespace

UTF8Encoding: Overload candidates are GetChars, GetString, and GetCharCount (of Utf8String or Utf8Span). These would be able to skip validation after transcoding as long as the developer hasn't subclassed the type.

Rune: Add ToUtf8String API. Add IsDefined API to query the OS's NLS tables (could help with databases and other components that need to adhere to strict case / comparison processing standards).

TextEncoder: Add Encode(Utf8String): Utf8String and FindFirstIndexToEncode(Utf8Span): Index. This is useful for HTML-escaping, JSON-escaping, and related operations.

Utf8JsonReader: Add read APIs (GetUtf8String) and overloads to both the ctor and ValueTextEquals.

JsonEncodedText: Add an EncodedUtf8String property.

Regex is a bit of a special case because there has been discussion about redoing the regex stack all-up. If we did proceed with redoing the stack, then it would make sense to add first-class support for UTF-8 here.

benaadams commented 6 years ago

Even though byte / charu8 is the underlying elemental type of Utf8String, none of the APIs outside of the constructor actually take those types as input. The input parameter types to IndexOf and similar APIs is UnicodeScalar, which represents an arbitrary Unicode scalar value and can be 1 - 4 code units wide when transcoded to UTF-8.

Does that mean

var ss = s.Substring(s.IndexOf(','));

Would be a double traversal? i.e. any use of IndexOf would lead to a double traversal for its return value to be meaningful?

GrabYourPitchforks commented 6 years ago

Yes, I know this is dated from the future! :) It's our agenda and review doc for the in-person meeting before it goes to wider community review. Not everything is captured here, especially things related to runtime interaction.

GrabYourPitchforks commented 6 years ago

@benaadams No, it's a single traversal, just like if s were typed as System.String in your example. The IndexOf is O(n) up to the first found ',' character (using a vectorized search if available), and the Substring is O(n) from the indexed position to the end of the string. So the total number of bytes observed is index /* IndexOf */ + (Length - index) /* memcpy */ = Length = single traversal.

benaadams commented 6 years ago

But if IndexOf is returning the number of UnicodeScalars which can be 1-4 bytes; passing that int return value into Substring doesn't it then have to rescan from the start of the Utf8String to find that start position? i.e. IndexOf isn't returning (int scalarPosition, int byteOffset)

GrabYourPitchforks commented 6 years ago

APIs that operate on indices (like IndexOf, Substring, etc.) go by code unit count, not scalar count.

(I get that it might be confusing since enumeration of Utf8String instances goes by scalar, not by code unit, so now we have a disparity on the type. That's why I'd proposed as an open question that maybe we kill the enumerator entirely and just have Bytes and Scalars properties, which removes the disparity.)
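To make that concrete (a small sketch of the proposed behavior, not shipping code):

ustring s = ...;
int idx = s.IndexOf(','); // 'idx' is a byte (code unit) offset, not a scalar offset
byte b = s.Bytes[idx];    // so it indexes directly into Bytes
ustring left = s[..idx];  // and slicing uses the same byte offset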

stephentoub commented 6 years ago

Thanks, Levi. Some questions/comments:

  1. Should be straightforward and O(1) to create a Utf8String instance from an existing String / ReadOnlySpan<char> or from a ReadOnlySpan<byte> coming in from the wire.

I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).

Must allow querying total length (in code units) as O(1) operation.

I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).

The five requirements below are drawn from String

This is already making some trade-offs. If I've read the data off the wire, I already have it in some memory, which I can then process as a ReadOnlySpan<byte>. To use it as a Utf8String, I then need to allocate and copy. So we're trading off usability for perf. I'm a bit surprised that's the right trade-off for the target audience, but the doc also doesn't specify who the target developers are, provide example scenarios for where/how this will be used, etc.

public ReadOnlySpan<byte> Bytes { get; } public ReadOnlyMemory<byte> AsMemory();

Why is the to-memory conversion called AsMemory but the to-span conversion called Bytes?

public bool Contains(UnicodeScalar value);

I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept string or char. For char, even if you add an implicit cast from char to UnicodeScalar, we just had that discussion about not relying on implicit casts from a usability perspective in cases like this. And for string, with the currently defined methods someone would need to actually convert a string to a Utf8String, which is not cheap, in order to call these methods.

public int IndexOfAny(ReadOnlySpan<UnicodeScalar> value); public int LastIndexOfAny(ReadOnlySpan<UnicodeScalar> value);

string.{Last}IndexOfAny calls this argument anyOf.

public Utf8String ToLowerInvariant(); public Utf8String ToUpperInvariant();

Presumably Utf8String will have culture support and will also have ToLower/Upper methods that are culture-sensitive?

public int IndexOf(UnicodeScalar value); public int IndexOf(UnicodeScalar value, int startIndex);

What does the return value mean? Is it the byte offset of the UnicodeScalar, or its index counted in UnicodeScalars? Similarly for startIndex. Assuming it's counted in UnicodeScalars, if I want to get Bytes and index into it starting at this UnicodeScalar, how do I convert that UnicodeScalar offset to a byte offset?

Once culture support comes online, we should add CompareTo and related APIs.

From a design discussion perspective, I would think we'd want this outline to represent the ultimate shape we want, and the implementations can throw NotImplementedException until the functionality is available (before it ships).

public readonly struct UnicodeScalar

What's the plan for integration of this with the existing unicode support in .NET? For example, how do I get a System.Globalization.UnicodeCategory for one of these?

public readonly struct Utf8StringSegment

Similar questions related to the APIs on Utf8String.

And, presumably we wouldn't define any APIs (outside of Utf8String/Utf8StringSegment) that accept a Utf8String, instead accepting a Utf8StringSegment, since the former can cheaply convert to the latter but not vice versa?

For me, it also raises the question of why we need both. If we're going to have Utf8StringSegment, presumably that becomes the thing that most APIs would be written in terms of, because it can cheaply represent both the whole and slices. And once you have that, which effectively has the same surface area as Utf8String, why not just make it Utf8String, still as a struct, and get rid of the class equivalent and the duplication? It can then be constructed from a byte[] or a ReadOnlyMemory<byte> without any extra allocation or copying, can be cheaply sliced, etc. Utf8StringSegment (when named Utf8String) is then essentially a nice wrapper / package for a lot of the functionality that exists in System.Memory as static methods.

n.b. This type is not pinnable because we cannot guarantee null termination.

I don't see why we'd place this restriction. Arrays don't guarantee null termination but are pinnable. Lots of types don't guarantee null termination but are pinnable.

// Pass a Utf8String instance across a p/invoke boundary

I would hope that before or as part of enabling this, we add support for Span<T> and ReadOnlySpan<T>. We still have debt to be paid down there and should address that before adding this as well.

Culture-aware processing code is currently implemented in terms of UTF-16 across all platforms. We don't expect this to change appreciably in the near future, which means that any operations which use culture data will almost certainly require two transcoding steps, making them expensive for UTF-8 data.

I didn't understand this part. Don't both Windows and ICU provide UTF8-based support in addition to the UTF16-based support that's currently being used?

Other stuff

Equivalents for String.Format?

whoisj commented 6 years ago

Don't both Windows and ICU provide UTF8-based support in addition to the UTF16-based support that's currently being used?

Not that I know of. Windows is, for very good legacy reasons, very UTF-16/UCS-2 focused.

whoisj commented 6 years ago

What about Equals(string other) or CompareTo(string other) ?

Seems like not implementing this would make it difficult for existing ecosystems to adopt this type.

KrzysztofCwalina commented 6 years ago
  1. The proposal lists servers and IoT as main scenarios. I think we need to add ML.NET. They explicitly requested UTF8 string support.
  2. The ML.NET team requires allocation free slicing. I am not sure if they need the slices to be heap-friendly or not. Something you should research.
  3. It would be good to drill into reasons for each of the pri 0 requirements. They all start with "must" and some are very limiting.
  4. I think the requirements should include slicing (even if we decide that slices are a different type and/or not heapable). Non-allocating slicing is a must have for high performance string manipulation.
  5. As a validation exercise, it would be good to rewrite ASP.NET platform server using this string (the code now uses custom AsciiString) and see if we can keep the same performance.
  6. EndsWith (and all similar operations) should have overloads that take ReadOnlySpan<some_type>, and C# should support conveniently creating literals of this span on the stack, e.g. (pseudocode): myString.EndsWith(stackalloc u8"World!"). Currently all the APIs take either Utf8String (which allocates) or a scalar (which is a single "char", i.e. not super useful).
  7. In the language support section you state that a literal assignment to Utf8String will result in conversion. Why? We should do target typing in the case you outline and avoid any conversions at runtime.
  8. Nit: I find the "u8" prefix super ugly.
  9. We use ReadOnlySpan<Char> as a representation of a slice of UTF16 string. You are proposing we use Utf8StringSegment. Is the discrepancy ok?
  10. Re Open Question #1: I don't think doing LINQ over scalars is a good practice.
nil4 commented 6 years ago

The signature public Utf8String[] Split(Utf8String separator) implies a lot of allocations and memory copies.

First, an array must be allocated for the return value.

Then, each element in the array must be a copy of each match, into a newly-allocated buffer, as Utf8String mandates null-termination but the input will not have nulls after each separator.

If I understand this correctly, except for the trivial case when the separator is not present at all, this signature would basically require copying the whole input string.

Would it make sense to return a custom enumerator of Utf8StringSegment instead, similar to SplitByScalarEnumerator or SplitBySubstringEnumerator?
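For illustration, a minimal sketch of the enumerator shape being suggested, written against ReadOnlySpan<byte> (since Utf8StringSegment is itself still a proposal) and simplified to a single-byte separator; the Utf8SplitEnumerator name is hypothetical:

public ref struct Utf8SplitEnumerator
{
    private ReadOnlySpan<byte> _remaining;
    private readonly byte _separator;
    private bool _done;

    public Utf8SplitEnumerator(ReadOnlySpan<byte> input, byte separator)
    {
        _remaining = input;
        _separator = separator;
        _done = false;
        Current = default;
    }

    public ReadOnlySpan<byte> Current { get; private set; }

    public Utf8SplitEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_done)
            return false;

        int idx = _remaining.IndexOf(_separator);
        if (idx < 0)
        {
            Current = _remaining; // last segment (possibly empty)
            _done = true;
        }
        else
        {
            Current = _remaining.Slice(0, idx);
            _remaining = _remaining.Slice(idx + 1);
        }
        return true;
    }
}

Callers could then write foreach (ReadOnlySpan<byte> segment in new Utf8SplitEnumerator(utf8Bytes, (byte)',')) { ... } without any per-segment allocation.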

svick commented 6 years ago

I think the biggest issue with the proposed API is confusion between UTF8 code units and Unicode scalar values, especially when it comes to lengths and indexes. Would it make sense to alleviate that confusion by more explicit names, like ByteLength instead of Length or startByteIndex instead of startIndex?


[EditorBrowsable(EditorBrowsableState.Never)]
public static Utf8String DangerousCreateWithoutValidation(ReadOnlySpan<byte> value);

Is EditorBrowsableState.Never the right way to hide dangerous methods? I don't like it, because it means such methods are hard to use, when I think the actual goal is to limit their discoverability, not their usability. Wouldn't putting them into a separate type be a better solution, similar to how dangerous Span APIs were put into the MemoryMarshal type?


One potential workaround is to make the JIT recognize a ldstr opcode immediately followed by a newobj Utf8String(string) opcode. This pattern can be special-cased to behave similarly to the standalone ldstr today, where the address of the literal String (or Utf8String) object is known at JIT time and a single mov reg, imm instruction is generated.

Would this mean that if I write new Utf8String("foo"), which would produce the same sequence of opcodes, it might not actually create a new instance of Utf8String? I think that would be very confusing, since it's not how any other type behaves, not even string. It would also be a violation of the C# specification, which says that for a class, new has to allocate a new instance:

The run-time processing of an object_creation_expression of the form new T(A), […] consists of the following steps:

  • If T is a class_type:
    • A new instance of class T is allocated. […]

What is the relationship between UnicodeScalar and Rune (https://github.com/dotnet/corefx/issues/24093)?


We can also consider introducing a type StringSegment which is the String-backed analog of this type.

There was an issue about creating StringSegment in corefx, which was closed a month ago, with the justification that ReadOnlyMemory<char> and ReadOnlySpan<char> are good enough: https://github.com/dotnet/corefx/issues/20378. Does that mean it's now on the table again?


The code comments on the StringSegment type go into much more detail on the benefits of this type when compared to ReadOnlyMemory<T> / ReadOnlySpan<T>.

Where can I find those comments? I didn't find the StringSegment type in any dotnet repo.


More generally, with this proposal we will have: string, char[], Span<char>, ReadOnlySpan<char>, Memory<char>, ReadOnlyMemory<char>, Utf8String, byte[], Span<byte>, ReadOnlySpan<byte>, Memory<byte> and ReadOnlyMemory<byte>. Do we really need Utf8StringSegment as yet another string-like type?

GrabYourPitchforks commented 6 years ago

I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).

Yes, this is a typo.

I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).

This is possible via Utf8String.Length or Utf8String.Bytes.Length, both of which return the byte count.

I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept string or char.

I struggled with this, and the reason I ultimately decided not to include it is because I think the majority of calls to these methods involve searching for literal substrings, and I'd rather rely on a one-time compiler conversion of the search target from UTF-16 to UTF-8 than a constantly-reoccurring runtime conversion from UTF-16 to UTF-8. I'm concerned that the presence of these overloads would encourage callers to inadvertently use a slow path that requires transcoding. We can go over this in Friday's discussion.

What's the plan for integration of [UnicodeScalar] with the existing unicode support in .NET?

I had planned APIs like UnicodeScalar.GetUnicodeCategory() in a future release, but we can go over them in Friday's meeting.

We use ReadOnlySpan<Char> as a representation of a slice of UTF16 string. You are proposing we use Utf8StringSegment. Is the discrepancy ok?

Check the comment at the top of https://github.com/dotnet/corefxlab/blob/utf8string/src/System.Text.Utf8/System/Text/StringSegment.cs. It explains in detail why I think this type provides significant benefits that we can't get simply from using ReadOnlySpan<char>.

It would also be a violation of the C# specification, which says that for a class, new has to allocate a new instance.

We do violate the specification in a few cases. For instance, new String(new char[0]) returns String.Empty. Not a new string that happens to be equivalent to String.Empty - the actual String.Empty instance itself. Similarly, the Roslyn compiler can sometimes optimize new statements away. See for example https://github.com/dotnet/roslyn/commit/13adbac980ba771d8128449476b6b00021cde203.
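This can be observed directly (a small illustration; the special case applies to the zero-length array only):

// Prints True on current runtimes: the String(char[]) constructor special-cases
// the empty array and returns the interned String.Empty instance.
Console.WriteLine(object.ReferenceEquals(new string(new char[0]), string.Empty));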

What is the relationship between UnicodeScalar and Rune (dotnet/corefx#24093)?

UnicodeScalar is validated: it is contractually guaranteed to represent a value in the range U+0000..U+D7FF or U+E000..U+10FFFF. Scalars have unique transcodings to UTF-8 and UTF-16 code unit sequences. Such transcoding operations are guaranteed always to succeed. Rune (which is not in this proposal) wraps a 32-bit integer which is ostensibly a Unicode code point value but which is not required to be valid. This means that developers consuming invalid Rune instances must be prepared for some operations on those instances to fail.
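The validity condition described above is simple to state in code (a sketch of the check itself, not of the proposed API):

// A Unicode scalar value is any code point in U+0000..U+10FFFF,
// excluding the surrogate range U+D800..U+DFFF.
static bool IsValidUnicodeScalar(int value) =>
    (uint)value <= 0x10FFFF && (value < 0xD800 || value > 0xDFFF);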

svick commented 6 years ago

@GrabYourPitchforks

For instance, new String(new char[0]) returns String.Empty. Not a new string that happens to be equivalent to String.Empty - the actual String.Empty instance itself.

I didn't know that, interesting.

Similarly, the Roslyn compiler can sometimes optimize new statements away. See for example https://github.com/dotnet/roslyn/commit/13adbac980ba771d8128449476b6b00021cde203.

As far as I can tell, that commit is about Span<T>, which is a struct, so it doesn't violate the C# specification.

UnicodeScalar is validated: it is contractually guaranteed to represent a value in the range U+0000..U+D7FF or U+E000..U+10FFFF. […] Rune (which is not in this proposal) wraps a 32-bit integer which is ostensibly a Unicode code point value but which is not required to be valid.

That doesn't sound like a good enough reason to have two different types to me, especially since you can create an invalid UnicodeScalar. Maybe the two groups could work together to create a single type for representing Unicode scalar values?

GrabYourPitchforks commented 6 years ago

As far as I can tell, that commit is about Span<T>, which is a struct, so it doesn't violate the C# specification.

new byte[] { ... } isn't a struct type. :)

That doesn't sound like a good enough reason to have two different types to me

This proposal assumes that Rune is never committed. So there's only one type in the end.

whoisj commented 6 years ago

I see that it's already committed, but can I just go on record as saying that UnicodeScalar is just a plain terrible name? It really is. It's long, it's generic enough to mean nearly nothing, and it is not even a term the Unicode group uses. I had the same complaints about Rune (with the exception that Rune is at least short).

This type really ought to be named Character or CodePoint.

I'm mostly OK with the rest of it, though it would be nice if .Split didn't have to allocate quite as much. The underlying data is already read-only - can't Span<T> be used here or something?

svick commented 6 years ago

@whoisj

I see that it's already committed, but can I just go on record as saying that UnicodeScalar is just a plain terrible name? It really is. It's long, it's generic enough to mean nearly nothing, and it is not even a term the Unicode group uses.

"Unicode Scalar Value" is the term Unicode uses for this.

This type really ought to be named Character or CodePoint.

"Character" doesn't really mean anything (Unicode lists 4 different meanings) and would be easily confused with System.Char/char.

"Code Point" is closer, but that term includes invalid Unicode Scalar Values (the range from U+D800 to U+DFFF).

benaadams commented 6 years ago

It's long, ...

The question is, what would the C# keyword be? (Int32 vs int); something like uchar is short 😉 or nchar to match databases

whoisj commented 6 years ago

The question is, what would the C# keyword be? (Int32 vs int); something like uchar is short 😉 or nchar to match databases

This.

Will there be a language word for the type? If there is, you can call the type ThatUnicodeValueWhichNobodyCouldAgreeOnAGoodNameForSoThisIsIt for all I care. I vote for c8 but I also like Rust. Keeping C# in mind, uchar seems like the no-brainer to me.

@svick yeah, I know that "character" is nearly meaningless, hence my suggesting it. I prefer "code point" because how on Earth are you going to prevent me from writing invalid values to a UnicodeScalar's memory? Preventing unsafe is a recipe for a performance disaster; and making unsafe (the real meaning of the word) assumptions about what values a block of memory can contain will lead to fragile and exploitable software design.

GrabYourPitchforks commented 6 years ago

how on Earth are you going to prevent me from writing invalid values to a UnicodeScalar's memory?

Nobody's stopping you. In fact, there's a public static factory that skips validation and allows you to create such an invalid value. But if you do this you're now violating the contractual guarantees offered by the type, so I'd recommend not doing this. :)

To be clear, creating an invalid UnicodeScalar won't AV the process or anything quite so dire. But it could make the APIs behave in very strange and unexpected manners, leading to errors on the consumption side. For example, UnicodeScalar.Utf8SequenceLength could return -17 if constructed from invalid input. Such are the consequences of violating invariants.

Unlike the UnicodeScalar type, the Utf8String type specifically does not offer a contractual guarantee that instances of the type contain only well-formed UTF-8 sequences.

whoisj commented 6 years ago

In fact, there's a public static factory that skips validation and allows you to create such an invalid value.

Sure, great, but a lot of the data being read into these structures will be coming from external sources. Very happy to hear that there are no validation steps being taken as the data is read in (because it would be horribly expensive), but still very concerned about:

But if you do this you're now violating the contractual guarantees offered by the type

BUT there is no guarantee - you've said so in your previous statement. There's an assumption, but no guarantee; so let's be careful how we describe this.

GrabYourPitchforks commented 6 years ago

The Utf8String and UnicodeScalar types make different contractual guarantees. I'll try to clarify them.

The Utf8String type encourages but does not require the caller to provide it a string consisting of only valid UTF-8 sequences. All APIs hanging off it have well-defined behaviors even in the face of invalid input. For example, enumerating scalars over an ill-formed Utf8String instance will return U+FFFD when an invalid subsequence is encountered. (Not just that, but the number of bytes we skip in the face of an invalid subsequence is also well-defined and predictable.) This extends to ToUpperInvariant() / ToLowerInvariant() and other manipulation APIs. Their behavior is well-defined even in the face of invalid input.

Exception: If you construct a Utf8String instance and use unsafe code or private reflection to manipulate its data after it has been constructed, the APIs have undefined behavior.
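As an illustration of the well-defined replacement behavior described above, here is a sketch using the System.Text.Rune decoding API purely for demonstration (assuming a runtime where that API is available); the proposal's own enumerator would surface the same U+FFFD substitution:

using System;
using System.Buffers;
using System.Text;

ReadOnlySpan<byte> input = new byte[] { 0x61, 0xC0, 0xAF, 0x62 }; // 'a', an ill-formed sequence, 'b'
while (!input.IsEmpty)
{
    OperationStatus status = Rune.DecodeFromUtf8(input, out Rune rune, out int consumed);
    // On invalid data 'rune' is U+FFFD and 'consumed' reports how many bytes were
    // skipped, so the loop always makes forward progress.
    Console.WriteLine($"{status}: U+{rune.Value:X4} ({consumed} byte(s) consumed)");
    input = input.Slice(consumed);
}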

The UnicodeScalar type requires construction from a Unicode scalar value. The API behavior is only well-defined when the instance itself is well-formed. If the caller knows ahead of time that the value it's providing is valid, it can call the "skip validation" factory method. If the instance members off a UnicodeScalar instance misbehave, it means that the caller who originally constructed it violated an invariant at construction time.

The reason for the difference is that it's going to be common to construct a Utf8String instance from some unknown data coming in over i/o. It's not common to construct a UnicodeScalar instance from arbitrary data. Instances of this type are generally constructed from enumerating over UTF-8 / UTF-16 data, and significant bit twiddling needs to happen during enumeration anyway in order to transcode the original data stream into a proper scalar value. Detection of invalid subsequences would necessarily need to occur during enumeration, which means the caller already has the responsibility of fixing up invalid values. The "skip validation" factory is simply a convenience for callers who have already performed this fixup step to avoid the additional validation logic in hot code paths.

So when I use the term "contractual guarantee", it's really shorthand for "This API behaves as expected as long as the caller didn't do anything untoward while constructing the instance. If the API misbehaves, take it up with whoever constructed this instance, as they expressly ignored the overloads that tried to save them from themselves and went straight for the 'I know what I'm doing' APIs."

GrabYourPitchforks commented 6 years ago

FWIW, the reason for this design is that it means that consumers of these types don't have to worry about any of this. Just call the APIs like normal and trust that they'll give you sane values. If you take UnicodeScalar as a parameter to your method, you don't need to perform an IsValid check on it before you use it. Rely on the type system's enforcement to have prohibited the caller from even constructing a bad instance in the first place. (Modulo the caller doing something explicitly marked as dangerous, of course.)

This philosophy is different from the Rune proposal, where if you take a Rune as a parameter into your method you need to perform an IsValid check as part of your normal parameter validation logic since there's otherwise no guarantee that the type was constructed correctly.

whoisj commented 6 years ago

I suppose those are safe-enough trade-offs. Still, too bad the name has to be so unwieldy. :man_shrugging:

GrabYourPitchforks commented 6 years ago

The name doesn't have to be unwieldy. If there's consensus that it should be named Rune or similar, I'll relent on the naming. :)

KrzysztofCwalina commented 6 years ago

We should not call it a "Rune" if it's not a representation for the Unicode Code Point, i.e. let's not hijack a good unambiguous term and use it for something else.

Tornhoof commented 6 years ago

ᚺᛖᛚᛚᛟ᛫ᚹᛟᚱᛚᛞ Are you sure about Rune? It's a Unicode Block after all. Maybe rather grapheme?

KrzysztofCwalina commented 6 years ago

I think in graphemics (branch of science studying writing) rune is indeed a grapheme. I think in software engineering, rune is a code point. But possibly it's not such a clear cut as I think. The point I was trying to make is using "rune" to mean Unicode Scalar would be at least yet another overload of the word "rune".

Tornhoof commented 6 years ago

Thank you for clarifying that.

benaadams commented 6 years ago

The UnicodeScalar is a more sane type to use than char as it always encompasses a full character rather than potentially half a character, which char can do.

However, while UnicodeScalar is fine for a library type it isn't great for common usage for people that don't like var as you'd go from the less correct

foreach (char ch in str)
{
    // ...
}

to the more correct

foreach (UnicodeScalar scalar in ustr)
{
    // ...
}

which is a less desirable, very verbose, code representation for a single character.

KrzysztofCwalina commented 6 years ago

Aside from reusable library developers, I don't think many people will be using (referring to) UnicodeScalar. Most people will create or get an instance of Utf8String and call the instance methods on this type, and most of these either don't use UnicodeScalar or will be called by passing a literal ('a' or 65). If somehow the type becomes super popular, we can think about adding a language alias.

benaadams commented 6 years ago

I think you might be underestimating the usage...

ReadOnlySpan<UnicodeScalar> values = new UnicodeScalar[] {'😊','😎','😥','🎄'};
int index = str.IndexOfAny(values);

Or iterating over chars to determine what emoji-range the character fits into for "sentiment" analysis

KrzysztofCwalina commented 6 years ago

Yeah, my teenage daughter also thinks I underestimate the value and usage of emojis :-) And I think you both might be right :-)

benaadams commented 6 years ago

Unless you are doing interop, Utf8String is probably going to be the go-to string type. It's smaller when ASCII, deals with whole characters, and is more efficient when transferring on the wire (as it's already in the correct format so needs no transcoding).

Whereas string is twice the size when ASCII and deals in half characters; you have to worry about byte order; and often you have to repeatedly transcode to UTF-8. In my experience people just ignore the half-character issue, so most current string handling is wrong*. Using Utf8String means most string handling would be correct by default.

So I think use of UnicodeScalar may be higher than anticipated.

*From my limited observations

whoisj commented 6 years ago

However, while UnicodeScalar is fine for a library type it isn't great for common usage for people that don't like var as you'd go from the less correct

QFT. This is my issue, @benaadams has hit the nail on the head here. Fairly sure we cannot expect char and string to be co-opted by UnicodeScalar and Utf8String but it would be lovely if they could be.

migueldeicaza commented 6 years ago

Disagree with the idea that this won’t be used extensively: it will be used everywhere a char is used today. Almost every use of Char needs to be replaced by this.

You can search for ‘char’ in any .NET code base to get an idea of how often people use this primitive type.

char is essentially broken for proper processing, we only survive because most people brush this off as ‘something went wrong on a corner case’.

Basically, .NET today encourages subtly broken code by default which you can make correct with a lot of work.

We need to strive to create an environment where we make it easy for people to write correct code from the start, for them to fall into the pit of success.

Long names like ‘UnicodeScalar’ are just going to prevent people from embracing it by default. The notion that “we will wait and see if there is demand” is a self-fulfilling prophecy of failure.

Research-wise, just grep for ‘rune’ in the Go codebase to see how misguided the idea is that this is a rare type.

Go here ensures that any beginner gets correct code from the start. We ended up in the other extreme.

benaadams commented 6 years ago

Yeah, my teenage daughter also thinks I underestimate the value and usage of emojis :-)

Windows even has an emoji keyboard...

[screenshot of the Windows emoji keyboard]

We've seen a huge rise in the use of emoji and the odder variants of unicode (script, upside down letters, lookalike chars etc) in our consumer focused applications. Initially we were resistant to it; but at this point it's basically time to wholly embrace it as just the way things are.

So we will be moving to Utf8String for everything when it's available (other than system calls), and using it for all new projects.

Also, ASP.NET Core should consider moving precompiled Razor pages to be Utf8String based rather than string based and going through Encoding.UTF8 for every static string in every page request.

benaadams commented 6 years ago

Also all these methods are broken for supplementary plane chars/emoji

partial class String
{
    bool Contains(char ...);
    int IndexOf(char ...);
    int IndexOfAny(char[] ...);
    bool EndsWith(char ...);
    string Join<T>(char, IEnumerable<T>);
    string Join(char, ...);
    int LastIndexOf(char, ...);
    int LastIndexOfAny(char, ...);
    string PadLeft(int, char);
    string PadRight(int, char);
    string Replace(char, char);
    string[] Split(char ...);
    string[] Split(char[] ...);
    string Trim(char);
    string Trim(char[]);
    string TrimEnd(char);
    string TrimEnd(char[]);
    string TrimStart(char);
    string TrimStart(char[]);
}

So should the following additional overloads be added to regular string?

partial class String
{
    bool Contains(UnicodeScalar ...);
    int IndexOf(UnicodeScalar ...);
    int IndexOfAny(ReadOnlySpan<UnicodeScalar> ...);
    bool EndsWith(UnicodeScalar ...);
    string Join<T>(UnicodeScalar, IEnumerable<T>);
    string Join(UnicodeScalar, ...);
    int LastIndexOf(UnicodeScalar, ...);
    int LastIndexOfAny(UnicodeScalar, ...);
    string PadLeft(int, UnicodeScalar);
    string PadRight(int, UnicodeScalar);
    string Replace(UnicodeScalar, UnicodeScalar);
    string[] Split(UnicodeScalar ...);
    string[] Split(ReadOnlySpan<UnicodeScalar> ...);
    string Trim(UnicodeScalar);
    string Trim(ReadOnlySpan<UnicodeScalar>);
    string TrimEnd(UnicodeScalar);
    string TrimEnd(ReadOnlySpan<UnicodeScalar>);
    string TrimStart(UnicodeScalar);
    string TrimStart(ReadOnlySpan<UnicodeScalar>);
}

Or would they go via an implicit conversion to string; and the string overloads? Or would the C# compiler change the embedded type from UnicodeScalar to string depending on what overload was available at the call site?

This would have been popular/memed recently for some unknown reason:

string.Join('👏', words)
ufcpp commented 6 years ago

An example of why we should use UnicodeScalar is the C# compiler. It has a bug when handling surrogate pairs:

class Program
{
    static void Main()
    {
        // Error CS1056 Unexpected character
        int 𩸽 = 2; // CJK Extension B
        int 𒀀 = 3; // Cuneiform
        int 𓀀 = 5; // Egyptian Hieroglyph
        System.Console.WriteLine(𩸽 * 𒀀 * 𓀀);
    }
}

The Unicode category of these characters is Lo (Other Letter), and Lo characters can be used in identifiers according to the C# spec.

Many languages other than C# handle surrogate pairs correctly.

Go:

package main
import "fmt"
func main() {
    𩸽 := 2
    𒀀 := 3
    𓀀 := 5
    fmt.Println(𩸽 * 𒀀 * 𓀀)
}

Java:

public class HelloWorld
{
  public static void main(String[] args)
  {
    int 𩸽 = 2;
    int 𒀀 = 3;
    int 𓀀 = 5;
    System.out.print(𩸽 * 𒀀 * 𓀀);
  }
}
KrzysztofCwalina commented 6 years ago

Let me clarify: I definitely don't think we should be using "char" or any such type in Utf8String APIs. The APIs need to be 100% reliable, and UnicodeScalar is the only way to do it, i.e. I am a big fan of UnicodeScalar. What I was saying is that I don't think users will have to refer to the type often, e.g. utf8String.Split('a') would call a method taking UnicodeScalar, and the C# compiler would target-type the parameter, i.e. there would be no Char created and then converted to UnicodeScalar; C# would create the scalar value directly.

[Edit]: let me also motivate why I even make the claim: some people commented that UnicodeScalar is a long name and that we need a language alias for it. I think the type will not be referred to often enough to justify adding language alias, at least not initially. If we discover that people refer to it all the time and it becomes bread and butter of C# programming, we can always add an alias later.

whoisj commented 6 years ago

The APIs need to be 100% reliable, and UnicodeScalar is the only way to do it

Agreed on the concept that a 32-bit Unicode scalar type needs to be present, and the string (notice not string) types need to utilize that and not the 16-bit char type.

I think the type will not be referred to often enough to justify adding language alias, at least not initially.

I disagree. I tend to have code littered with char[] and const char declarations. I just opened a random project and did a search for \bchar\b and in 72 files I got 155 hits. Having to type / look at UnicodeScalar in place of all of those char keywords would be... uh... less than optimal, yes - let's put it that way because it sounds pleasant and professional: "less than optimal".

GrabYourPitchforks commented 6 years ago

Since the question of graphemes came up, I'll mention that we've been punting on the idea of having graphemes as a first-class citizen in the framework. (By "grapheme", I mean interpreting the 2-scalar sequence [ U+1F474 Older Man Emoji, U+1F3FF Fitzpatrick Type 6 Skin Modifier ] as the single grapheme "Older man with Fitzpatrick type 6 skin".) The reason we had been punting on this is that it tends to be more of a UI / text editor concern - not a general framework concern - and there is the existing TextElementEnumerator type if you're willing to pull up your sleeves.
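For reference, the existing escape hatch looks roughly like this (StringInfo / TextElementEnumerator live in System.Globalization today; how closely they follow the latest grapheme-cluster rules depends on the runtime version):

using System;
using System.Globalization;

string s = "e\u0301!"; // 'e' + combining acute accent, then '!'
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
{
    // The base character plus its combining mark is reported as one text element.
    Console.WriteLine((string)enumerator.Current);
}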

@migueldeicaza, I think you had some early thoughts on this a while back. Has your thinking changed on what kind of support we should have in-box for this? Is this really a concern for the BCL, or does it properly belong in a separate package?

GrabYourPitchforks commented 6 years ago

Some other feedback now that I'm rereading this thread.

@benaadams - I agree that we should add UnicodeScalar-accepting APIs to System.String if there's a need to do so. If the majority use case is that developers are searching for literals, the existing APIs like String.IndexOf("<my multi-char emoji>") already work just fine.

@whoisj - Regarding your const char fields, what if the compiler could implicitly convert char to UnicodeScalar? That way your call site would look like myUtf8String.IndexOf(my_const_char). We can also consider adding an explicit conversion operator between the two types.

migueldeicaza commented 6 years ago

Some comments on the API from last week:

Given that it should be possible to create Utf8Strings from invalid operations, we need a way of returning a value that indicates that there was an error processing the utf8 sequence in the buffer and indicate that this was caused due to this error. NStack has a port of Go's libraries that do this.

It is not clear why Length of Runes should be O(1); it seems like a waste of space, especially considering that iterating over the values is not O(1) anyway.

When processing utf8strings you really want to have access to the byte length; that is missing.

I would add a few things:

There is a proposal for UnicodeScalar; you should lift the operations I submitted before onto a better-named type, Rune, which has a comprehensive API that I have been using for a while.

I don’t think that it is a good idea to limit UnicodeScalar to valid values, I think you should instead have an IsValid method.

My feeling is that if you want a string with training wheels we have System.String already - but a case should be made for a System.UnicodeString that is made up of 32-bit runes, and that has O(1) indexing capabilities.

One nice capability that the Go API has is that enumerating over the Runes in a string is not limited to obtaining the individual runes; it also yields the offset where each rune was found. In NStack, I have a similar method that returns a tuple (int index, Rune rune) that achieves this.
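A sketch of that enumeration shape in C#, written here against the System.Text.Rune decoding API since the proposal's own enumerator isn't available (the tuple elements mirror the NStack method described above):

using System;
using System.Collections.Generic;
using System.Text;

static List<(int Index, Rune Rune)> EnumerateRunesWithOffsets(ReadOnlySpan<byte> utf8)
{
    var result = new List<(int, Rune)>();
    int offset = 0;
    while (offset < utf8.Length)
    {
        // Invalid subsequences come back as U+FFFD with a positive 'consumed',
        // so the loop always advances.
        Rune.DecodeFromUtf8(utf8.Slice(offset), out Rune rune, out int consumed);
        result.Add((offset, rune)); // byte offset at which this rune starts
        offset += consumed;
    }
    return result;
}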

GrabYourPitchforks commented 6 years ago

See also the UTF-8 string scenario and design philosophy document (https://github.com/dotnet/corefxlab/issues/2368).

GrabYourPitchforks commented 6 years ago

There's some confusion over runtime complexity of the Utf8String length APIs. Fetching the code unit count ("byte length", if you will) is O(1) complexity. The Utf8String type internally doesn't keep track of the total number of scalars ("runes", if you will), and there's no API on Utf8String to fetch this count. There are other APIs which will give you this information, but the proposal here doesn't expose those APIs on Utf8String.

GrabYourPitchforks commented 6 years ago

I've added the missing APIs to the Tuesday review. Thanks for the eagle eyes @migueldeicaza!

whoisj commented 6 years ago

As a note, it seems to me that Utf8String.Length ought to be the byte length of the underlying array (it is an array, isn't it - are we considering using linked buffers or something - doesn't really matter).

Given that Utf8String is likely immutable, and therefore the count of characters cannot be cached in the type after initialization, and nobody in their right mind would suggest computing the actual character count of a Utf8String at allocation, we likely need a .Count() method which does the calculation when invoked.
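A sketch of what such an on-demand count could do internally: in UTF-8, every scalar contributes exactly one non-continuation byte, so counting scalars in a well-formed buffer is a single pass.

// Counts Unicode scalars in a well-formed UTF-8 buffer by skipping
// continuation bytes (those of the form 10xxxxxx).
static int CountScalars(ReadOnlySpan<byte> utf8)
{
    int count = 0;
    foreach (byte b in utf8)
    {
        if ((b & 0xC0) != 0x80)
            count++;
    }
    return count;
}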

The Utf8String.Enumerator implementation ought to be interesting. There are a lot of ways to do this, none of which are particularly pleasant. 😕

@GrabYourPitchforks if you're looking for any parallel implementations (aside from the stuff by @migueldeicaza) let me know and I can send you a few links to internal source we use to handle Git strings (Git is UTF-8 through-and-through) - ironically, we called our type StringUtf8 😏

Also, if you're just plain sick of my feedback let me know and I'll go sulk in a corner quietly :grin:

As a complete, and mostly unrelated, side note: I've always hated the String APIs which take StringComparison. I would so much rather every API took a StringComparer implementation, so we could leave the annoying and mostly useless StringComparison enumeration in the bin.

Any chance we can avoid dragging it into the future via this API set? 🙏 :bow: 🙏

whoisj commented 6 years ago

My feeling is that if you want a string with training wheels we have System.String already - but a case should be made for a System.UnicodeString that is made up of 32-bit runes, and that has O(1) indexing capabilities.

On this topic, I cannot count the number of times that a p/invoke call to some library has returned corrupted or invalid string values; and since System.String lacks any ability to self-validate, I always end up adding validation routines to the project. Seems like something we ought to avoid in the future, à la @migueldeicaza's recommendation (this is what I was trying, and failing, to illuminate above).

Oh, and in case it needs to be said: much ❤️ and 🙇 for @GrabYourPitchforks for even working on this API. It is long over due and is such a hot topic; it's pure heroism to work on it. 😃

GrabYourPitchforks commented 6 years ago

@whoisj By self-validate a System.String instance, you mean looking for mismatched surrogate pairs? I considered in this project adding validation APIs to both Utf8String and String but ultimately decided against it for a few reasons. I didn't want developers to feel obligated to call it before consuming the instances. And for String in particular, it's generally very difficult to create a malformed instance in the first place without bit-twiddling a char[]. If you're running into the need to validate in a production app I'm certainly willing to reconsider those decisions.

@whoisj What's your concern with StringComparison? Are your scenarios working with specific cultures rather than the invariant culture or the current thread's culture?

@migueldeicaza Interestingly, we considered a fully UTF-32 string type a few weeks ago, and I don't think it's a crazy idea. The primary scenario we came up with was a text editor or other UI-based application. I think if we wanted to give that scenario proper respect we'd also want to consider grapheme representation in the framework and plumb it through as an in-box concept. Do you think server applications might need this in addition to UI applications? (As an aside, C++'s std::wstring on most non-Windows platforms is UTF-32, and it seemingly doesn't enjoy wide use.)

whoisj commented 6 years ago

@whoisj What's your concern with StringComparison? Are your scenarios working with specific cultures rather than the invariant culture or the current thread's culture?

More often than not, I work on library code that interops with external software. I am primarily concerned with Ordinal and OrdinalIgnoreCase; very, very rarely do I need to care about culture.

Often, I need to produce custom string comparers, and when this happens every entry point on string that takes a StringComparison becomes useless to me and I end up re-implementing them.

What kind of custom string comparer could one be writing that isn't provided by NetFx? Well, several projects I've worked on recently needed custom file system path comparers, and for the comparer to be chosen based on conditions. For example, on Windows, paths that do not begin with "\\?\" treat '/' and '\' interchangeably, thus a custom comparer is necessary. I could just

StringComparer.OrdinalIgnoreCase.Equals(lhsPath.Replace('/', '\\'), rhsPath.Replace('/', '\\'))

but we literally compare thousands of paths in certain cases and thrashing the HEAP with needless string allocations is really terrible. So instead, I have chunking logic which breaks on path separators... blah, blah, blah.

Now consider the logic necessary for something like bool IsChildPath(string parentPath, string childPath). Internally I could use string.StartsWith(...) but it doesn't accept a StringComparer; instead it takes a StringComparison. That leaves me to author my own static bool StartsWith(this String value, String prefix, StringComparer comparer) method. If System.String.StartsWith accepted a StringComparer, life would just be better.
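For illustration, a rough sketch of the kind of allocation-free helper being described, comparing ordinally ignoring case and treating '/' and '\' as the same separator (illustrative only; real path comparison has more edge cases):

static bool StartsWithPath(ReadOnlySpan<char> path, ReadOnlySpan<char> prefix)
{
    if (prefix.Length > path.Length)
        return false;

    for (int i = 0; i < prefix.Length; i++)
    {
        // Normalize separators, then compare ordinally ignoring case.
        char a = path[i] == '/' ? '\\' : path[i];
        char b = prefix[i] == '/' ? '\\' : prefix[i];
        if (char.ToUpperInvariant(a) != char.ToUpperInvariant(b))
            return false;
    }
    return true;
}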

@whoisj By self-validate a System.String instance, you mean looking for mismatched surrogate pairs?

Exactly.

I didn't want developers to feel obligated to call it before consuming the instances.

Developers should not feel obligated to use a validation API, but the lack of a validation API can cause heartburn. Consider the developer who is writing software that reads data from a stream or shared memory. There's always a chance something got corrupted, so having a built-in way to validate the data would be rather useful. Perhaps developers writing code like this are rare enough that NetFx doesn't need it, in which case I'll continue to keep writing my own. 😁

... we considered a fully UTF-32 string type a few weeks ago, and I don't think it's a crazy idea. Do you think server applications might need this in addition to UI applications?

I can see utility in an indexable string type, but it'll be very specialized. I'd much rather see the work you're doing here stay the focus. UTF-8 as the internal encoding for character data is extremely valuable, especially when memory isn't plentiful and cheap.

... oh, as an aside - are there going to be Utf8StringComparer types provided? If so, have you thought about the implementation details yet?