dotnet / corefxlab

This repo is for experimentation and exploring new ideas that may or may not make it into the main corefx repo.

Introduce new primitive type System.Utf8Char #1799

Closed. GrabYourPitchforks closed this issue 3 years ago.

GrabYourPitchforks commented 6 years ago

While looking at the encoding and transformation APIs, I noticed that all data is represented by Span<byte>. I believe this behavior to be incorrect because it doesn't draw a distinction between textual data (data that consists of letters, symbols, and graphemes) and binary data (data that consists of arbitrary octets). I believe that if we expose all data - both textual and binary - as Span<byte> and require developers to know from context how the data should be treated, then this will lead to a pit of failure.

I propose a new type System.Utf8Char which is an 8-bit analog to System.Char. The general idea is that if the developer sees Span<byte> in his code, he should treat it as unstructured binary data; but if he sees Span<Utf8Char> or Span<char> in his code, then he knows that it's meaningful textual data.

One key difference between binary data and textual data is that binary data can interact directly with I/O. Textual data is not interchangeable outside the application unless it is first somehow converted to binary data. We would provide methods for developers to convert between Span<byte> and Span<Utf8Char>. The conversion routine would basically be a glorified validation routine and memcpy. If a developer really wanted to avoid the memcpy and knew ahead of time that the incoming binary data was well-formed UTF8, he could bitblt from Span<byte> to Span<Utf8Char> directly, but we should take care not to encourage this as a common practice, since bypassing validation could lead to security issues at runtime. (A sketch of both conversions follows the struct definition below.)

Ideally this new type would be an intrinsic, but it can be mocked at the moment by using a custom struct.

using System.Runtime.InteropServices;

namespace System
{
    // An 8-bit type similar to but distinct from System.Byte.
    // Utf8Char is not integral (no arithmetic operations) or binary (no bitwise operations)
    // but is comparable (allow ==, <, etc.).
    [StructLayout(LayoutKind.Sequential, Size = 1)] // an explicit Size requires sequential (or explicit) layout
    public struct Utf8Char : IComparable<Utf8Char>, IEquatable<Utf8Char>
    {
        private readonly byte _value;

        // Constructs from the raw code unit value; also the target of the
        // implicit byte conversion below.
        public Utf8Char(byte value)
        {
            _value = value;
        }

        public static bool operator ==(Utf8Char a, Utf8Char b) => a._value == b._value;

        public static bool operator !=(Utf8Char a, Utf8Char b) => a._value != b._value;

        public static bool operator <(Utf8Char a, Utf8Char b) => a._value < b._value;

        public static bool operator <=(Utf8Char a, Utf8Char b) => a._value <= b._value;

        public static bool operator >(Utf8Char a, Utf8Char b) => a._value > b._value;

        public static bool operator >=(Utf8Char a, Utf8Char b) => a._value >= b._value;

        public static implicit operator byte(Utf8Char value) => value._value;

        public static implicit operator Utf8Char(byte value) => new Utf8Char(value);

        // other implicit conversions go here
        // if intrinsic then casts can be properly checked or unchecked

        public int CompareTo(Utf8Char other)
        {
            return this._value.CompareTo(other._value);
        }

        public override bool Equals(object other)
        {
            return (other is Utf8Char) && (this == (Utf8Char)other);
        }

        public bool Equals(Utf8Char other)
        {
            return (this == other);
        }

        public override int GetHashCode()
        {
            return _value;
        }

        public override string ToString()
        {
            return _value.ToString();
        }
    }
}
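
To make the conversion story above concrete, here is a minimal sketch against the mock struct, with placeholder method names. Utf8.IsValid is borrowed from a later framework version (.NET 8); on today's libraries you would hand-roll the validation loop.

using System;
using System.Runtime.InteropServices;
using System.Text.Unicode;

public static class Utf8Conversions
{
    // The "glorified validation routine and memcpy": reject ill-formed
    // input, then copy the octets into the textual representation.
    public static Utf8Char[] ToUtf8Chars(ReadOnlySpan<byte> unvalidated)
    {
        if (!Utf8.IsValid(unvalidated))
            throw new ArgumentException("Input is not well-formed UTF-8.");
        return MemoryMarshal.Cast<byte, Utf8Char>(unvalidated).ToArray();
    }

    // The discouraged zero-copy bitblt: the caller asserts the data is
    // already well-formed UTF-8, so validation is skipped entirely.
    public static ReadOnlySpan<Utf8Char> ToUtf8CharsUnchecked(ReadOnlySpan<byte> trusted)
        => MemoryMarshal.Cast<byte, Utf8Char>(trusted);
}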

The other reason for introducing this type is that the encoding and decoding transforms would now explicitly state, as part of their API signatures, whether they're intended to act on binary data or textual data. For example, you'd have ToLower(Span<Utf8Char>) instead of ToLower(Span<byte>), since only text - not arbitrary binary data - can be converted to lowercase. And you'd have Decompress(Span<byte>) instead of Decompress(Span<Utf8Char>), since compression routines inherently work on arbitrary binary data, not on textual data.
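
Hypothetical signatures (the names here are illustrative, not proposed API) showing how the split keeps binary data out of text APIs at compile time. The ASCII-only lowering is safe to do in place on UTF-8 because every non-ASCII code unit is >= 0x80 and is left untouched:

using System;

public static class Utf8Text
{
    // Accepts only textual data; passing a raw Span<byte> will not compile.
    public static void ToLowerInPlace(Span<Utf8Char> text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            byte b = text[i];                     // implicit Utf8Char -> byte
            if (b >= (byte)'A' && b <= (byte)'Z')
                text[i] = (byte)(b | 0x20);       // implicit byte -> Utf8Char
        }
    }

    // Accepts only binary data; a Span<Utf8Char> would need an explicit conversion.
    public static int Decompress(ReadOnlySpan<byte> input, Span<byte> output)
        => throw new NotImplementedException(); // placeholder
}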

You can even extend this to a heapable System.Utf8String type whose API surface is equivalent to System.String in almost all ways but which is backed by a collection of Utf8Char elements instead of Char elements. Ref struct equivalents would be Utf8StringSegment (which is really just a ReadOnlySpan<Utf8Char> with a bunch of string-like helper methods) and StringSegment (which is really just ReadOnlySpan<char>). But I believe the design of the primitive stands on its own merit independent of whether these wrapper classes ever come to fruition.

This proposal is not intended to replace the System.Rune proposal. There is significant value in having that type since you can imagine scenarios where developers simply want to iterate over some piece of textual data without worrying about what the underlying encoding is, and Rune is an appropriate way to surface that.

fanoI commented 6 years ago

Hmm, but a UTF-8 "char" can really be more than one byte; how will you represent this as an alias of Byte? For me Rune is better, and I'd find it confusing to have Rune and Utf8Char together...

GrabYourPitchforks commented 6 years ago

@fanoI Utf8Char is intended to be treated as an 8-bit code unit (see http://www.unicode.org/glossary/#code_unit), similarly to how System.Char is intended to be treated as a 16-bit code unit. Neither is guaranteed to be an actual character in the linguistic sense. Rune is a 21-bit scalar value (http://www.unicode.org/glossary/#unicode_scalar_value) whose meaning is independent of the actual encoding.

FWIW, you'd only ever see Utf8Char if you're trying to iterate over the code units of a Utf8String directly, just like you'd only ever see Char if you're trying to iterate over the code units of a String directly. It's not something most developers will be exposed to.
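
To make the code-unit vs. scalar distinction concrete, a small sketch; Rune here is the System.Text.Rune that eventually shipped in .NET Core 3.0, standing in for the proposal discussed above:

using System;
using System.Text;

class CodeUnitsVsScalars
{
    static void Main()
    {
        string s = "é";                          // U+00E9, a single scalar value
        byte[] utf8 = Encoding.UTF8.GetBytes(s);

        Console.WriteLine(s.Length);             // 1: one UTF-16 code unit (char)
        Console.WriteLine(utf8.Length);          // 2: two UTF-8 code units (Utf8Char)

        foreach (Rune r in s.EnumerateRunes())
            Console.WriteLine(r.Value);          // 233: the scalar, encoding-independent
    }
}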

KrzysztofCwalina commented 6 years ago

This proposal seems to imply that Utf8Char has value because it is a "validated code unit". But: a) as you observed, Span<byte> can simply be cast to Span<Utf8Char>, so consumers cannot assume that validation has happened; b) I am not sure it's interesting to know that a single code unit is validated. I think it would be more interesting to know that a sequence of such code units represents valid UTF8, and this we cannot ensure:

Span<Utf8Char> validUtf8 = Utf8String.CreateValidated("\u2019").Bytes;
Span<Utf8Char> invalidUtf8 = validUtf8.Slice(1);

GrabYourPitchforks commented 6 years ago

@KrzysztofCwalina

It's not quite a "validated code unit". Consider the existing behavior that .NET has regarding byte[], char[], and String. Today, pretty much all factories that create String (or char[]) instances from an incoming byte[] will ensure that the output is well-formed. Even if using a UTF-16LE decoder, the conversion from byte[] to char[] performs validation rather than doing a simple bitblt. Every deserialization library I've ever reviewed also performs this check. This means that it's pretty much impossible for untrusted user input to end up directly as a non-UTF16-compliant String instance.

Now, this doesn't mean that every String instance is guaranteed to be UTF16-compliant. For example, the developer could populate a char[] manually and call the String ctor. He could split a String instance across a surrogate pair. He could bitblt a byte[] to a char[]. But the key point is that the developer must go out of his way to do this. The standard String construction routines, encoding / decoding routines, and serializers do not allow this.

I'm advocating that Utf8Char have the same meaning and that Utf8String have the same behavior. Any routines provided by the Framework or libraries to convert from byte[] to Utf8Char[] or Utf8String necessarily result in validation. If the developer constructs a Utf8Char[] manually, or splits a Utf8String instance in the middle of a code point, or bitblts a byte[] to a Utf8Char[], he can end up with a non-UTF8-compliant Utf8String instance.

That's why I think it's important to distinguish between byte as the unit of representation for binary data and Utf8Char as the unit of representation for UTF8 textual data. It prevents developers from treating any arbitrary byte[] as if it were intended to be valid textual UTF8 data.

This has consequences for the proposed Utf8String class. For instance, the constructor that accepts a Span<byte> would assume that the incoming data is unvalidated UTF8 and would need to validate accordingly. The constructor that accepts a Span<Utf8Char> would assume the incoming data is valid and wouldn't perform any additional validation. The constructor that accepts a Span<char> would perform a conversion from UTF-16LE (or UTF-16BE depending on platform) to UTF8.
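
A minimal sketch of how those three constructors might differ, assuming the mock Utf8Char struct from above; the choice to repair invalid input with U+FFFD (rather than throw) is an assumption:

using System;
using System.Runtime.InteropServices;
using System.Text;

public sealed class Utf8String
{
    private readonly Utf8Char[] _chars;

    // Unvalidated binary input: a UTF-16 round trip replaces any
    // ill-formed sequences with U+FFFD before we accept the data.
    public Utf8String(ReadOnlySpan<byte> unvalidated)
    {
        byte[] wellFormed = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(unvalidated));
        _chars = MemoryMarshal.Cast<byte, Utf8Char>(wellFormed).ToArray();
    }

    // Textual input is trusted to already be well-formed: copy, no re-validation.
    public Utf8String(ReadOnlySpan<Utf8Char> validated)
    {
        _chars = validated.ToArray();
    }

    // UTF-16 input: transcode to UTF-8 (unpaired surrogates become U+FFFD
    // via the encoder's default replacement fallback).
    public Utf8String(ReadOnlySpan<char> utf16)
    {
        byte[] bytes = new byte[Encoding.UTF8.GetByteCount(utf16)];
        Encoding.UTF8.GetBytes(utf16, bytes);
        _chars = MemoryMarshal.Cast<byte, Utf8Char>(bytes).ToArray();
    }
}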

GrabYourPitchforks commented 6 years ago

BTW, for library or framework developers, you could imagine a convenience API that switches between the two.

static Span<Utf8Char> GetUtf8Chars(Span<byte>);

The exact shape is TBD, but the idea is that if the input is well-formed UTF8, it would simply re-cast the Span<byte> to a Span<Utf8Char>; and if the input is malformed UTF8, it would allocate a Utf8Char[] under the covers where all invalid code units have been converted to U+FFFD, and the returned Span<Utf8Char> would instead point to that. We'd have to perform some tests to make sure devs don't fall into a pit of failure with this.
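
A rough sketch of that switching behavior, reusing the placeholder pieces from the earlier snippets (Utf8.IsValid again standing in for whatever validation helper would actually ship):

using System;
using System.Runtime.InteropServices;
using System.Text;
using System.Text.Unicode;

public static class Utf8Convenience
{
    public static Span<Utf8Char> GetUtf8Chars(Span<byte> bytes)
    {
        // Well-formed input: zero-copy re-cast of the same memory.
        if (Utf8.IsValid(bytes))
            return MemoryMarshal.Cast<byte, Utf8Char>(bytes);

        // Malformed input: allocate a repaired copy in which invalid
        // sequences become U+FFFD, via a UTF-16 round trip.
        byte[] repaired = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
        return MemoryMarshal.Cast<byte, Utf8Char>(repaired);
    }
}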

ektrah commented 6 years ago

(A better name might be Utf8CodeUnit. See D77 of the Unicode Standard.)

KrzysztofCwalina commented 6 years ago

I have to say I still don't understand the value of this type, i.e., what guarantees does it give you beyond byte?

That's why I think it's important to distinguish between byte as the unit of representation for binary data and Utf8Char as the unit of representation for UTF8 textual data. It prevents developers from treating any arbitrary byte[] as if it were intended to be valid textual UTF8 data.

Here I am treating arbitrary byte[] as UTF8: var text = Encoding.UTF8.GetString(new byte[] { 255 });

So, I don't think we can prevent developers from "treating arbitrary byte[] as UTF8", nor is that probably the goal you had in mind. And so, I would love it if we articulated very precisely what we want to accomplish, really, by introducing this type.

Separately, a simple slice on Span<Utf8Char> (as in the sample above) would potentially create invalid UTF8, and the developer does not have to "go out of their way" to slice it. Not to mention all the APIs we have that cast away ReadOnly.

Also, our transformation APIs (ITransformation) have to operate on Span<byte> so that they can abstract the operation.

Possibly we could chat in person so I can understand your point better, and then we can report back to this thread?

GrabYourPitchforks commented 6 years ago

That's a fair point. I don't think I have elucidated the scenario very well. Give me some time to collect my thoughts and we can discuss.

FWIW, as described earlier I'm not concerned about the scenario where the developer intentionally splits a valid Span<Utf8Char> in such a way as to create a new invalid Span<Utf8Char>.

KrzysztofCwalina commented 6 years ago

We are settling on Memory<char> and Span<char> being the representation for slices of UTF-16 strings, i.e. we don't think we need a special heap-friendly Utf16Segment.

If we wanted to do the same for Utf8String, we would need some type other than byte, as we don't want text-specific extension methods to show up on Memory<byte>/Span<byte>.

The main thing for me would be the perf impact of using a struct wrapping byte vs byte directly.
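
A crude way to probe that question, assuming the mock struct above; one would expect the JIT to treat a single-byte struct wrapper much like a raw byte, but that is exactly what a measurement (ideally with BenchmarkDotNet rather than this Stopwatch sketch) should confirm:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class StructWrapperPerf
{
    static void Main()
    {
        var bytes = new byte[64 * 1024];
        new Random(42).NextBytes(bytes);
        // Reinterpreting is free: same memory, different element type.
        Span<Utf8Char> chars = MemoryMarshal.Cast<byte, Utf8Char>(bytes);

        var sw = Stopwatch.StartNew();
        long asciiViaByte = 0;
        for (int iter = 0; iter < 10_000; iter++)
            foreach (byte b in bytes)
                if (b < 0x80) asciiViaByte++;
        Console.WriteLine($"byte:     {sw.ElapsedMilliseconds} ms ({asciiViaByte})");

        sw.Restart();
        long asciiViaChar = 0;
        for (int iter = 0; iter < 10_000; iter++)
            foreach (Utf8Char c in chars)
                if ((byte)c < 0x80) asciiViaChar++;
        Console.WriteLine($"Utf8Char: {sw.ElapsedMilliseconds} ms ({asciiViaChar})");
    }
}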