GrabYourPitchforks closed this issue 3 years ago.
Hmm, but a UTF-8 "char" can really be more than one byte, so how will you represent this as an alias of `Byte`? For me `Rune` is better, and I'd find it confusing to have `Rune` and `Utf8Char` together...
@fanoI `Utf8Char` is intended to be treated as an 8-bit code unit (see http://www.unicode.org/glossary/#code_unit), similarly to how `System.Char` is intended to be treated as a 16-bit code unit. Neither is guaranteed to be an actual character in the linguistic sense. `Rune` is a 24-bit scalar value (http://www.unicode.org/glossary/#unicode_scalar_value) whose meaning is independent of the actual encoding.

FWIW, you'd only ever see `Utf8Char` if you're trying to iterate over the code units of a `Utf8String` directly, just like you'd only ever see `Char` if you're trying to iterate over the code units of a `String` directly. It's not something most developers will be exposed to.
This proposal seems to imply that `Utf8Char` has value because it is a "validated code unit". But:

a) As you observed, a `Span<Utf8Char>` can still contain invalid data; a simple slice can split a valid sequence in the middle of a code point:

```csharp
Span<Utf8Char> validUtf8 = Utf8String.CreateValidated("\u2019").Bytes;
Span<Utf8Char> invalidUtf8 = validUtf8.Slice(1);
```
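To make the invalid-slice point concrete, here is the same failure mode sketched in Python (an analogy only: `bytes` stands in for the proposed `Utf8Char` buffer). U+2019 encodes to the three UTF-8 code units `E2 80 99`; slicing off the lead byte leaves bare continuation bytes that no longer decode.

```python
# U+2019 (right single quotation mark) encodes to three UTF-8 code units: E2 80 99.
valid_utf8 = "\u2019".encode("utf-8")
assert valid_utf8 == b"\xe2\x80\x99"

# A perfectly legal slice strips the lead byte, leaving bare continuation bytes.
invalid_utf8 = valid_utf8[1:]
assert invalid_utf8 == b"\x80\x99"

# The sliced buffer is still "UTF-8 code units" by type, but no longer well-formed.
try:
    invalid_utf8.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
assert not still_valid
```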
@KrzysztofCwalina It's not quite a "validated code unit". Consider the existing behavior that .NET has regarding `byte[]`, `char[]`, and `String`. Today, pretty much all factories that create `String` (or `char[]`) instances from an incoming `byte[]` will ensure that the output is well-formed. Even if using a UTF-16LE decoder, the conversion from `byte[]` to `char[]` performs validation rather than doing a simple bitblt. Every deserialization library I've ever reviewed also performs this check. This means that it's pretty much impossible for untrusted user input to end up directly as a non-UTF16-compliant `String` instance.

Now, this doesn't mean that every `String` instance is guaranteed to be UTF16-compliant. For example, the developer could populate a `char[]` manually and call the `String` ctor. He could split a `String` instance across a surrogate pair. He could bitblt a `byte[]` to a `char[]`. But the key point is that the developer must go out of his way to do this. The standard `String` construction routines, encoding / decoding routines, and serializers do not allow this.
I'm advocating that `Utf8Char` have the same meaning and that `Utf8String` have the same behavior. Any routines provided by the Framework or libraries to convert from `byte[]` to `Utf8Char[]` or `Utf8String` necessarily result in validation. If the developer constructs a `Utf8Char[]` manually, or splits a `Utf8String` instance in the middle of a code point, or bitblts a `byte[]` to a `Utf8Char[]`, he can end up with a non-UTF8-compliant `Utf8String` instance.

That's why I think it's important to distinguish between `byte` as the unit of representation for binary data and `Utf8Char` as the unit of representation for UTF-8 textual data. It prevents developers from treating any arbitrary `byte[]` as if it were intended to be valid textual UTF-8 data.
This has consequences for the proposed `Utf8String` class. For instance, the constructor that accepts a `Span<byte>` would assume that the incoming data is unvalidated UTF-8 and would need to validate accordingly. The constructor that accepts a `Span<Utf8Char>` would assume the incoming data is valid and wouldn't perform any additional validation. The constructor that accepts a `Span<char>` would perform a conversion from UTF-16LE (or UTF-16BE, depending on platform) to UTF-8.
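The two interesting construction paths (validate raw bytes vs. transcode 16-bit code units) can be sketched in Python as a rough analogy; here `bytes` stands in for `Span<byte>`, and a UTF-16LE byte buffer stands in for the `Span<char>` code units:

```python
# Path 1: the Span<byte> analog -- unvalidated bytes, so construction must validate.
raw = b"\xe2\x80\x99"
text = raw.decode("utf-8")            # raises UnicodeDecodeError if malformed
assert text == "\u2019"

# Path 2: the Span<char> analog -- 16-bit code units, so construction transcodes
# UTF-16LE to UTF-8 (validating surrogates along the way).
utf16le_units = "\u2019".encode("utf-16-le")
assert utf16le_units == b"\x19\x20"
transcoded = utf16le_units.decode("utf-16-le").encode("utf-8")
assert transcoded == raw
```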
BTW, for library or framework developers, you could imagine a convenience API that switches between the two.

```csharp
static Span<Utf8Char> GetUtf8Chars(Span<byte>);
```

The exact shape is TBD, but the idea is that if the input is well-formed UTF-8, it would simply re-cast the `Span<byte>` to a `Span<Utf8Char>`; and if the input is malformed UTF-8, it would allocate a `Utf8Char[]` under the covers where all invalid code units have been converted to U+FFFD, and the returned `Span<Utf8Char>` would instead point to that. We'd have to perform some tests to make sure devs don't fall into a pit of failure with this.
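The proposed fallback behavior happens to mirror Python's `errors="replace"` decoder handler, so it can be sketched concretely (`get_utf8_chars` is a hypothetical name; `bytes` stands in for both `Span<byte>` and `Span<Utf8Char>`):

```python
def get_utf8_chars(data: bytes) -> bytes:
    """Sketch of the proposed behavior: return the buffer as-is when it is
    well-formed UTF-8; otherwise produce a new buffer in which each invalid
    sequence has been replaced by U+FFFD (re-encoded here as UTF-8 bytes)."""
    try:
        data.decode("utf-8")   # validation pass
        return data            # well-formed: the "re-cast", no copy needed
    except UnicodeDecodeError:
        # Malformed: substitute U+FFFD, like the proposed Utf8Char[] fallback
        # allocation "under the covers".
        return data.decode("utf-8", errors="replace").encode("utf-8")

assert get_utf8_chars(b"\xe2\x80\x99") == b"\xe2\x80\x99"   # valid: unchanged
assert get_utf8_chars(b"\xff") == b"\xef\xbf\xbd"           # invalid: U+FFFD
```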
(A better name might be `Utf8CodeUnit`. See D77 of the Unicode Standard.)
I have to say I still don't understand the value of this type, i.e., what guarantees does it give you above `byte`?

> That's why I think it's important to distinguish between byte as the unit of representation for binary data and Utf8Char as the unit of representation for UTF8 textual data. It prevents developers from treating any arbitrary byte[] as if it were intended to be valid textual UTF8 data.

Here I am treating an arbitrary `byte[]` as UTF-8: `var text = Encoding.UTF8.GetString(new byte[] { 255 });`

So, I don't think we can prevent developers from "treating arbitrary byte[] as UTF8", nor is that probably the goal you had in mind. And so, I would love it if we articulated very precisely what we want to accomplish, really, by introducing this type.
Separately, a simple slice on a `Span<Utf8Char>` can produce an invalid sequence. Also, our transformation APIs (ITransformation) have to operate on `Span<byte>`.

Possibly we could chat in person for me to understand your point better, and then we would report back to this thread?
That's a fair point. I don't think I have elucidated the scenario very well. Give me some time to collect my thoughts and we can discuss.
FWIW, as described earlier, I'm not concerned about the scenario where the developer intentionally splits a valid `Span` in such a way as to create a new invalid `Span`.
We are settling on `Memory<byte>`. If we wanted to do the same for `Utf8String`, we would need some other type than `byte`, as we don't want text-specific extension methods to show up in `Memory<byte>`.

The main thing for me would be the perf impact of using a struct wrapping `byte` vs `byte` directly.
While looking at the encoding and transformation APIs I noticed that all data is represented by `Span<byte>`. I believe this behavior to be incorrect because it doesn't draw a distinction between textual data (data that consists of letters, symbols, and graphemes) and binary data (data that consists of arbitrary octets). I believe that if we just expose all data, both textual and binary, as `Span<byte>` and require developers to know from context how the data should be treated, then this will lead to a pit of failure.

I propose a new type `System.Utf8Char`, which is an 8-bit analog to `System.Char`. The general idea is that if the developer sees `Span<byte>` in his code, he should treat it as unstructured binary data; but if he sees `Span<Utf8Char>` or `Span<char>` in his code, then he knows that it's meaningful textual data.

One key difference between binary data and textual data is that binary data can interact directly with i/o. Textual data is not interchangeable outside the application unless it is first somehow converted to binary data. We would provide methods for developers to convert between `Span<byte>` and `Span<Utf8Char>`. The conversion routine would basically be a glorified validation routine and memcpy. If a developer really wanted to avoid the memcpy and knew ahead of time that the incoming binary data was well-formed UTF-8, then he could bitblt from `Span<byte>` to `Span<Utf8Char>` directly, but we should take care not to encourage this as a common practice, since bypassing validation could lead to security issues at runtime. Ideally this new type would be an intrinsic, but it can be mocked at the moment by using a custom struct.
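The "glorified validation routine and memcpy" versus the bitblt shortcut can be contrasted in a small Python sketch (hypothetical names; `bytes` stands in for both `Span<byte>` and the `Utf8Char` buffer):

```python
def to_utf8_chars(data: bytes) -> bytes:
    """The safe conversion: validate, then copy. Rejects malformed input
    instead of letting it masquerade as text."""
    data.decode("utf-8")   # validation pass; raises UnicodeDecodeError if malformed
    return bytes(data)     # the memcpy

def bitblt_utf8_chars(data: bytes) -> bytes:
    """The unsafe shortcut: reinterpret without validating. Cheap, but it is
    exactly the practice the text above warns against encouraging."""
    return data            # no validation at all

assert to_utf8_chars(b"\xe2\x80\x99") == b"\xe2\x80\x99"
# The bitblt happily passes malformed input through:
assert bitblt_utf8_chars(b"\xff") == b"\xff"
```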
The other reason for the introduction of this type is that the encoding and decoding transforms now, as part of their API signature, explicitly state whether they're intended to act on binary data or textual data. For example, you'd have `ToLower(Span<Utf8Char>)` instead of `ToLower(Span<byte>)`, since only text, not arbitrary binary data, can be converted to lowercase. And you'd have `Decompress(Span<byte>)` instead of `Decompress(Span<Utf8Char>)`, since compression routines inherently work on arbitrary binary data, not on textual data.

You can even extend this to a heapable `System.Utf8String` type whose API surface is equivalent to `System.String` in almost all ways, but which is backed by a collection of `Utf8Char` elements instead of `Char` elements. Ref struct equivalents would be `Utf8StringSegment` (which is really just a `ReadOnlySpan<Utf8Char>` with a bunch of string-like helper methods) and `StringSegment` (which is really just `ReadOnlySpan<char>`). But I believe the design of the primitive stands on its own merit, independent of whether these wrapper classes ever come to fruition.

This proposal is not intended to replace the `System.Rune` proposal. There is significant value in having that type, since you can imagine scenarios where developers simply want to iterate over some piece of textual data without worrying about what the underlying encoding is, and `Rune` is an appropriate way to surface that.