**GrabYourPitchforks** opened this issue 5 years ago
Levi, how exactly does this mesh with `Rune`?

Re the sections quoted below, this gives me the heebie-jeebies. From my one very small point of view, with an emphasis on MSRC-proofing, I would be much happier if `Utf8Char` and `ReadOnlySpan<Utf8Char>` were guaranteed to be valid always.

For the chunking use case, I'd propose some sort of intermediate representation that would convey the semantic intent of "chunk of a well-formed `ReadOnlySpan<Utf8Char>`".

Am I making sense? In my mind, all the concepts in this discussion are related to the concept of 'input taints', but encoded in the type system rather than naming convention.
@terrajobst
> The UTF-8 code unit type `Utf8Char` does not attempt to validate its input. (snip) This creates a `Utf8Char` instance with the value `C0`, even though the Unicode Specification expressly states that `C0` is never a valid value for a UTF-8 code unit.

> Framework APIs which operate on `ReadOnlySpan<Utf8Char>` should validate the input for well-formedness if such checks do not impose a hardship on the method implementation. There may be certain performance-sensitive routines which cannot incur that cost; such routines may assume the input is well-formed and may have undefined behavior if this invariant is violated, short of that behavior causing an access violation or other runtime corruption. For example, if a routine is given the single-element input `[ C2 ]`, it mustn't attempt to read off the end of the source buffer. Routines which require well-formed input must be contracted as such. Chunking APIs (discussed later) must at the very least continue to check for boundary conditions, even if they don't check for other ill-formedness in the sequence.
> I would be much happier if `Utf8Char` and `ReadOnlySpan<Utf8Char>` were guaranteed to be valid always.
Would be nice, but unfortunately it's impractical. It's analogous to `Char`: individual elements are never guaranteed to have any semantic meaning in isolation. It's only when you have a string of them that they mean something.
If we ignore language semantics for now and look only at well-formedness, the sequence `[ D800 ]` is not well-formed, but the sequence `[ D800 DC00 ]` is well-formed. We don't want to prevent construction of the standalone `char` 0xD800, because even though it's ill-formed in isolation it can be well-formed when it exists as part of a larger sequence. Similarly, the `Utf8Char` 0xC2 is ill-formed in isolation, but it can be well-formed when part of the larger sequence `[ C2 80 ]`. We don't want to prevent construction of the building block itself.
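A quick sketch of this distinction using the `Rune` decoding APIs in `System.Text` (inputs chosen only for illustration):

```csharp
using System;
using System.Text;

// Ill-formed in isolation - but only because the rest of the sequence is missing:
ReadOnlySpan<char> loneHighSurrogate = new[] { '\uD800' };
ReadOnlySpan<byte> loneLeadByte = new byte[] { 0xC2 };
Console.WriteLine(Rune.DecodeFromUtf16(loneHighSurrogate, out _, out _));   // NeedMoreData
Console.WriteLine(Rune.DecodeFromUtf8(loneLeadByte, out _, out _));         // NeedMoreData

// Well-formed once the rest of the sequence is present:
ReadOnlySpan<char> utf16Pair = new[] { '\uD800', '\uDC00' };
ReadOnlySpan<byte> utf8Pair = new byte[] { 0xC2, 0x80 };
Rune.DecodeFromUtf16(utf16Pair, out Rune r16, out _);                       // Done
Rune.DecodeFromUtf8(utf8Pair, out Rune r8, out _);                          // Done
Console.WriteLine($"U+{r16.Value:X4} U+{r8.Value:X4}");                     // U+10000 U+0080
```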
We also can't stop anybody from allocating a new `Utf8Char[]`, populating it with garbage, and projecting a `ReadOnlySpan<Utf8Char>` over it. Though to be honest this doesn't really concern me that much. Nothing technically stops people from doing the same thing with `char[]` and `ReadOnlySpan<char>` today, but it's just not something developers really feel compelled to do in practice. `char[]` and `string` factory methods are so much more convenient that developers are more inclined to use those. I see this same thing happening in the UTF-8 world: use the factories and you don't have to worry about any of this.
> Levi, how exactly does this mesh with `Rune`?
`Rune`, on the other hand, corresponds precisely to a Unicode scalar value. The well-formed UTF-8 sequence `[ C2 80 ]` will correspond to the Rune U+0080; the well-formed UTF-16 sequence `[ D800 DC00 ]` will correspond to the Rune U+10000. The transformation is reversible without loss of fidelity.
Consider the following transforms.
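A sketch of one such chain using the `Rune` APIs in `System.Text` (the names `a` through `e` match the description that follows; the concrete input is only illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// a: a well-formed UTF-16 sequence (U+0061, U+0080, U+10000)
char[] a = { 'a', '\u0080', '\uD800', '\uDC00' };

// a -> b: decode the UTF-16 code units into Unicode scalar values
var b = new List<Rune>();
for (ReadOnlySpan<char> rest = a; !rest.IsEmpty; )
{
    Rune.DecodeFromUtf16(rest, out Rune rune, out int consumed);
    b.Add(rune);
    rest = rest.Slice(consumed);
}

// b -> c: encode the scalar values as UTF-8 code units
var c = new List<byte>();
Span<byte> utf8 = stackalloc byte[4];
foreach (Rune rune in b)
    c.AddRange(utf8.Slice(0, rune.EncodeToUtf8(utf8)).ToArray());

// c -> d: decode the UTF-8 code units back into scalar values
var d = new List<Rune>();
for (ReadOnlySpan<byte> rest = c.ToArray(); !rest.IsEmpty; )
{
    Rune.DecodeFromUtf8(rest, out Rune rune, out int consumed);
    d.Add(rune);
    rest = rest.Slice(consumed);
}

// d -> e: encode the scalar values back as UTF-16 code units
var e = new List<char>();
Span<char> utf16 = stackalloc char[2];
foreach (Rune rune in d)
    e.AddRange(utf16.Slice(0, rune.EncodeToUtf16(utf16)).ToArray());

// Because 'a' was well-formed: a == e bit-for-bit, and b == d element-for-element.
Console.WriteLine(a.AsSpan().SequenceEqual(e.ToArray()));   // True
Console.WriteLine(b.SequenceEqual(d));                      // True
```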
At the end of this process, `a` and `e` will be bit-for-bit identical, and `b` and `d` will be element-for-element identical. Additionally, even though `a` and `c` aren't bit-for-bit identical, they can be said to represent the same data (because as demonstrated you can translate back and forth between the two losslessly).

If `a` did not begin as a well-formed sequence, then something would get lost in translation and `a` and `e` would not be bit-for-bit identical at the end.
Because the `Rune` constructor is validating, it is impossible for somebody to construct a bogus `Rune` instance without dropping down to unsafe-equivalent code. That we can assume all instances are well-formed also gives us some decent perf benefits in various methods by allowing us to elide correctness checks.
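Concretely (illustrative values):

```csharp
using System;
using System.Text;

Rune ok = new Rune(0x0080);         // fine: U+0080 is a valid Unicode scalar value
Console.WriteLine($"U+{ok.Value:X4}");
// Rune bad = new Rune(0xD800);     // throws ArgumentOutOfRangeException: surrogate code points are not scalar values
```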
> Am I making sense? In my mind, all the concepts in this discussion are related to the concept of 'input taints', but encoded in the type system rather than naming convention.
That's pretty much accurate. Barring what I mentioned earlier about somebody fabricating a bogus `char[]`, a sequence of `char` means "somebody somewhere said that this buffer represents well-formed UTF-16 data, or at least it's a slice of a larger piece of well-formed UTF-16 data." A sequence of `Utf8Char` ostensibly means the same thing, but for UTF-8.
Thanks Levi, that helps a lot in framing things correctly in my mind. Would it make sense to have an API design convention that the surface area of APIs should generally prefer to expose Runes to consumers and reserve `Utf8Char`/`ReadOnlySpan<Utf8Char>` for `System.Text`? We probably want to align this choice with wherever we think we'll want to ship the UTF-8 string type. It seems as a next step we should have a meeting with the language folks.
It's a shame this didn't get done :(
Are there currently any methods in .NET that would encapsulate this functionality (i.e. stuff like `ToUpper`, `Contains`, equality etc) for UTF-8 `ReadOnlySpan<byte>`? I can't find anything.
(Related: https://github.com/dotnet/corefx/issues/30503)
## Motivations and driving principles behind the `Utf8Char` proposal

`Utf8Char` is analogous to `Char`: they represent a single UTF-8 code unit and a single UTF-16 code unit, respectively. They are distinct from the integral types `Byte` and `UInt16` in that sequences of the UTF-* code unit types are meant to represent textual data, while sequences of the integral types are meant to represent binary data.

Drawing this distinction is important. With UTF-16 data (`String`, `Char[]`), this distinction historically hasn't been a source of confusion. Developers are generally cognizant of the fact that aside from RPC, most i/o involves some kind of transcoding mechanism. Binary data doesn't come in from disk or the network in a format that can be trivially projected as a textual string; it must go through validation, recombining, and substitution. Similarly, when writing a string to disk or the network, a trivial projection is again impossible. The transcoding step must run in reverse to get the text data into the correct binary format expected by i/o.
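For instance, the familiar UTF-16 pattern already routes incoming bytes through exactly such a validation-and-substitution step (a small illustration):

```csharp
using System;
using System.Text;

byte[] fromNetwork = { 0x68, 0x69, 0xC0 };            // "hi" followed by a byte that is never valid in UTF-8
string text = Encoding.UTF8.GetString(fromNetwork);   // validation + substitution: the bad byte becomes U+FFFD
Console.WriteLine(text.Length);                       // 3 ("hi" plus the replacement character)

byte[] toDisk = Encoding.UTF8.GetBytes(text);         // the transcoding step runs in reverse before i/o
Console.WriteLine(toDisk.Length);                     // 5 (U+FFFD encodes as three bytes)
```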
### A brief interlude on conformance and security

There is a key aspect here that is often lost in nuance. The purpose of the transcoding step isn't simply to "shrink" a string of UTF-16 code units into a string of UTF-8 code units (conveniently the same size as octets!) so that it can be blasted across the wire, or vice versa. It is to do so in such a manner that the receiver can reconstruct the original string with full fidelity.
With UTF-8, it is tempting to perform a trivial projection between the binary i/o layer (bytes) and the textual layer (UTF-8 code units). The elemental data types are the same shape, after all, so a reinterpret cast seems legal at first glance. The problem with this design is that at a certain point, one or more components will need to operate on this text. If the text is ill-formed, the components may produce undefined behavior, or they may attempt to fix up the text on-the-fly but may disagree on the final shape of the fixed-up text. This violates the "with full fidelity" aspect mentioned in the previous paragraph.
As a concrete example, consider a web application that blindly treats all incoming form data as `ReadOnlySpan<byte>` and attempts to interpret it as UTF-8. Within the context of this single web application, there may not be a problem with this design. If the buffer contains ill-formed UTF-8, all of the APIs in the web application process might have undefined behavior as they're working with it, but they likely have consistent undefined behavior.

Web applications almost never exist as a single isolated process, however. There is undoubtedly a persistent data store - a database or other backend service. If the web application forwards the `ReadOnlySpan<byte>` (containing ill-formed UTF-8) through to these layers, the backend layers could look at the same sequence of bytes and process them differently. Perhaps Component A is using `varchar(UTF8)` for its backend storage but Component B is using `nvarchar` for its backend storage. There is now a mismatch - a loss of fidelity - between these two systems.

This places us into a somewhat peculiar position with respect to security. We generally think of CVEs as affecting individual frameworks or individual applications, but this underrepresents a class of issues best described as "the API surface leads developers to write applications which appear secure in isolation but which are in fact dangerous when used in conjunction with other applications."
Vulnerabilities can arise from the interplay of components which handle ill-formed UTF-8 sequences differently. These issues tend to go underreported in the public sphere because the attack often must be tailored to a specific deployment or configuration of an application.
### Back to `Utf8Char`

The proposal ultimately is to have `ReadOnlySpan<Utf8Char>` represent well-formed UTF-8 text data as much as possible. This mirrors `ReadOnlySpan<Char>`, which generally represents well-formed UTF-16 text data. In both cases it's possible for a developer to intentionally create ill-formed payloads by creating and populating a `Utf8Char[]` or `Char[]` with garbage and then producing a span over that buffer. But since developers tend not to take such actions intentionally this shouldn't be a problem in practice. The standard way of getting a `ReadOnlySpan<Utf8Char>` from a `ReadOnlySpan<byte>` would be to use a factory that validates (and massages if necessary) the input data. This matches the behavior developers already expect when going from a `byte` sequence to a UTF-16 `char` sequence.

Generally speaking, Framework APIs which operate on `ReadOnlySpan<byte>` as UTF-8 input must not assume the input is well-formed and must have a well-defined behavior if ill-formed UTF-8 is encountered. The API may choose to take any number of actions - throw, return a failure code, perform replacement - as long as the behavior is part of the API contract and the caller understands this contract.
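As a sketch of what such a contract can look like, the `OperationStatus`-based `Utf8.ToUtf16` method in `System.Text.Unicode` lets the caller pick either behavior explicitly (buffer sizes here are illustrative):

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

ReadOnlySpan<byte> input = new byte[] { 0x68, 0x69, 0xC0 };   // "hi" followed by an always-invalid byte
Span<char> output = new char[8];

// Contract choice 1: perform replacement. The bad byte becomes U+FFFD and the call reports Done.
OperationStatus withReplacement = Utf8.ToUtf16(input, output, out _, out int charsWritten,
                                               replaceInvalidSequences: true);
Console.WriteLine($"{withReplacement}: {output.Slice(0, charsWritten).ToString()}");   // Done: hi�

// Contract choice 2: fail. The call reports InvalidData and the caller decides what to do next.
OperationStatus strict = Utf8.ToUtf16(input, output, out _, out _,
                                      replaceInvalidSequences: false);
Console.WriteLine(strict);                                                             // InvalidData
```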
Framework APIs which operate on `ReadOnlySpan<Utf8Char>` should validate the input for well-formedness if such checks do not impose a hardship on the method implementation. There may be certain performance-sensitive routines which cannot incur that cost; such routines may assume the input is well-formed and may have undefined behavior if this invariant is violated, short of that behavior causing an access violation or other runtime corruption. For example, if a routine is given the single-element input `[ C2 ]`, it mustn't attempt to read off the end of the source buffer. Routines which require well-formed input must be contracted as such. Chunking APIs (discussed later) must at the very least continue to check for boundary conditions, even if they don't check for other ill-formedness in the sequence.

For more information on conformance, validation, and the distinction between binary data and textual data, see:
## Projections between `Utf8Char` and `Byte`

The UTF-8 code unit type `Utf8Char` does not attempt to validate its input. For example, constructing a `Utf8Char` from the byte value `C0` creates a `Utf8Char` instance with the value `C0`, even though the Unicode Specification expressly states that `C0` is never a valid value for a UTF-8 code unit. Contrast this with the `Rune` type, whose constructor prohibits creating instances from values outside the valid Unicode scalar range.
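A rough sketch of that contrast; `Utf8Char` never shipped, so the struct below is an illustrative stand-in rather than the proposed API surface:

```csharp
using System;

Utf8Char u = (Utf8Char)0xC0;     // succeeds: the code unit type performs no validation
Console.WriteLine(u);            // C0
// By contrast, new System.Text.Rune(0xD800) throws ArgumentOutOfRangeException,
// because a Rune must always hold a valid Unicode scalar value.

// Illustrative stand-in for the proposed type - not the actual proposed API surface:
readonly struct Utf8Char
{
    private readonly byte _value;
    public Utf8Char(byte value) => _value = value;                          // no validation, by design
    public static explicit operator Utf8Char(byte value) => new(value);
    public static explicit operator byte(Utf8Char value) => value._value;
    public override string ToString() => _value.ToString("X2");
}
```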
It is possible to project (reinterpret cast) a `{ReadOnly}Span<Utf8Char>` to a `ReadOnlySpan<byte>`. This is useful for operations like writing UTF-8 text directly to an i/o pipe.

The projections `Span<Utf8Char> -> Span<byte>` and `{ReadOnly}Span<byte> -> {ReadOnly}Span<Utf8Char>` should also be possible. We do not want to prevent developers from removing any safety rails we provide within the Framework, but we also don't want developers to remove those rails inadvertently. Projections which blur the lines between textual representation and binary representation in a "dangerous" manner should require an affirmative action from the developer. One possible way to get this affirmation is to require use of the existing reinterpret_cast-like API.
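A sketch of what that affirmative action could look like if `MemoryMarshal.Cast` is the reinterpret_cast-like API in question (the `Utf8Char` struct is again an illustrative stand-in):

```csharp
using System;
using System.Runtime.InteropServices;

ReadOnlySpan<byte> utf8Bytes = new byte[] { 0x68, 0x69 };                        // "hi"

// Binary -> "text": the explicit cast is the developer's affirmative claim
// that these bytes really are well-formed UTF-8.
ReadOnlySpan<Utf8Char> text = MemoryMarshal.Cast<byte, Utf8Char>(utf8Bytes);

// Text -> binary: always safe; no information is invented in this direction.
ReadOnlySpan<byte> raw = MemoryMarshal.Cast<Utf8Char, byte>(text);
Console.WriteLine(raw.Length);                                                   // 2

// Same illustrative stand-in as above (one byte, no validation):
readonly struct Utf8Char
{
    private readonly byte _value;
    public Utf8Char(byte value) => _value = value;
    public override string ToString() => _value.ToString("X2");
}
```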
The methods `Span<T>.ToString` and `Memory<T>.ToString` (and their read-only equivalents) will be enlightened for `T = Utf8Char`, just as they're enlightened for `T = char` today. The behavior of the method will be to transcode the data to UTF-16 (with invalid sequence replacement if necessary) and to return the expected `String` instance. This enlightenment will not extend to the case where `T = byte`.
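The existing `T = char` behavior already shows the intended asymmetry (illustration):

```csharp
using System;

ReadOnlySpan<char> chars = new[] { 'h', 'i' };
ReadOnlySpan<byte> bytes = new byte[] { 0x68, 0x69 };

Console.WriteLine(chars.ToString());   // "hi" - char spans stringify their contents
Console.WriteLine(bytes.ToString());   // "System.ReadOnlySpan<Byte>[2]" - byte spans do not
```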
Unlike `Span<T>`, `Memory<T>` instances cannot be projected to a different type `Memory<U>`. This means that there is no way to cast between `Memory<Utf8Char>` and `Memory<byte>` (or their read-only equivalents) on-the-fly.

## Comparing to other languages
In Go 1.x, `string` and `[]byte` are distinct sliceable types. Developers generally use strings to store textual data and byte slices to store binary data. This distinction is sometimes a bit blurry and developers may require external information (documentation, context, method names) to determine exactly what kind of information is represented by the slice, akin to using traditional `char*` pointers in C.

The biggest difference between the two types is that `string` represents truly immutable data (not just an immutable view into mutable data), whereas `[]byte` represents mutable data. Thus there's no trivial projection possible between the two, and any conversion must necessarily be implemented as a copy. There are proposals for readonly slices in a future version of Go, though to the best of my knowledge these proposals have not been approved. If such a feature comes to fruition it seems like a non-copying projection `string -> <readonly> []byte` would be allowed implicitly, but the reverse projection `<readonly> []byte -> string` would still require a copy. (See https://github.com/golang/go/issues/20443 and https://groups.google.com/forum/#!topic/golang-dev/Y7j4B2r_eDw/discussion for further information.)

In Swift, it is possible to create a UTF-* view over any `String` instance. The corresponding types are `String.UTF8View`, `String.UTF16View`, and `String.UTF32View`. These types are specialized text sequence types distinct from normal binary data sequence types; their elemental types of enumeration are `UInt8`, `UInt16`, and `UInt32`, respectively. This means that it is not possible to project `String.UTF8View` and `[UInt8]` between each other trivially; a copy must take place. (See https://developer.apple.com/documentation/swift/string/utf8view for further information.)

## Alternative proposals
### `Utf8Slice`

Instead of introducing a `Utf8Char` type and allowing `ReadOnlySpan<Utf8Char>` to represent a slice of UTF-8 textual data, one could imagine introducing a `Utf8Slice` type which is a thin wrapper around `ReadOnlySpan<Byte>`. Inspection or manipulation methods would operate on this type rather than exist as specialized extensions on `MemoryExtensions`. `Utf8Slice` would be indexable (with `Byte` as the elemental type).

There is some prior art here in that it's similar to how the Go language operates. But this leads to a problem in that `Utf8Slice` instances would be limited in functionality. They'd be immutable, requiring manipulation APIs to bounce through a separate byte sequence and wrap a new `Utf8Slice` around it. We'd have to determine if we'd want a heapable (`ReadOnlyMemory<Byte>`-based) sibling type. There would be confusion as to why there's asymmetry between this and the UTF-16 types. After these and other considerations we're basically reinventing the `Utf8String` proposal, so there's minimal benefit to `Utf8Slice` as proposed here.

### Use `ReadOnlySpan<byte>` for everything

This is tempting from the perspective of a system that wants to treat everything as pass-through as much as possible, but I don't believe it's appropriate from the perspective of a framework. There are two main issues I have with this approach.

The first is that it interferes with the general concept of a type system and makes it more difficult to reason about code. If a developer has a `Byte[]` in their code, they shouldn't need the additional bookkeeping overhead of asking themselves "does this represent binary data like a JPG, or does this represent UTF-8 text?" Text-based extension methods (`Contains`, `ToUpper`, etc.) also shouldn't begin appearing for arbitrary binary data sequences.

The second is that this blurs the line between binary data and textual data, leading to the validation and conformance problems mentioned earlier. I don't want the framework to encourage developers to play fast and loose with this, potentially leading to undefined behavior in their applications. This is still subject to the earlier caveats: power developers should absolutely be able to project the data with minimal fuss, but this should be an affirmative action.
Due to the way the `Utf8Char` type is defined and the presence of implicit conversion operators, mathematical and comparison operators behave as expected. Some examples are given below.
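A hedged sketch of what that could look like; the operator set below (in particular the implicit widening conversion to `int`) is an assumption made for illustration, not the proposed surface:

```csharp
using System;

Utf8Char a = (Utf8Char)0x41;   // 'A'
Utf8Char z = (Utf8Char)0x7A;   // 'z'

bool ordered = a < z;          // comparison flows through the implicit conversion
int distance = z - a;          // arithmetic likewise; 0x7A - 0x41 == 57
Console.WriteLine((ordered, distance));   // (True, 57)

// Illustrative stand-in; the implicit conversion is what makes the operators above "just work".
readonly struct Utf8Char
{
    private readonly byte _value;
    public Utf8Char(byte value) => _value = value;
    public static explicit operator Utf8Char(byte value) => new(value);
    public static implicit operator int(Utf8Char value) => value._value;
}
```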
Sample APIs for operating with `Utf8Char` sequences follow.
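As a rough illustration of the shape such APIs might take - the signatures below are guesses for illustration (the method names echo ones mentioned elsewhere in this write-up), not the proposed surface:

```csharp
using System;
using System.Buffers;

public static class Utf8SpanExtensions
{
    // Well-formed input assumed per the contract discussion above.
    public static bool Contains(this ReadOnlySpan<Utf8Char> span, ReadOnlySpan<Utf8Char> value)
        => throw new NotImplementedException();

    // Chunk-friendly, OperationStatus-based case conversion.
    public static OperationStatus ToUpperInvariant(
        ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination,
        out int elementsConsumed, out int elementsWritten, bool isFinalBlock = true)
        => throw new NotImplementedException();
}

public readonly struct Utf8Char { /* one UTF-8 code unit; see the earlier sketch */ }
```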
Not shown above are UTF-8 overloads for common networking APIs like `IPAddress.Parse(ReadOnlySpan<Utf8Char>)`, etc. It will require additional effort to enumerate the entire list of APIs which should be UTF-8 enlightened as part of this effort. For now this issue focuses mainly on text processing and manipulation APIs.

## Chunking APIs
While `ReadOnlySpan<char>` and `ReadOnlySpan<Utf8Char>` should represent standalone well-formed UTF-* sequences as much as possible, we must recognize that applications which work on discontiguous buffers cannot always guarantee this property. Often an application will be required to chunk a large piece of text into several smaller buffers due to performance considerations. This behavior is seen in existing Framework types like `StringBuilder` and `ReadOnlySequence<T>`.

This chunking could occur such that slice boundaries fall in the middle of a multi-code unit sequence. In these cases the individual chunks may be ill-formed, but the logical concatenation of these chunks represents a well-formed supersequence. A UTF-16 example is the well-formed sequence `[ D808 DF45 ]` chunked into the ill-formed subsequences `[ D808 ]` and `[ DF45 ]`. A UTF-8 example is the well-formed sequence `[ F0 92 8D 85 ]` chunked into the ill-formed sequences `[ F0 ]` and `[ 92 8D 85 ]`. The Framework should provide `OperationStatus`-based APIs as much as possible to enable this scenario.
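The `Utf8.ToUtf16` method in `System.Text.Unicode` shows the `OperationStatus` pattern applied to exactly this boundary case (a sketch; the chunk split mirrors the `[ F0 ]` / `[ 92 8D 85 ]` example above):

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

// U+12345 encoded as UTF-8, split at an inconvenient chunk boundary.
byte[] chunk1 = { 0xF0 };
byte[] chunk2 = { 0x92, 0x8D, 0x85 };

Span<char> output = new char[4];

// First chunk: the caller says more data may follow, so the incomplete trailing
// sequence is reported as NeedMoreData (and left unconsumed) rather than treated as an error.
OperationStatus status1 = Utf8.ToUtf16(chunk1, output, out int bytesRead, out _,
                                       replaceInvalidSequences: false, isFinalBlock: false);
Console.WriteLine((status1, bytesRead));    // (NeedMoreData, 0)

// The caller carries the unconsumed bytes forward, prepends them to the next chunk,
// and finishes the conversion.
byte[] carried = { 0xF0, 0x92, 0x8D, 0x85 };   // leftover from chunk1 + chunk2
OperationStatus status2 = Utf8.ToUtf16(carried, output, out _, out int charsWritten,
                                       replaceInvalidSequences: false, isFinalBlock: true);
Console.WriteLine((status2, charsWritten)); // (Done, 2) - the surrogate pair for U+12345
```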
Of special note is that some text operations cannot be performed in a chunked fashion. APIs like case conversion (`ToUpper`, `ToLower`) and transcoding can be created to allow for chunking, but comparison APIs (`CompareTo`, `StartsWith`) do not allow for chunking. A concrete example of this follows. In this example, chunking will cause `StartsWith` to return a false positive result.
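One way such a false positive can arise is a chunk boundary that splits a combining character sequence; the reconstruction below is an illustration, not necessarily the exact case the proposal had in mind:

```csharp
using System;

// "étude" written with a combining mark: 'e' + U+0301 (combining acute) + "tude",
// chunked between the base character and its combining mark.
string chunk1 = "e";
string chunk2 = "\u0301tude";
string whole  = chunk1 + chunk2;

// Ground truth: the whole text begins with 'é', not with bare 'e'.
Console.WriteLine(whole.Normalize() == "\u00E9tude");                                       // True
Console.WriteLine(string.Compare("e\u0301", "\u00E9", StringComparison.InvariantCulture));  // 0  (canonically equivalent)
Console.WriteLine(string.Compare("e", "\u00E9", StringComparison.InvariantCulture) == 0);   // False

// A chunk-at-a-time StartsWith("e") check only ever sees chunk1. It finds an exact
// match and reports success - a false positive, because the combining mark that
// changes the answer lives at the start of the next chunk.
bool chunkedResult = chunk1.StartsWith("e", StringComparison.InvariantCulture);             // true
Console.WriteLine(chunkedResult);
```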
Since chunking is not unique to UTF-8, the Framework should provide chunking APIs for both UTF-8 and UTF-16 data. The Framework should also provide chunking APIs for transcoding routines. The existing `OperationStatus` type can be utilized for this purpose.