Even though byte / Char8 is the underlying elemental type of Utf8String, none of the APIs outside of the constructor actually take those types as input. The input parameter type for IndexOf and similar APIs is UnicodeScalar, which represents an arbitrary Unicode scalar value and can be 1-4 code units wide when transcoded to UTF-8.

Does that mean

```csharp
var ss = s.Substring(s.IndexOf(','));
```

would be a double traversal? i.e. would any use of IndexOf lead to a double traversal for its return value to be meaningful?
Yes, I know this is dated from the future! :) It's our agenda and review doc for the in-person meeting before it goes to wider community review. Not everything is captured here, especially things related to runtime interaction.
@benaadams No, it's a single traversal, just like if s were typed as System.String in your example. The IndexOf is O(n) up to the first found ',' character (using a vectorized search if available), and the Substring is O(n) from the indexed position to the end of the string. So the total number of bytes observed is index /* IndexOf */ + (Length - index) /* memcpy */ = Length = single traversal.
But if IndexOf is returning the number of UnicodeScalars, which can be 1-4 bytes each, doesn't passing that int return value into Substring mean it has to rescan from the start of the Utf8String to find that start position? i.e. IndexOf isn't returning (int scalarPosition, int byteOffset).
APIs that operate on indices (like IndexOf, Substring, etc.) go by code unit count, not scalar count.

(I get that it might be confusing since enumeration of Utf8String instances goes by scalar, not by code unit, so now we have a disparity on the type. That's why I'd proposed as an open question that maybe we kill the enumerator entirely and just have Bytes and Scalars properties, which removes the disparity.)
Thanks, Levi. Some questions/comments:

> Should be straightforward and O(1) to create a Utf8String instance from an existing String / ReadOnlySpan, or from a ReadOnlySpan coming in from the wire.

I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).

> Must allow querying total length (in code units) as O(1) operation.

I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).

> The five requirements below are drawn from String

This is already making some trade-offs. If I've read the data off the wire, I already have it in some memory, which I can then process as a ReadOnlySpan<byte>. To use it as a Utf8String, I then need to allocate and copy. So we're trading off usability for perf. I'm a bit surprised that's the right trade-off for the target audience, but the doc also doesn't specify who the target developers are, provide example scenarios for where/how this will be used, etc.
> public ReadOnlySpan<byte> Bytes { get; }
> public ReadOnlyMemory<byte> AsMemory();

Why is the to-memory conversion called AsMemory but the to-span conversion called Bytes?
> public bool Contains(UnicodeScalar value);

I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept string or char. For char, even if you add an implicit cast from char to UnicodeScalar, we just had that discussion about not relying on implicit casts from a usability perspective in cases like this. And for string, with the currently defined methods someone would need to actually convert a string to a Utf8String, which is not cheap, in order to call these methods.
> public int IndexOfAny(ReadOnlySpan<UnicodeScalar> value);
> public int LastIndexOfAny(ReadOnlySpan<UnicodeScalar> value);

string.{Last}IndexOfAny calls this argument anyOf.
> public Utf8String ToLowerInvariant();
> public Utf8String ToUpperInvariant();

Presumably Utf8String will have culture support and will also have ToLower/ToUpper methods that are culture-sensitive?
> public int IndexOf(UnicodeScalar value);
> public int IndexOf(UnicodeScalar value, int startIndex);

What does the return value mean? Is it the byte offset of the UnicodeScalar, or is it the number of UnicodeScalars? Similarly for startIndex. Assuming it's the number of UnicodeScalars, if I want to get Bytes and index into it starting at this UnicodeScalar, how do I convert that UnicodeScalar offset to a byte offset?
> Once culture support comes online, we should add CompareTo and related APIs.

From a design discussion perspective, I would think we'd want this outline to represent the ultimate shape we want, and the implementations can throw NotImplementedException until the functionality is available (before it ships).
> public readonly struct UnicodeScalar

What's the plan for integrating this with the existing Unicode support in .NET? For example, how do I get a System.Globalization.UnicodeCategory for one of these?
> public readonly struct Utf8StringSegment

Similar questions to those for the APIs on Utf8String.
And presumably we wouldn't define any APIs (outside of Utf8String/Utf8StringSegment) that accept a Utf8String, instead accepting a Utf8StringSegment, since the former can cheaply convert to the latter but not vice versa?

For me, it also begs the question of why we need both. If we're going to have Utf8StringSegment, presumably that becomes the thing most APIs are written in terms of, because it can cheaply represent both the whole and slices. And once you have that, which effectively has the same surface area as Utf8String, why not just make it Utf8String, still as a struct, and get rid of the class equivalent and the duplication? It can then be constructed from a byte[] or a ReadOnlyMemory<byte> without any extra allocation or copying, can be cheaply sliced, etc. Utf8StringSegment (when named Utf8String) is then essentially a nice wrapper/package for a lot of the functionality that exists in System.Memory as static methods.
> n.b. This type is not pinnable because we cannot guarantee null termination.

I don't see why we'd place this restriction. Arrays don't guarantee null termination but are pinnable. Lots of types don't guarantee null termination but are pinnable.
> // Pass a Utf8String instance across a p/invoke boundary

I would hope that before or as part of enabling this, we add support for Span<T> and ReadOnlySpan<T>. We still have debt to be paid down there and should address that before adding this as well.
> Culture-aware processing code is currently implemented in terms of UTF-16 across all platforms. We don't expect this to change appreciably in the near future, which means that any operations which use culture data will almost certainly require two transcoding steps, making them expensive for UTF-8 data.

I didn't understand this part. Don't both Windows and ICU provide UTF-8-based support in addition to the UTF-16-based support that's currently being used?
> Other stuff

Equivalents for String.Format?
> Don't both Windows and ICU provide UTF8-based support in addition to the UTF16-based support that's currently being used?

Not that I know of. Windows is, with very good legacy reasons, very UTF-16/UCS-2 focused.
What about Equals(string other) or CompareTo(string other)? Seems like not implementing these would make it difficult for existing ecosystems to adopt this type.
ReadOnlySpan<some_type>, and C# should support conveniently creating literals of this span on the stack, e.g. (pseudocode): myString.EndsWith(stackalloc u8"World!"). Currently all the APIs take Utf8String (which allocates) and scalar (which is a single "char", i.e. not super useful).

We use ReadOnlySpan<Char> as a representation of a slice of UTF16 string. You are proposing we use Utf8StringSegment. Is the discrepancy ok?

The signature public Utf8String[] Split(Utf8String separator) implies a lot of allocations and memory copies. First, an array must be allocated for the return value. Then, each element in the array must be a copy of each match into a newly-allocated buffer, as Utf8String mandates null termination but the input will not have nulls after each separator. If I understand this correctly, except for the trivial case when the separator is not present at all, this signature would basically require copying the whole input string.

Would it make sense to return a custom enumerator of Utf8StringSegment instead, similar to SplitByScalarEnumerator or SplitBySubstringEnumerator?
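For illustration, a minimal sketch of what such a non-allocating split enumerator could look like. It is written against ReadOnlySpan<byte> because Utf8StringSegment is only proposed; the type and member names below are hypothetical, and the separator is restricted to a single (ASCII) byte for brevity:

```csharp
using System;

// Hypothetical sketch: walk UTF-8 bytes and yield segments between occurrences
// of a single-byte separator, without allocating an array of substrings.
public ref struct Utf8SplitEnumerator
{
    private ReadOnlySpan<byte> _remaining;
    private readonly byte _separator;
    private bool _done;

    public Utf8SplitEnumerator(ReadOnlySpan<byte> utf8, byte separator)
    {
        _remaining = utf8;
        _separator = separator;
        _done = false;
        Current = default;
    }

    public ReadOnlySpan<byte> Current { get; private set; }

    public Utf8SplitEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_done)
            return false;

        int idx = _remaining.IndexOf(_separator);
        if (idx < 0)
        {
            Current = _remaining;   // last (or only) segment
            _remaining = default;
            _done = true;
        }
        else
        {
            Current = _remaining.Slice(0, idx);
            _remaining = _remaining.Slice(idx + 1);
        }
        return true;
    }
}
```

Usage would then be something like `foreach (ReadOnlySpan<byte> segment in new Utf8SplitEnumerator(utf8Bytes, (byte)',')) { ... }`, with no per-segment allocation.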
I think the biggest issue with the proposed API is confusion between UTF-8 code units and Unicode scalar values, especially when it comes to lengths and indexes. Would it make sense to alleviate that confusion with more explicit names, like ByteLength instead of Length or startByteIndex instead of startIndex?
> [EditorBrowsable(EditorBrowsableState.Never)]
> public static Utf8String DangerousCreateWithoutValidation(ReadOnlySpan<byte> value);

Is EditorBrowsableState.Never the right way to hide dangerous methods? I don't like it, because it means such methods are hard to use, when I think the actual goal is to limit their discoverability, not their usability. Wouldn't putting them into a separate type be a better solution, similar to how dangerous Span APIs were put into the MemoryMarshal type?
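For comparison, a hedged sketch of the MemoryMarshal-style alternative being suggested; the Utf8Marshal type and method placement are hypothetical, not part of the proposal:

```csharp
using System;

// Hypothetical: move the dangerous factory onto a separate marshal-style type
// so it is discoverable on purpose rather than hidden via [EditorBrowsable].
public static class Utf8Marshal
{
    // Caller asserts the bytes are already well-formed UTF-8.
    public static Utf8String CreateWithoutValidation(ReadOnlySpan<byte> value)
        => Utf8String.DangerousCreateWithoutValidation(value);
}
```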
> One potential workaround is to make the JIT recognize a ldstr opcode immediately followed by a newobj Utf8String(string) opcode. This pattern can be special-cased to behave similarly to the standalone ldstr today, where the address of the literal String (or Utf8String) object is known at JIT time and a single mov reg, imm instruction is generated.
Would this mean that if I write new Utf8String("foo"), which would produce the same sequence of opcodes, it might not actually create a new instance of Utf8String? I think that would be very confusing, since it's not how any other type behaves, not even string. It would also be a violation of the C# specification, which says that for a class, new has to allocate a new instance:

> The run-time processing of an object_creation_expression of the form new T(A), […] consists of the following steps:
>
> - If T is a class_type:
>   - A new instance of class T is allocated. […]
What is the relationship between UnicodeScalar and Rune (https://github.com/dotnet/corefx/issues/24093)?
> We can also consider introducing a type StringSegment which is the String-backed analog of this type.

There was an issue about creating StringSegment in corefx, which was closed a month ago, with the justification that ReadOnlyMemory<char> and ReadOnlySpan<char> are good enough: https://github.com/dotnet/corefx/issues/20378. Does that mean it's now on the table again?
> The code comments on the StringSegment type go into much more detail on the benefits of this type when compared to ReadOnlyMemory<T>/ReadOnlySpan<T>.

Where can I find those comments? I didn't find the StringSegment type in any dotnet repo.
More generally, with this proposal we will have: string, char[], Span<char>, ReadOnlySpan<char>, Memory<char>, ReadOnlyMemory<char>, Utf8String, byte[], Span<byte>, ReadOnlySpan<byte>, Memory<byte> and ReadOnlyMemory<byte>. Do we really need Utf8StringSegment as yet another string-like type?
> I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).

Yes, this is a typo.
> I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).

This is possible via Utf8String.Length or Utf8String.Bytes.Length, both of which return the byte count.
> I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept string or char.

I struggled with this, and the reason I ultimately decided not to include them is that I think the majority of calls to these methods involve searching for literal substrings, and I'd rather rely on a one-time compiler conversion of the search target from UTF-16 to UTF-8 than a constantly recurring runtime conversion from UTF-16 to UTF-8. I'm concerned that the presence of these overloads would encourage callers to inadvertently use a slow path that requires transcoding. We can go over this in Friday's discussion.
> What's the plan for integration of [UnicodeScalar] with the existing unicode support in .NET?

I had planned APIs like UnicodeScalar.GetUnicodeCategory() for a future release, but we can go over them in Friday's meeting.
> We use ReadOnlySpan as a representation of a slice of UTF16 string. You are proposing we use Utf8StringSegment. Is the discrepancy ok?

Check the comment at the top of https://github.com/dotnet/corefxlab/blob/utf8string/src/System.Text.Utf8/System/Text/StringSegment.cs. It explains in detail why I think this type provides significant benefits that we can't get simply from using ReadOnlySpan<char>.
> It would also be a violation of the C# specification, which says that for a class, new has to allocate a new instance.

We do violate the specification in a few cases. For instance, new String(new char[0]) returns String.Empty - not a new string that happens to be equivalent to String.Empty, but the actual String.Empty instance itself. Similarly, the Roslyn compiler can sometimes optimize new statements away. See for example https://github.com/dotnet/roslyn/commit/13adbac980ba771d8128449476b6b00021cde203.
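The String.Empty behavior is easy to observe with plain, runnable C# against the existing string type:

```csharp
using System;

class Program
{
    static void Main()
    {
        // The constructor short-circuits for empty input and returns the shared
        // string.Empty instance rather than allocating a new empty string.
        string s = new string(new char[0]);
        Console.WriteLine(ReferenceEquals(s, string.Empty)); // True
    }
}
```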
> What is the relationship between UnicodeScalar and Rune (dotnet/corefx#24093)?

UnicodeScalar is validated: it is contractually guaranteed to represent a value in the range U+0000..U+D7FF or U+E000..U+10FFFF. Scalars have unique transcodings to UTF-8 and UTF-16 code unit sequences. Such transcoding operations are guaranteed always to succeed. Rune (which is not in this proposal) wraps a 32-bit integer which is ostensibly a Unicode code point value but which is not required to be valid. This means that developers consuming invalid Rune instances must be prepared for some operations on those instances to fail.
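For concreteness, the validity condition being described boils down to a range check (a sketch, not the proposed implementation):

```csharp
// A Unicode scalar value is any code point in U+0000..U+10FFFF that is not
// a UTF-16 surrogate (U+D800..U+DFFF).
static bool IsValidUnicodeScalar(int value)
    => (uint)value <= 0x10FFFF && (value < 0xD800 || value > 0xDFFF);
```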
@GrabYourPitchforks

> For instance, new String(new char[0]) returns String.Empty. Not a new string that happens to be equivalent to String.Empty - the actual String.Empty instance itself.

I didn't know that, interesting.

> Similarly, the Roslyn compiler can sometimes optimize new statements away. See for example https://github.com/dotnet/roslyn/commit/13adbac980ba771d8128449476b6b00021cde203.

As far as I can tell, that commit is about Span<T>, which is a struct, so it doesn't violate the C# specification.
> UnicodeScalar is validated: it is contractually guaranteed to represent a value in the range U+0000..U+D7FF or U+E000..U+10FFFF. […] Rune (which is not in this proposal) wraps a 32-bit integer which is ostensibly a Unicode code point value but which is not required to be valid.

That doesn't sound like a good enough reason to have two different types to me, especially since you can create an invalid UnicodeScalar. Maybe the two groups could work together to create a single type for representing Unicode scalar values?
> As far as I can tell, that commit is about Span, which is a struct, so it doesn't violate the C# specification.

new byte[] { ... } isn't a struct type. :)
> That doesn't sound like a good enough reason to have two different types to me

This proposal assumes that Rune is never committed, so there's only one type in the end.
I see that it's already committed, but can I just go on record as saying that UnicodeScalar is just a plain terrible name? It really is. It's long, it's generic enough to mean nearly nothing, and it is not even a term the Unicode group uses. I had the same complaints about Rune (with the exception that Rune is at least short).

This type really ought to be named Character or CodePoint.

I'm mostly OK with the rest of it, though it would be nice if .Split didn't have to allocate quite as much. The underlying data is already read-only - can't Span<T> be used here or something?
@whoisj

> I see that it's already committed, but can I just go on record as saying that UnicodeScalar is just a plain terrible name? It really is. It's long, it's generic enough to mean nearly nothing, and it is not even a term the Unicode group uses.

"Unicode Scalar Value" is the term Unicode uses for this.

> This type really ought to be named Character or CodePoint.

"Character" doesn't really mean anything (Unicode lists 4 different meanings) and would be easily confused with System.Char/char.

"Code Point" is closer, but that term also includes the surrogate range (U+D800 to U+DFFF), which are not valid Unicode Scalar Values.
> It's long, ...

The question is, what would the C# keyword be? (Int32 vs int); something like uchar is short 😉 or nchar to match databases.
> The question is, what would the C# keyword be? (Int32 vs int); something like uchar is short 😉 or nchar to match databases

This. Will there be a language keyword for the type? If there is, you can call the type ThatUnicodeValueWhichNobodyCouldAgreeOnAGoodNameForSoThisIsIt for all I care. I vote for c8, but I also like Rust. Keeping C# in mind, uchar seems like the no-brainer to me.
@svick yeah, I know that "character" is nearly meaningless, hence my suggesting it. I prefer "code point" because how on Earth are you going to prevent me from writing invalid values to a UnicodeScalar's memory? Preventing unsafe is a recipe for a performance disaster; and making unsafe (the real meaning of the word) assumptions about what values a block of memory can contain will lead to fragile and exploitable software design.
> how on Earth are you going to prevent me from writing invalid values to a UnicodeScalar's memory?

Nobody's stopping you. In fact, there's a public static factory that skips validation and allows you to create such an invalid value. But if you do this you're now violating the contractual guarantees offered by the type, and I'd recommend not doing it. :)

To be clear, creating an invalid UnicodeScalar won't AV the process or anything quite so dire. But it could make the APIs behave in very strange and unexpected manners, leading to errors on the consumption side. For example, UnicodeScalar.Utf8SequenceLength could return -17 if constructed from invalid input. Such are the consequences of violating invariants.
Unlike the UnicodeScalar type, the Utf8String type specifically does not offer a contractual guarantee that instances of the type contain only well-formed UTF-8 sequences.
> In fact, there's a public static factory that skips validation and allows you to create such an invalid value.

Sure, great, but a lot of the data being read into these structures will be coming from external sources. Very happy to hear that there are no validation steps being taken as the data is read in (because it would be horribly expensive), but still very concerned about:

> But if you do this you're now violating the contractual guarantees offered by the type

BUT there is no guarantee - you've said so in your previous statement. There's an assumption, but no guarantee; so let's be careful how we describe this.
The Utf8String and UnicodeScalar types make different contractual guarantees. I'll try to clarify them.

The Utf8String type encourages but does not require the caller to provide it a string consisting of only valid UTF-8 sequences. All APIs hanging off it have well-defined behaviors even in the face of invalid input. For example, enumerating scalars over an ill-formed Utf8String instance will return U+FFFD when an invalid subsequence is encountered. (Not just that, but the number of bytes we skip in the face of an invalid subsequence is also well-defined and predictable.) This extends to ToUpperInvariant() / ToLowerInvariant() and other manipulation APIs. Their behavior is well-defined even in the face of invalid input.

Exception: If you construct a Utf8String instance and use unsafe code or private reflection to manipulate its data after it has been constructed, the APIs have undefined behavior.

The UnicodeScalar type requires construction from a Unicode scalar value. The API behavior is only well-defined when the instance itself is well-formed. If the caller knows ahead of time that the value it's providing is valid, it can call the "skip validation" factory method. If the instance members of a UnicodeScalar instance misbehave, it means that the caller who originally constructed it violated an invariant at construction time.

The reason for the difference is that it's going to be common to construct a Utf8String instance from some unknown data coming in over i/o. It's not common to construct a UnicodeScalar instance from arbitrary data. Instances of this type are generally constructed by enumerating over UTF-8 / UTF-16 data, and significant bit twiddling needs to happen during enumeration anyway in order to transcode the original data stream into a proper scalar value. Detection of invalid subsequences would necessarily need to occur during enumeration, which means the caller already has the responsibility of fixing up invalid values. The "skip validation" factory is simply a convenience for callers who have already performed this fixup step to avoid the additional validation logic in hot code paths.
So when I use the term "contractual guarantee", it's really shorthand for "This API behaves as expected as long as the caller didn't do anything untoward while constructing the instance. If the API misbehaves, take it up with whoever constructed this instance, as they expressly ignored the overloads that tried to save them from themselves and went straight for the 'I know what I'm doing' APIs."

FWIW, the reason for this design is that consumers of these types don't have to worry about any of this. Just call the APIs like normal and trust that they'll give you sane values. If you take a UnicodeScalar as a parameter to your method, you don't need to perform an IsValid check on it before you use it. Rely on the type system's enforcement to have prohibited the caller from even constructing a bad instance in the first place. (Modulo the caller doing something explicitly marked as dangerous, of course.)

This philosophy is different from the Rune proposal, where if you take a Rune as a parameter into your method, you need to perform an IsValid check as part of your normal parameter validation logic since there's otherwise no guarantee that the type was constructed correctly.
I suppose those are safe-enough trade-offs. Still, too bad the name has to be so unwieldy. :man_shrugging:
The name doesn't have to be unwieldy. If there's consensus that it should be named Rune or similar, I'll relent on the naming. :)
We should not call it a "Rune" if it's not a representation for the Unicode Code Point, i.e. let's not hijack a good unambiguous term and use it for something else.
ᚺᛖᛚᛚᛟ᛫ᚹᛟᚱᛚᛞ Are you sure about Rune? It's a Unicode block, after all. Maybe rather a grapheme?
I think in graphemics (branch of science studying writing) rune is indeed a grapheme. I think in software engineering, rune is a code point. But possibly it's not such a clear cut as I think. The point I was trying to make is using "rune" to mean Unicode Scalar would be at least yet another overload of the word "rune".
Thank you for clarifying that.
The UnicodeScalar is a more sane type to use than char, as it always encompasses a full character rather than potentially half a character, which char can do.

However, while UnicodeScalar is fine for a library type, it isn't great for common usage for people that don't like var, as you'd go from the less correct

```csharp
foreach (char ch in str)
{
    // ...
}
```

to the more correct

```csharp
foreach (UnicodeScalar scalar in ustr)
{
    // ...
}
```

which is a less desirable, very verbose code representation for a single character.
Aside from reusable library developers, I don't think many people will be using (referring to) UnicodeScalar. Most people will create or get an instance of Utf8String and use the instance methods on this type, and most of those methods either don't use UnicodeScalar or will be called by passing a literal ('a' or 65). If somehow the type becomes super popular, we can think about adding a language alias.
I think you might be underestimating the usage...

```csharp
ReadOnlySpan<UnicodeScalar> values = new UnicodeScalar[] { '😊', '😎', '😥', '🎄' };
int index = str.IndexOfAny(values);
```

Or iterating over chars to determine what emoji range the character fits into for "sentiment" analysis.
Yeah, my teenage daughter also thinks I underestimate the value and usage of emojis :-) And I think you both might be right :-)
Unless you are doing interop, Utf8String is probably going to be the go-to string type. It's smaller when ASCII, deals with whole characters, and is more efficient when transferring on the wire (as it's already in the correct format, so it needs no transcoding).

Whereas string is twice the size when ASCII and deals in half characters; you have to worry about byte order; and often you have to repeatedly transcode to UTF-8. In my experience people just ignore the half-character issue, so most current string handling is wrong*. Using Utf8String means most string handling would be correct by default.

So I think use of UnicodeScalar may be higher than anticipated.

*From my limited observations
> However, while UnicodeScalar is fine for a library type, it isn't great for common usage for people that don't like var, as you'd go from the less correct

QFT. This is my issue; @benaadams has hit the nail on the head here. Fairly sure we cannot expect char and string to be co-opted by UnicodeScalar and Utf8String, but it would be lovely if they could be.
Disagree with the idea that this won't be used extensively: it will be used everywhere a char is used today. Almost every use of Char needs to be replaced by this. You can search for 'char' in any .NET code base to get an idea of how often people use this primitive type.

char is essentially broken for proper processing; we only survive because most people brush this off as "something went wrong on a corner case". Basically, .NET today encourages subtly broken code by default which you can make correct with a lot of work. We need to strive to create an environment where we make it easy for people to write correct code from the start, for them to fall into the pit of success.

Long names like 'UnicodeScalar' are just going to prevent people from embracing it by default. The notion that "we will wait and see if there is demand" is a self-fulfilling prophecy of failure.

Research-wise, just grep for 'rune' in the Go codebase to see how misguided the idea is that this is a rare type. Go here ensures that any beginner gets correct code from the start. We ended up in the other extreme.
> Yeah, my teenage daughter also thinks I underestimate the value and usage of emojis :-)

Windows even has an emoji keyboard...

We've seen a huge rise in the use of emoji and the odder variants of Unicode (script, upside-down letters, lookalike chars, etc.) in our consumer-focused applications. Initially we were resistant to it, but at this point it's basically time to wholly embrace it as just the way things are.

So we will be moving to Utf8String for everything when it's available (other than system calls), and using it for all new projects.

Also, ASP.NET Core should consider moving precompiled Razor pages to be Utf8String-based, rather than string-based and going through Encoding.UTF8 for every static string in every page request.
Also, all these methods are broken for supplementary-plane chars/emoji (see the snippet after the list for a concrete example):

```csharp
partial class String
{
    bool Contains(char ...);
    int IndexOf(char ...);
    int IndexOfAny(char[] ...);
    bool EndsWith(char ...);
    string Join<T>(char, IEnumerable<T>);
    string Join(char, ...);
    int LastIndexOf(char, ...);
    int LastIndexOfAny(char, ...);
    string PadLeft(int, char);
    string PadRight(int, char);
    string Replace(char, char);
    string[] Split(char ...);
    string[] Split(char[] ...);
    bool Trim(char);
    bool Trim(char[]);
    bool TrimEnd(char);
    bool TrimEnd(char[]);
    bool TrimStart(char);
    bool TrimStart(char[]);
}
```
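For example, here's the kind of breakage being described; '😀' is U+1F600, which UTF-16 stores as the surrogate pair D83D DE00:

```csharp
using System;

string s = "a😀b";                                    // chars: 'a', '\uD83D', '\uDE00', 'b'
Console.WriteLine(s.IndexOfAny(new[] { '\uD83D' }));  // 1 - matches half a character
string[] parts = s.Split('\uD83D');                   // splits mid-character
Console.WriteLine(parts[1]);                          // "\uDE00b" - starts with an unpaired low surrogate
```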
So should the following additional overloads be added to regular string?

```csharp
partial class String
{
    bool Contains(UnicodeScalar ...);
    int IndexOf(UnicodeScalar ...);
    int IndexOfAny(ReadOnlySpan<UnicodeScalar> ...);
    bool EndsWith(UnicodeScalar ...);
    string Join<T>(UnicodeScalar, IEnumerable<T>);
    string Join(UnicodeScalar, ...);
    int LastIndexOf(UnicodeScalar, ...);
    int LastIndexOfAny(UnicodeScalar, ...);
    string PadLeft(int, UnicodeScalar);
    string PadRight(int, UnicodeScalar);
    string Replace(UnicodeScalar, UnicodeScalar);
    string[] Split(UnicodeScalar ...);
    string[] Split(ReadOnlySpan<UnicodeScalar> ...);
    bool Trim(UnicodeScalar);
    bool Trim(ReadOnlySpan<UnicodeScalar>);
    bool TrimEnd(UnicodeScalar);
    bool TrimEnd(ReadOnlySpan<UnicodeScalar>);
    bool TrimStart(UnicodeScalar);
    bool TrimStart(ReadOnlySpan<UnicodeScalar>);
}
```
Or would they go via an implicit conversion to string and the string overloads? Or would the C# compiler change the embedded type from UnicodeScalar to string depending on what overload was available at the call site?

This would have been popular/memed recently for some unknown reason:

```csharp
string.Join('👏', words)
```
An example of why we should use UnicodeScalar is the C# compiler itself. It has a bug in handling surrogate pairs:

```csharp
class Program
{
    static void Main()
    {
        // Error CS1056 Unexpected character
        int 𩸽 = 2; // CJK Extension B
        int 𒀀 = 3; // Cuneiform
        int 𓀀 = 5; // Egyptian Hieroglyph
        System.Console.WriteLine(𩸽 * 𒀀 * 𓀀);
    }
}
```

The Unicode category of these characters is Lo (Other Letter), and Lo characters can be used in identifiers per the C# spec. Many languages other than C# handle surrogate pairs correctly.

Go:

```go
package main

import "fmt"

func main() {
    𩸽 := 2
    𒀀 := 3
    𓀀 := 5
    fmt.Println(𩸽 * 𒀀 * 𓀀)
}
```

Java:

```java
public class HelloWorld
{
    public static void main(String[] args)
    {
        int 𩸽 = 2;
        int 𒀀 = 3;
        int 𓀀 = 5;
        System.out.print(𩸽 * 𒀀 * 𓀀);
    }
}
```
Let me clarify: I definitely don't think we should be using char or any such type in Utf8String APIs. The APIs need to be 100% reliable, and UnicodeScalar is the only way to do it, i.e. I am a big fan of UnicodeScalar. What I was saying is that I don't think users will have to refer to the type often; e.g. utf8String.Split('a') would call a method taking UnicodeScalar, and the C# compiler would target-type the parameter, i.e. there would be no char created and then converted to UnicodeScalar; C# would create the scalar value directly.

[Edit]: Let me also motivate why I even make the claim: some people commented that UnicodeScalar is a long name and that we need a language alias for it. I think the type will not be referred to often enough to justify adding a language alias, at least not initially. If we discover that people refer to it all the time and it becomes the bread and butter of C# programming, we can always add an alias later.
> The APIs need to be 100% reliable, and UnicodeScalar is the only way to do it

Agreed on the concept that a 32-bit Unicode scalar type needs to be present, and the new string types (notice: not string itself) need to utilize that and not the 16-bit char type.

> I think the type will not be referred to often enough to justify adding a language alias, at least not initially.

I disagree. I tend to have code littered with char[] and const char declarations. I just opened a random project and did a search for \bchar\b, and in 72 files I got 155 hits. Having to type / look at UnicodeScalar in place of all of those char keywords would be... uh... less than optimal, yes - let's put it that way because it sounds pleasant and professional: "less than optimal".
Since the question of graphemes came up, I'll mention that we've been punting on the idea of having graphemes as a first-class citizen in the framework. (By "grapheme", I mean interpreting the 2-scalar sequence [ U+1F474 Older Man Emoji, U+1F3FF Fitzpatrick Type 6 Skin Modifier ] as the single grapheme "older man with Fitzpatrick type 6 skin".) The reason we had been punting on this is that it tends to be more of a UI / text editor concern - not a general framework concern - and there is the existing TextElementEnumerator type if you're willing to pull up your sleeves.

@migueldeicaza, I think you had some early thoughts on this a while back. Has your thinking changed on what kind of support we should have in-box for this? Is this really a concern for the BCL, or does it properly belong in a separate package?
Some other feedback now that I'm rereading this thread.
@benaadams - I agree that we should add UnicodeScalar-accepting APIs to System.String if there's a need to do so. If the majority use case is that developers are searching for literals, the existing APIs like String.IndexOf("<my multi-char emoji>") already work just fine.

@whoisj - Regarding your const char fields, what if the compiler could implicitly convert char to UnicodeScalar? That way your call site would look like myUtf8String.IndexOf(my_const_char). We can also consider adding an explicit conversion operator between the two types.
Some comments on the API from last week:

- Given that it should be possible to create Utf8Strings from invalid data, we need a way of returning a value that indicates that there was an error processing the UTF-8 sequence in the buffer, and that this was caused by that error. NStack has a port of Go's libraries that do this.
- It is not clear why the length in runes should be O(1); it seems like a waste of space, especially considering that iterating over the values is not O(1) anyway.
- When processing Utf8Strings you really want to have access to the byte length; that is missing.

I would add a few things:

- Utf8String to Rune array (spit it out as an array of Runes, or an IList).
- It is not clear what Length is - whether it is the number of bytes in the buffer or the number of runes in it. Intuitively it is the number of bytes; if so, we should add a RuneCount property that returns the number of runes.
- There is a proposal for UnicodeScalar; you should lift the operations I submitted before onto a better-named type, Rune, which has a comprehensive API that I have been using for a while.

I don't think that it is a good idea to limit UnicodeScalar to valid values; I think you should instead have an IsValid method.

My feeling is that if you want a string with training wheels we have System.String already - but a case should be made for a System.UnicodeString that is made up of 32-bit runes and has O(1) indexing capabilities.

One nice capability of the Go API is that enumerating over the runes in the string is not limited to obtaining the individual runes, but also yields the offset where each rune was found. In NStack, I have a similar method that returns a tuple (int index, Rune rune) to achieve this.
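As an illustration of that Go-style enumeration pattern, a hedged sketch in C#: the extension below is hypothetical and simply walks existing UTF-16 string data with char.ConvertToUtf32, yielding each code point together with its starting index.

```csharp
using System;
using System.Collections.Generic;

static class CodePointExtensions
{
    // Hypothetical helper: yields (index, codePoint) pairs, where index is the
    // offset (in UTF-16 code units here) at which the code point starts.
    public static IEnumerable<(int Index, int CodePoint)> EnumerateCodePoints(this string s)
    {
        for (int i = 0; i < s.Length; )
        {
            int cp = char.ConvertToUtf32(s, i);          // throws on unpaired surrogates
            yield return (i, cp);
            i += char.IsSurrogatePair(s, i) ? 2 : 1;
        }
    }
}
```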
See also the UTF-8 string scenario and design philosophy document (https://github.com/dotnet/corefxlab/issues/2368).
There's some confusion over the runtime complexity of the Utf8String length APIs. Fetching the code unit count ("byte length", if you will) is O(1) complexity. The Utf8String type internally doesn't keep track of the total number of scalars ("runes", if you will), and there's no API on Utf8String to fetch this count. There are other APIs which will give you this information, but the proposal here doesn't expose them on Utf8String.

I've added the missing APIs to the Tuesday review. Thanks for the eagle eyes @migueldeicaza!
As a note, it seems to me that Utf8String.Length ought to be the byte length of the underlying array (it is an array, isn't it? - or are we considering using linked buffers or something - it doesn't really matter).

Given that Utf8String values are likely immutable, and therefore the count of characters cannot be cached in the type after initialization, and nobody in their right mind would suggest computing the actual character count of a Utf8String at allocation, we likely need a .Count() method which does the calculation when invoked.

The Utf8String.Enumerator implementation ought to be interesting. There are a lot of ways to do this, none of which are particularly pleasant. 😕

@GrabYourPitchforks if you're looking for any parallel implementations (aside from the stuff by @migueldeicaza) let me know and I can send you a few links to internal source we use to handle Git strings (Git is UTF-8 through-and-through) - ironically, we called our type StringUtf8 😏

Also, if you're just plain sick of my feedback let me know and I'll go sulk in a corner quietly :grin:
As a complete and mostly unrelated side note: I've always hated the String APIs which take StringComparison. I would so incredibly rather every API took a StringComparer implementation, and we could leave the annoying and mostly useless StringComparison enumeration in the bin.

Any chance we can avoid dragging it into the future via this API set? 🙏 :bow: 🙏
> My feeling is that if you want a string with training wheels we have System.String already - but a case should be made for a System.UnicodeString that is made up of 32-bit runes and has O(1) indexing capabilities.

On this topic, I cannot count the number of times that a p/invoke call to some library has returned corrupted or invalid string values; and since System.String lacks any ability to self-validate, I always end up adding validation routines to the project. Seems like something we ought to avoid in the future, à la @migueldeicaza's recommendation (this is what I was trying, and failing, to illuminate above).
Oh, and in case it needs to be said: much ❤️ and 🙇 for @GrabYourPitchforks for even working on this API. It is long overdue and is such a hot topic; it's pure heroism to work on it. 😃
@whoisj By self-validating a System.String instance, you mean looking for mismatched surrogate pairs? I considered adding validation APIs to both Utf8String and String in this project but ultimately decided against it for a few reasons. I didn't want developers to feel obligated to call them before consuming the instances. And for String in particular, it's generally very difficult to create a malformed instance in the first place without bit-twiddling a char[]. If you're running into the need to validate in a production app I'm certainly willing to reconsider those decisions.
@whoisj What's your concern with StringComparison? Are your scenarios working with specific cultures rather than the invariant culture or the current thread's culture?

@migueldeicaza Interestingly, we considered a fully UTF-32 string type a few weeks ago, and I don't think it's a crazy idea. The primary scenario we came up with was a text editor or other UI-based application. I think if we wanted to give that scenario proper respect we'd also want to consider grapheme representation in the framework and plumb it through as an in-box concept. Do you think server applications might need this in addition to UI applications? (As an aside, C++'s std::wstring on most non-Windows platforms is UTF-32, and it seemingly doesn't enjoy wide use.)
> @whoisj What's your concern with StringComparison? Are your scenarios working with specific cultures rather than the invariant culture or the current thread's culture?

More often than not, I work on library code that interops with external software. I'm primarily concerned with Ordinal and OrdinalCaseInsensitive; very, very rarely do I need to care about culture.

Often I need to produce custom string comparers, and when this happens every entry point on string that takes a StringComparison becomes useless to me and I end up re-implementing them.

What kind of custom string comparer could one be writing that isn't provided by NetFx? Well, several projects I've worked on recently needed custom file-system path comparers, with the comparer chosen based on conditions. For example, on Windows, paths that do not begin with "\\?\" treat '/' and '\' interchangeably, so a custom comparer is necessary. I could just write

```csharp
OrdinalCaseInsensitive.Equals(lhsPath.Replace('/', '\\'), rhsPath.Replace('/', '\\'))
```

but we literally compare thousands of paths in certain cases, and thrashing the heap with needless string allocations is really terrible. So instead, I have chunking logic which breaks on path separators... blah, blah, blah.

Now consider the logic necessary for something like bool IsChildPath(string parentPath, string childPath). Internally I could use string.StartsWith(...), but it doesn't accept a StringComparer; instead it takes a StringComparison. That leaves me to author my own static bool StartsWith(this String value, StringComparer comparer) extension method. If System.String.StartsWith accepted a StringComparer, life would just be better.
> @whoisj By self-validate a System.String instance, you mean looking for mismatched surrogate pairs?

Exactly.

> I didn't want developers to feel obligated to call it before consuming the instances.

Developers should not feel obligated to use a validation API, but the lack of one can cause heartburn. Consider the developer who is writing software that reads data from a stream or shared memory. There's always a chance something got corrupted, so having a built-in way to validate the data would be rather useful. Perhaps developers writing code like this are rare enough that NetFx doesn't need it, in which case I'll continue to keep writing my own. 😁
> ... we considered a fully UTF-32 string type a few weeks ago, and I don't think it's a crazy idea. Do you think server applications might need this in addition to UI applications?

I can see utility in an indexable string type, but it'll be very specialized. I'd much rather see the work you're doing here stay the focus. UTF-8 as the internal encoding for character data is extremely valuable, especially when memory isn't plentiful and cheap.

... oh, as an aside - are there going to be Utf8StringComparer types provided? If so, have you thought about the implementation details yet?
Utf8String design discussion - last edited 14-Sep-19
Utf8String design overview
Audience and scenarios
Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or i/o in general, really). Currently applications spend some amount of time transcoding into formats that aren't particularly useful, which wastes CPU cycles and memory.

A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular byte[] instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant();. This defeats the purpose of using a typed language.

We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a full drop-in replacement for its sibling type string. For example, we might add Utf8String-related overloads to existing APIs in the System.IO namespace, but we wouldn't add an overload Assembly.LoadFrom(Utf8String assemblyName).

In addition to networking and i/o scenarios, it's expected that there will be an audience who will want to use Utf8String for interop scenarios, especially when interoperating with components written in Rust or Go. Both of these languages use UTF-8 as their native string representation, and providing a type which can be used as a data exchange type for that audience will make their scenarios a bit easier.

Finally, we should afford power developers the opportunity to improve their throughput and memory utilization by limiting data copying where feasible. This doesn't imply that we must be allocation-free or zero-copy for every scenario. But it does imply that we should investigate common operations and consider alternative ways of performing these tasks as long as it doesn't compromise the usability of the mainline scenarios.

It's important to call out that Utf8String is not intended to be a replacement for string. The standard UTF-16 string will remain the core primitive type used throughout the .NET ecosystem and will enjoy the largest supported API surface area. We expect that developers who use Utf8String in their code bases will do so deliberately, either because they're working in one of the aforementioned scenarios or because they find other aspects of Utf8String (such as its API surface or behavior guarantees) desirable.

Design decisions and type API
To make internal Utf8String implementation details easier, and to allow consumers to better reason about the type's behavior, the Utf8String type maintains the following invariants:

- Instances are immutable. Once data is copied to the Utf8String instance, it is unchanging for the lifetime of the instance. All members on Utf8String are thread-safe.
- Instances are heap-allocated. This is a standard reference type, like string and object.
- The backing data is guaranteed well-formed UTF-8. It can be round-tripped through string (or any other Unicode-compatible encoding) and back without any loss of fidelity. It can be passed verbatim to any other component whose contract requires that it operate only on well-formed UTF-8 data.
- The backing data is null-terminated. If the Utf8String instance is pinned, the resulting byte* can be passed to any API which takes an LPCUTF8STR parameter. (Like string, Utf8String instances can contain embedded nulls.)

These invariants help shape the proposed API and usage examples as described throughout this document.
Non-allocating types
While Utf8String is an allocating, heap-based, null-terminated type, there are scenarios where a developer may want to represent a segment (or "slice") of UTF-8 data from an existing buffer without incurring an allocation.

The Utf8Segment (alternative name: Utf8Memory) and Utf8Span types can be used for this purpose. They represent a view into UTF-8 data, with the following guarantees:

These types have Utf8String-like methods hanging off of them as instance methods where appropriate. Additionally, they can be projected as ROM<byte> and ROS<byte> for developers who want to deal with the data at the raw binary level or who want to call existing extension methods on the ROM and ROS types.

Since Utf8Segment and Utf8Span are standalone types distinct from ROM and ROS, they can have behaviors that developers have come to expect from string-like types. For example, Utf8Segment (unlike ROM<char> or ROM<byte>) can be used as a key in a dictionary without jumping through hoops, and Utf8Span instances can be compared against each other, as sketched below.
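A hedged sketch of that usage, written against the proposed types in this document (so not compilable today); member names such as OrdinalIgnoreCase, AsSegment, and AsSpan are assumptions:

```csharp
// Utf8Segment as a dictionary key, with hashing/equality supplied by Utf8StringComparer.
var headers = new Dictionary<Utf8Segment, int>(Utf8StringComparer.OrdinalIgnoreCase);
Utf8String name = new Utf8String("Content-Length");
headers[name.AsSegment()] = 42;

// Utf8Span instances compared directly against each other.
Utf8Span a = name.AsSpan();
Utf8Span b = new Utf8String("content-length").AsSpan();
bool equal = Utf8StringComparer.OrdinalIgnoreCase.Equals(a, b);   // true
```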
An alternative design that was considered was to introduce a type Char8 that would represent an 8-bit code unit - it would serve as the elemental type of Utf8String and its slices. However, ReadOnlyMemory<Char8> and ReadOnlySpan<Char8> were a bit unwieldy for a few reasons.

First, there was confusion as to what ROS<Char8> actually meant when the developer could use ROS<byte> for everything. Was ROS<Char8> actually providing guarantees that ROS<byte> couldn't? (No.) When would I ever want to use a lone Char8 by itself rather than as part of a larger sequence? (You probably wouldn't.)

Second, it introduced a complication that if you had a ROM<Char8>, it couldn't be converted to a ROM<byte>. This impacted the ability to perform text manipulation and then act on the data in a binary fashion, such as sending it across the network.

Creating segment types
Segment types can be created safely from Utf8String backing objects. As mentioned earlier, we enforce that data in the UTF-8 segment types is well-formed. This implies that an instance of a segment type cannot represent data that has been sliced in the middle of a multi-byte boundary. Calls to slicing APIs will throw an exception if the caller tries to slice the data in such a manner.

The Utf8Segment type introduces additional complexity in that it could be torn in a multi-threaded application, and that tearing may invalidate the well-formedness assumption by causing the torn segment to begin or end in the middle of a multi-byte UTF-8 subsequence. To resolve this issue, any instance method on Utf8Segment (including its projection to ROM<byte>) must first validate that the instance has not been torn. If the instance has been torn, an exception is thrown. This check is O(1) algorithmic complexity.

It is possible that the developer will want to create a Utf8Segment or Utf8Span instance from an existing buffer (such as a pooled buffer). There are zero-cost APIs to allow this to be done; however, they are unsafe because they easily allow the developer to violate invariants held by these types.

If the developer wishes to call the unsafe factories, they must ensure that the following three invariants hold:

1. The provided buffer (ROM<byte> or ROS<byte>) remains "alive" and immutable for the duration of the Utf8Segment or Utf8Span's existence. Whichever component receives a Utf8Segment or Utf8Span - however the instance has been created - must never observe that the underlying contents change or that dereferencing the contents might result in an AV or other undefined behavior.
2. The provided buffer contains only well-formed UTF-8 data, and the boundaries of the buffer do not split a multi-byte UTF-8 sequence.
3. For Utf8Segment in particular, the caller must not create a Utf8Segment instance wrapped around a ROM<byte> in circumstances where the component which receives the newly created Utf8Segment might tear it. The reason for this is that the "check that the Utf8Segment instance was not torn across a multi-byte subsequence" protection is only reliable when the Utf8Segment instance is backed by a Utf8String. The Utf8Segment type makes a best effort to offer protection for other backing buffers, but this protection is not ironclad in those scenarios. This could lead to a violation of invariant (2) immediately above.

The type design here - including the constraints placed on segment types and the elimination of the Char8 type - also draws inspiration from the Go, Swift, and Rust communities.

Supporting types
Like StringComparer, there's also a Utf8StringComparer which can be passed into the Dictionary<,> and HashSet<> constructors. This Utf8StringComparer also implements IEqualityComparer<Utf8Segment>, which allows using Utf8Segment instances directly as the keys inside dictionaries and other collection types.

The Dictionary<,> class is also being enlightened to understand that these types have both non-randomized and randomized hash code calculation routines. This allows dictionaries instantiated with TKey = Utf8String or TKey = Utf8Segment to enjoy the same performance optimizations as dictionaries instantiated with TKey = string.

Finally, the Utf8StringComparer type has convenience methods to compare Utf8Span instances against one another. This will make it easier to compare texts using specific cultures, even if that specific culture is not the current thread's active culture.

Manipulating UTF-8 data
CoreFX and Azure scenarios
- What exchange types do we use when passing around UTF-8 data into and out of Framework APIs?
- How do we generate UTF-8 data in a low-allocation manner?
- How do we apply a series of transformations to UTF-8 data in a low-allocation manner? Leave everything as Span<byte>, use a special Utf8StringBuilder type, or something else?
- Do we need to support UTF-8 string interpolation?
- If we have builders, who is ultimately responsible for lifetime management? Perhaps we should look at ValueStringBuilder for inspiration.
- A MutableUtf8Buffer type would be promising, but we'd need to be able to generate Utf8Span slices from it, and if the buffer is being modified continually the spans could end up holding invalid data.
- Some folks will want to perform operations in-place.
Sample operations on arbitrary buffers
(Devs may want to perform these operations on arbitrary byte buffers, even if those buffers aren't guaranteed to contain valid UTF-8 data.)

- Validate that the buffer contains well-formed UTF-8 data.
- Convert ASCII data to upper / lower case in-place, leaving all non-ASCII data untouched (see the sketch after this list).
- Split on byte patterns. (Probably shouldn't split on runes or UTF-8 string data, since we can't guarantee the data is well-formed UTF-8.)
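As referenced in the list above, a minimal sketch of the ASCII-only in-place case conversion - plain runnable C# over a Span<byte>, not the proposed Utf8 static class itself:

```csharp
using System;

static class AsciiOps
{
    // Uppercase ASCII letters in place; every non-ASCII byte is left untouched,
    // so well-formed UTF-8 stays well-formed.
    public static void ToUpperAsciiInPlace(Span<byte> buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            byte b = buffer[i];
            if (b >= (byte)'a' && b <= (byte)'z')
                buffer[i] = (byte)(b - 0x20);   // 'a'..'z' -> 'A'..'Z'
        }
    }
}
```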
These operations could be on the newly-introduced System.Text.Unicode.Utf8 static class. They would take ROS<byte> and Span<byte> as input parameters because they can operate on arbitrary byte buffers. Their runtime performance would be subpar compared to similar methods on Utf8String, Utf8Span, or other types where we can guarantee that no invalid data will be seen, as the APIs which operate on raw byte buffers would need to be defensive and would probably operate over the input in an iterative fashion rather than in bulk. One potential behavior could be skipping over invalid data and leaving it unchanged as part of the operation.

Sample Utf8StringBuilder implementation for private use

Code samples and metadata representation
The C# compiler could detect support for UTF-8 strings by looking for the existence of the System.Utf8String type and the appropriate helper APIs on RuntimeHelpers, as called out in the samples below. If these APIs don't exist, then the target framework does not support the concept of UTF-8 strings.

Literals
Literal UTF-8 strings would appear as regular strings in source code, but would be prefixed by a u as demonstrated below. The u prefix would denote that the return type of this literal string expression should be Utf8String instead of string.

The u prefix would also be combinable with the @ prefix and the $ prefix (more on this below). Additionally, literal UTF-8 strings must be well-formed Unicode strings.
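To illustrate the proposed (hypothetical, not shipped) syntax:

```csharp
// 'u' marks a UTF-8 literal; it composes with the existing @ verbatim prefix.
Utf8String greeting = u"Hello, world!";
Utf8String path     = u@"C:\temp\logs";
```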
Three alternative designs were considered. One was to use RVA statics (through ldsflda) instead of literal UTF-16 strings (through ldstr) before calling a "load from RVA" method on RuntimeHelpers. The overhead of using RVA statics is somewhat greater than the overhead of using the normal UTF-16 string table, so the normal UTF-16 string literal table should still be the more optimized case for small-ish strings, which we believe to be the common case.

Another alternative considered was to introduce a new opcode ldstr.utf8, which would act as a UTF-8 equivalent to the normal ldstr opcode. This would be a breaking change to the .NET tooling ecosystem, and the ultimate decision was that there would be too much pain to the ecosystem to justify the benefit.

The third alternative considered was to smuggle UTF-8 data in through a normal UTF-16 string in the string table, then call a RuntimeHelpers method to reinterpret the contents. This would result in a "garbled" string for anybody looking at the raw IL. While that in itself isn't terrible, there is the possibility that smuggling UTF-8 data in this manner could result in a literal string which has ill-formed UTF-16 data. Not all .NET tooling is resilient to this. For example, xunit's test runner produces failures if it sees attributes initialized from literal strings containing ill-formed UTF-16 data. There is a risk that other tooling would behave similarly, potentially modifying the DLL in such a manner that errors only manifest themselves at runtime. This could result in difficult-to-diagnose bugs.

We may wish to reconsider this decision in the future. For example, if we see that it is common for developers to use large UTF-8 literal strings, maybe we'd want to dynamically switch to using RVA statics for such strings. This would lower the resulting DLL size. However, this would add extra complexity to the compilation process, so we'd want to tread lightly here.
Constant handling
String concatenation
There would be APIs on Utf8String which mirror the string.Concat APIs. The compiler should special-case the + operator to call the appropriate n-ary overload of Concat.

Since we expect use of Utf8String to be "deliberate" when compared to string (see the beginning of this document), we should consider that a developer who is using UTF-8 wants to stay in UTF-8 during concatenation operations. This means that if a line involves the concatenation of both a Utf8String and a string, the final type post-concatenation should be Utf8String.

This is still open for discussion, as the behavior may be surprising to people. Another alternative is to produce a build warning if somebody tries to mix-and-match UTF-8 strings and UTF-16 strings in a single concatenation expression.

If string interpolation is added in the future, this shouldn't result in ambiguity. The $ interpolation operator will be applied to a literal Utf8String or a literal string, and that would dictate the overall return type of the operation.

Equality comparisons
There are standard `==` and `!=` operators defined on the `Utf8String` class.

The C# compiler should special-case when either side of an equality expression is known to be a literal null object, and if so the compiler should emit a referential check against the null object instead of calling the operator method. This matches the `if (myString == null)` behavior that the `string` type enjoys today.

Additionally, equality / inequality comparisons between `Utf8String` and `string` should produce compiler warnings, as they will never succeed.

I attempted to define `operator ==(Utf8String a, string b)` so that I could slap `[Obsolete]` on it and generate the appropriate warning, but this had the side effect of disallowing the user from writing the code `if (myUtf8String == null)`, since the compiler couldn't figure out which overload of `operator ==` to call. This was also one of the reasons I had opened https://github.com/dotnet/csharplang/issues/2340.
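A sketch of that experiment, shown only to illustrate the ambiguity problem (the operator shapes are illustrative, not the shipped design):

```csharp
public sealed partial class Utf8String
{
    // Cross-type operators marked [Obsolete] so that UTF-8 vs. UTF-16
    // comparisons produce a warning at the call site.
    [Obsolete("Comparing Utf8String and string always returns false.")]
    public static bool operator ==(Utf8String left, string right) => false;

    [Obsolete("Comparing Utf8String and string always returns true.")]
    public static bool operator !=(Utf8String left, string right) => true;

    // Problem: with the overloads above in place, 'myUtf8String == null' becomes
    // ambiguous between ==(Utf8String, Utf8String) and ==(Utf8String, string),
    // so the compiler rejects the null check.
}
```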
Marshaling behaviors
Like the `string` type, the `Utf8String` type shall be marshalable across p/invoke boundaries. The corresponding unmanaged type shall be `LPCUTF8` (equivalent to a `BYTE*` pointing to null-terminated UTF-8 data) unless a different unmanaged type is specified in the p/invoke signature.

If a different `[MarshalAs]` representation is specified, the stub routine creates a temporary copy in the desired representation, performs the p/invoke, then destroys the temporary copy or allows the GC to reclaim it.

If a `Utf8String` must be marshaled from native-to-managed (e.g., a reverse p/invoke takes place on a delegate which has a `Utf8String` parameter), the stub routine is responsible for fixing up invalid UTF-8 data before creating the `Utf8String` instance (or it may let the `Utf8String` constructor perform the fixup automatically).

Unmanaged routines must not modify the contents of any `Utf8String` instance marshaled across the p/invoke boundary. `Utf8String` instances are assumed to be immutable once created, and violating this assumption could cause undefined behavior within the runtime.

There is no default marshaling behavior for `Utf8Segment` or `Utf8Span` since they are not guaranteed to be null-terminated. If in the future the runtime allows marshaling `{ReadOnly}Span<T>` across a p/invoke boundary (presumably as a non-null-terminated array equivalent), library authors may fetch the underlying `ReadOnlySpan<byte>` from the `Utf8Segment` or `Utf8Span` instance and directly marshal that span across the p/invoke boundary.
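A minimal sketch of the proposed default marshaling, assuming a hypothetical native export `native_log_message` in a library named `nativelib`:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Hypothetical native function taking a null-terminated UTF-8 string.
    // Under the proposal, the Utf8String argument is marshaled as LPCUTF8
    // (a BYTE* pointing to null-terminated UTF-8 data) by default.
    [DllImport("nativelib")]
    internal static extern int native_log_message(Utf8String message);
}
```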
Automatic coercion of UTF-16 literals to UTF-8 literals
If possible, it would be nice if UTF-16 literals (not arbitrary `string` instances) could be automatically coerced to UTF-8 literals (via the `ldstr / call` routines mentioned earlier). This coercion would only be considered if attempting to leave the data as a `string` would have caused a compilation error. This could help eliminate some errors resulting from developers forgetting to put the `u` prefix in front of the string literal, and it could make the code cleaner. Some examples follow.
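A minimal illustration, assuming a hypothetical `TakesUtf8` method:

```csharp
// Hypothetical method used only for illustration.
static void TakesUtf8(Utf8String text) { /* ... */ }

// Without coercion, the call below requires the prefix: TakesUtf8(u"Hello");
// With the proposed coercion, the compiler may convert the UTF-16 literal to a
// UTF-8 literal, because leaving it typed as 'string' would not compile here.
TakesUtf8("Hello");
```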
UTF-8 String interpolation
The string interpolation feature is undergoing significant churn (see https://github.com/dotnet/csharplang/issues/2302). I envision that when a final design is chosen, there would be a UTF-8 counterpart for symmetry. The internal `IUtf8Formattable` interface as proposed above is being designed partly with this feature in mind in order to allow single-allocation `Utf8String` interpolation.
`ustring` contextual language keyword
For simplicity, we may want to consider a contextual language keyword which corresponds to the `System.Utf8String` type. The exact name is still up for debate, as is whether we'd want it at all, but we could consider something like the below.
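A minimal sketch, assuming the keyword simply aliases `System.Utf8String` the way `string` aliases `System.String`:

```csharp
// 'ustring' as a contextual keyword would be shorthand for System.Utf8String.
ustring firstName = u"Alice";
System.Utf8String lastName = u"Smith";   // same type, spelled out fully
```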
The name `ustring` is intended to invoke "Unicode string". Another leading candidate was `utf8`. We may wish not to ship with this keyword support in v1 of the `Utf8String` feature. If we opt not to do so, we should be mindful of how we might be able to add it in the future without introducing breaking changes.
An alternative design is to use a `u` suffix instead of a `u` prefix. I'm mostly impartial to this, but there is a nice symmetry to having the characters `u`, `$`, and `@` all available as prefixes on literal strings.

We could also drop the `u` prefix entirely and rely solely on type targeting:
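A minimal illustration, assuming the `ustring` keyword and the `(ustring)` coercion hint discussed here:

```csharp
// Relying purely on target typing, with no 'u' prefix:
ustring greeting = "Hello there!";      // literal coerced because of the declared type

// A cast-like coercion hint could serve the same purpose in expression contexts:
var farewell = (ustring)"Goodbye!";
```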
This has implications for string interpolation, as it wouldn't be possible to prepend both the `(ustring)` coercion hint and the `$` interpolation operator simultaneously.
Switching and pattern matching
If a value whose type is statically known to be `Utf8String` is passed to a `switch` statement, the corresponding `case` statements should allow the use of literal `Utf8String` values.

Since pattern matching operates on input values of arbitrary types, I'm pessimistic that pattern matching will be able to take advantage of target typing. This may instead require that developers specify the `u` prefix on `Utf8String` literals if they wish such values to participate in pattern matching.
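A minimal sketch of switching over a `Utf8String`, assuming the `u` prefix on the case labels:

```csharp
static int ToStatusCode(Utf8String verb)
{
    // The case labels are UTF-8 literals; under the proposal they could also be
    // plain literals coerced via target typing.
    switch (verb)
    {
        case u"GET":
        case u"HEAD":
            return 200;
        case u"DELETE":
            return 204;
        default:
            return 405;
    }
}
```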
A brief interlude on indexers and `IndexOf`
`Utf8String` and related types do not expose an elemental indexer (`this[int]`) or a typical `IndexOf` method because they're trying to rid the developer of the notion that bytewise indices into UTF-8 buffers can be treated equivalently to charwise indices into UTF-16 buffers. Consider the naïve implementation of a typical "string split" routine as presented below.
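A sketch of such a naïve routine, using `string.IndexOf` with an OrdinalIgnoreCase comparer and simple length arithmetic (the exact shape is illustrative):

```csharp
// Naïve "split around the first occurrence" helper over UTF-16 strings.
static (string Before, string After) SplitAroundFirst(string source, string target)
{
    int idx = source.IndexOf(target, StringComparison.OrdinalIgnoreCase);
    if (idx < 0)
    {
        return (source, string.Empty);
    }

    // The arithmetic below assumes the matched text inside 'source' is exactly
    // target.Length chars long - which, as discussed next, is not always true.
    return (source.Substring(0, idx), source.Substring(idx + target.Length));
}
```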
[ 0047 0052 0045 0045 004E ]
). Performing an OrdinalIgnoreCase search for the substring "e" ([ 0065 ]
) will result in a match, as 'e' (U+0065
) and 'E' (U+0045
) compare as equal under an OrdinalIgnoreCase comparer.As another example, consider the UTF-16 string "preſs" (
[ 0070 0072 0065 017F 0073 ]
), whose fourth character is the Latin long s 'ſ' (U+017F
). Performing an OrdinalIgnoreCase search for the substring "S" ([ 0053 ]
) will result in a match, as 'ſ' (U+017F
) and 'S' (U+0053
) compare as equal under an OrdinalIgnoreCase comparer.There are also scenarios where the length of the match within the search string might not be equal to the length of the target string. Consider the UTF-16 string "encyclopædia" (
[ 0065 006E 0063 0079 0063 006C 006F 0070 00E6 0064 0069 0061 ]
), whose ninth character is the ligature 'æ' (U+00E6
). Performing an InvariantCultureIgnoreCase search for the substring "ae" ([ 0061 0065 ]
) will result in a match at index 8, as "æ" ([ 00E6 ]
) and "ae" ([ 0061 0065 ]
) compare as equal under an InvariantCultureIgnoreCase comparer.This result is interesting and should give us pause. Since
"æ".Length == 1
and"ae".Length == 2
, the arithmetic at the end of the method will actually result in the wrong substrings being returned to the caller.Due to the nature of UTF-16 (used by
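Following the "encyclopædia" example above, the flaw shows up with ordinary `string` APIs:

```csharp
string source = "encyclopædia";
int idx = source.IndexOf("ae", StringComparison.InvariantCultureIgnoreCase);  // 8

// Slicing by the target's length (2 chars) overshoots the 1-char match "æ",
// so the remainder comes back as "ia" instead of the expected "dia".
string after = source.Substring(idx + "ae".Length);
```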
Due to the nature of UTF-16 (used by `string`), when performing an Ordinal or an OrdinalIgnoreCase comparison, the length of the matched substring within the source will always have a `char` count equal to `target.Length`. The length mismatch as demonstrated by "encyclopædia" above can only happen with a culture-sensitive comparer or any of the InvariantCulture comparers.

However, in UTF-8, these same guarantees do not hold. Under UTF-8, only when performing an Ordinal comparison is there a guarantee that the length of the matched substring within the source will have a `byte` count equal to the target. All other comparers - including OrdinalIgnoreCase - have the behavior that the byte length of the matched substring can change (either shrink or grow) when compared to the byte length of the target string.

As an example of this, consider the string "preſs" from earlier, but this time in its UTF-8 representation (`[ 70 72 65 C5 BF 73 ]`). Performing an OrdinalIgnoreCase search for the target UTF-8 string "S" (`[ 53 ]`) will match on the `[ C5 BF ]` portion of the source string. (This is the UTF-8 representation of the letter 'ſ'.) To properly split the source string along this search target, the caller needs to know not only where the match was, but also how long the match was within the original source string.
This fundamental problem is why `Utf8String` and related types don't expose a standard `IndexOf` function or a standard `this[int]` indexer. It's still possible to index directly into the underlying byte buffer by using an API which projects the data as a `ROS<byte>`. But for splitting operations, these types instead offer a simpler API that performs the split on the caller's behalf, handling the length adjustments appropriately. For callers who want the equivalent of `IndexOf`, the types instead provide `TryFind` APIs that return a `Range` instead of a typical integral index value. This `Range` represents the matching substring within the original source string, and new C# language features make it easy to take this result and use it to create slices of the original source input string.

This also addresses feedback that was given in a previous prototype: users weren't sure how to interpret the result of the `IndexOf` method. (Is it a byte count? Is it a char count? Is it something else?) Similarly, there was confusion as to what parameters should be passed to a `this[int]` indexer or a `Substring(int, int)` method. By having the APIs promote use of `Range` and related C# language features, this confusion should subside. Power developers can inspect the `Range` instance directly to extract raw byte offsets if needed, but most devs shouldn't need to query such information.
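A minimal sketch of the `TryFind` / `Range` pattern, assuming a `TryFind(value, comparison, out Range)` overload and range-based slicing:

```csharp
Utf8String source = u"preſs";

// TryFind reports the match as a Range over the source's UTF-8 code units.
if (source.TryFind(u"S", StringComparison.OrdinalIgnoreCase, out Range match))
{
    // The range covers the two bytes [ C5 BF ] of 'ſ', so slicing with it stays
    // correct even though the match is wider than the one-byte target.
    Utf8String before = source[..match.Start];
    Utf8String after  = source[match.End..];
}
```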
API usage samples
Scenario: Split an incoming string of the form "LastName, FirstName" into individual FirstName and LastName components.
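A minimal sketch, assuming a `Split(char)` overload plus `AsSpan`, `Trim`, and `ToUtf8String` helpers:

```csharp
static (Utf8String FirstName, Utf8String LastName) ParseName(Utf8String input)
{
    // input looks like u"Smith, Alice". SplitResult supports deconstruction,
    // so the two halves can be pulled out without an explicit loop.
    (Utf8Span last, Utf8Span first) = input.AsSpan().Split(',');
    return (first.Trim().ToUtf8String(), last.Trim().ToUtf8String());
}
```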
Additionally, the `SplitResult` struct returned by `Utf8Span.Split` implements both the standard `IEnumerable<T>` pattern and the C# deconstruct pattern, which allows it to be used separately from enumeration for simple cases where only a small handful of values are returned.

Scenario: Split a comma-delimited input into substrings, then perform an operation with each substring.
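A minimal sketch, assuming a `Split(char)` overload and a hypothetical `HandleField` callback:

```csharp
static void ProcessFields(Utf8Span commaDelimited)
{
    // SplitResult implements the IEnumerable<T> pattern, so it composes with foreach.
    foreach (Utf8Span field in commaDelimited.Split(','))
    {
        HandleField(field);   // hypothetical per-field operation
    }
}

static void HandleField(Utf8Span field) { /* ... */ }
```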
Miscellaneous topics and open questions
What about comparing UTF-16 and UTF-8 data?
Currently there is a set of APIs, `Utf8String.AreEquivalent`, which will decode sequences of UTF-16 and UTF-8 data and compare them for ordinal equality. The general code pattern is below.
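A minimal sketch of that pattern (the exact `AreEquivalent` overload shape is assumed):

```csharp
Utf8String utf8Name = u"Ana";
string utf16Name = "Ana";

// Compares the two payloads for ordinal equality, transcoding on the fly,
// instead of allocating via utf8Name.ToString() == utf16Name.
bool same = Utf8String.AreEquivalent(utf8Name, utf16Name);
```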
Do we want to add an `operator ==(Utf8String, string)` overload which would allow easy `==` comparison of UTF-8 and UTF-16 data? There are three main downsides to this which caused me to vote no, but I'm open to reconsideration.

1. The compiler would need to special-case `if (myUtf8String == null)`, which would now be ambiguous between the two overloads. (If the compiler is already special-casing null checks, this is a non-issue.)
2. The performance of UTF-16 to UTF-8 comparison is much worse than the performance of UTF-16 to UTF-16 (or UTF-8 to UTF-8) comparison. When the representation is the same on both sides, certain shortcuts can be implemented to avoid the O(n) comparison, and even the O(n) comparison itself can be implemented as a simple memcmp operation. When the representations are heterogeneous, the opportunity for taking shortcuts is much more restricted, and the O(n) comparison itself has a higher constant factor. Developers might not expect such a performance characteristic from an equality operator.
3. Comparing a `Utf8String` against a literal string would no longer go through the fast path, as target typing would cause the compiler to emit a call to `operator ==(Utf8String, string)` instead of `operator ==(Utf8String, Utf8String)`. The comparison itself would then have the lower performance described by bullet (2) above.

One potential upside to having such a comparison is that it would prevent developers from using the antipattern `if (myUtf8String.ToString() == someString)`, which would result in unnecessary allocations. If we are concerned about this antipattern, one way to address it would be through a Code Analyzer.
When calling the "unsafe" APIs, callers are fully responsible for ensuring that the invariants are maintained. Our debug builds could double-check some of these invariants (such as the initial
Utf8String
creation consisting only of well-formed data). We could also consider allowing applications to opt-in to these checks at runtime by enabling an MDA or other diagnostic facility. But as a guiding principle, when "unsafe" APIs are called the Framework should trust the developer and should have as little overhead as possible.Consider consolidating the unsafe factory methods under a single unsafe type.
This would prevent pollution of the type's normal API surface and could help write tools which audit use of a single "unsafe" type.
Some of the methods may need to be extension methods instead of normal static factories. (Example: Unsafe slicing routines, should we choose to expose them.)
Potential APIs to enlighten
System namespace
Include `Utf8String` / `Utf8Span` overloads on `Console.WriteLine`. Additionally, perhaps introduce an API `Console.ReadLineUtf8`.
System.Data.* namespace
Include generalized support for serializing `Utf8String` properties as a primitive with appropriate mapping to `nchar` or `nvarchar`.
System.Diagnostics.* namespace
Enlighten `EventSource` so that a caller can write `Utf8String` / `Utf8Span` instances cheaply. Additionally, some types like `ActivitySpanId` already have `ROS<byte>` ctors; overloads can be introduced here.
System.Globalization.* namespace
The `CompareInfo` type has many members which operate on `string` instances. These should be spanified foremost, and `Utf8String` / `Utf8Span` overloads should be added. Good candidates are `Compare`, `GetHashCode`, `IndexOf`, `IsPrefix`, and `IsSuffix`.

The `TextInfo` type has members which should be treated similarly. `ToLower` and `ToUpper` are good candidates. Can we get away without enlightening `ToTitleCase`?
System.IO.* namespace
`BinaryReader` and `BinaryWriter` should have overloads which operate on `Utf8String` and `Utf8Span`. These overloads could potentially be cheaper than the normal `string` / `ROS<char>` based overloads, since the reader / writer instances may in fact be backed by UTF-8 under the covers. If this is the case, then writing is simple projection, and reading is validation (faster than transcoding).

`File`: `WriteAllLines`, `WriteAllText`, `AppendAllText`, etc. are good candidates for overloads to be added. On the read side, there's `ReadAllTextUtf8` and `ReadAllLinesUtf8`.

`TextReader.ReadLine` and `TextWriter.Write` are also good candidates to overload. This follows the same general premise as `BinaryReader` and `BinaryWriter` as mentioned above.

Should we also enlighten `SerialPort` or GPIO APIs? I'm not sure if UTF-8 is a bottleneck here.
System.Net.Http.* namespace
Introduce `Utf8StringContent`, which automatically sets the charset header. This type already exists in the System.Utf8String.Experimental package.
System.Text.* namespace
`UTF8Encoding`: Overload candidates are `GetChars`, `GetString`, and `GetCharCount` (of `Utf8String` or `Utf8Span`). These would be able to skip validation after transcoding as long as the developer hasn't subclassed the type.

`Rune`: Add a `ToUtf8String` API. Add an `IsDefined` API to query the OS's NLS tables (could help with databases and other components that need to adhere to strict case / comparison processing standards).

`TextEncoder`: Add `Encode(Utf8String): Utf8String` and `FindFirstIndexToEncode(Utf8Span): Index`. This is useful for HTML-escaping, JSON-escaping, and related operations.

`Utf8JsonReader`: Add read APIs (`GetUtf8String`) and overloads to both the ctor and `ValueTextEquals`.

`JsonEncodedText`: Add an `EncodedUtf8String`
property.

Regex is a bit of a special case because there has been discussion about redoing the regex stack all-up. If we did proceed with redoing the stack, then it would make sense to add first-class support for UTF-8 here.