Open gafter opened 5 years ago
I’m really interested in learning more about the challenge that has sparked the need for this new type. Where would I go to learn more?
@TonyValenti there are likely many resources on the web about this. A simple starting point is to look into what UTF-8 is and how it stores text in memory, compared with the 16-bit encoding that C# and Java have used since the start.
I suggest seeing https://github.com/dotnet/corefxlab/issues/2350 for discussion of Utf8String.
String concatenation may be somewhat problematic. A + operation between a Utf8String and a string would be ambiguous due to the presence of the following two operators:
Utf8String operator +(Utf8String x, object y);
string operator +(object x, string y);
It isn't clear what semantics are desired. Do we need concatenation for Utf8String values?
If we don't want the compiler to simply special-case this, surely adding a Utf8String operator +(Utf8String x, string y) and the equivalent reverse overload would fix that?
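To make the suggestion concrete, here is a rough sketch of what explicit mixed-operand overloads could look like. Utf8Text is a hypothetical stand-in written for illustration, not the proposed System.Utf8String, and it assumes the result of a mixed concatenation should be the UTF-8 type. Since these are user-defined operators, which overload resolution considers before predefined ones, a library-level sketch like this only shows the shape of the overloads and does not reproduce the ambiguity between two language-defined operators.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the proposed type; NOT the real System.Utf8String.
public sealed class Utf8Text
{
    private readonly byte[] _bytes;

    // A null string is treated as empty, as an assumption for this sketch.
    public Utf8Text(string value) => _bytes = Encoding.UTF8.GetBytes(value ?? "");

    private Utf8Text(byte[] bytes) => _bytes = bytes;

    // Appends the UTF-8 bytes of y to those of x; a null operand counts as empty.
    private static Utf8Text Concat(Utf8Text x, Utf8Text y)
    {
        byte[] left = x?._bytes ?? Array.Empty<byte>();
        byte[] right = y?._bytes ?? Array.Empty<byte>();
        var result = new byte[left.Length + right.Length];
        left.CopyTo(result, 0);
        right.CopyTo(result, left.Length);
        return new Utf8Text(result);
    }

    // The overload suggested above, plus its reverse.
    public static Utf8Text operator +(Utf8Text x, string y) => Concat(x, new Utf8Text(y));
    public static Utf8Text operator +(string x, Utf8Text y) => Concat(new Utf8Text(x), y);

    // Same-type concatenation.
    public static Utf8Text operator +(Utf8Text x, Utf8Text y) => Concat(x, y);

    // Decodes back to a UTF-16 string for display.
    public override string ToString() => Encoding.UTF8.GetString(_bytes);
}
```

With these overloads, both u + " world" and "say " + u bind to a Utf8Text-returning operator, because each call has exactly one applicable overload.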
See also https://github.com/dotnet/csharplang/issues/184
We add language support for a new platform type, Utf8String (https://github.com/dotnet/corefxlab/issues/2350). This name is tentative and subject to a decision by the corert team. For now we use the name Utf8String as a placeholder for whatever name it ends up being.
Section numbers below refer to the ECMA version of the specification.
The following sections are proposed to be added to the specification:
9.2.N Utf8String (in section Types)
The type System.Utf8String is a sealed class type that inherits directly from object. In the remainder of the spec, we use the name Utf8String to refer to this specific type. Instances of Utf8String represent Unicode strings stored internally using the Unicode UTF-8 encoding (https://en.wikipedia.org/wiki/UTF-8).
11.2.N Utf8String conversion (in section Implicit Conversions)
An implicit conversion exists from a constant expression of type string to the type Utf8String. This conversion produces a null value if the expression's value is null. Otherwise the conversion produces an instance of Utf8String that represents the same sequence of Unicode codepoints. It is a compile-time error if the characters of the string constant cannot be represented as a valid Unicode UTF-8 sequence. This would occur, for example, if the input string constant contains unmatched surrogates. The result of the conversion is a constant expression of type Utf8String.
Concatenation
The following addition (no pun intended) is made to 12.9.5 Addition operator:
Utf8String concatenation:
These overloads of the binary + operator perform Utf8String concatenation. If an operand of Utf8String concatenation is null, an empty Utf8String is substituted. Otherwise, any non-string operand is converted to its Utf8String representation by invoking the virtual ToString method inherited from type object and then encoding the result as a Utf8String. If ToString returns null, an empty Utf8String is substituted. If the string returned by ToString is not representable as a Utf8String, a System.ArgumentException is thrown.
The result of the Utf8String concatenation operator is a Utf8String that consists of the characters of the left operand followed by the characters of the right operand. The Utf8String concatenation operator never returns a null value. A System.OutOfMemoryException may be thrown if there is not enough memory available to allocate the resulting Utf8String.
Constant Expressions
The following changes are made to 12.20 Constant expressions:
Change this sentence
to this
We add the Utf8String conversion to the set of conversions permitted in a constant expression.
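The validity rule for the constant conversion (null maps to null, unmatched surrogates are rejected) can be approximated at run time with existing .NET APIs. The following is only a sketch under that assumption: the class and method names are made up for illustration, the result is a plain byte[] rather than a Utf8String, and the proposed check would happen at compile time rather than by catching an exception.

```csharp
using System;
using System.Text;

// Hypothetical helper; approximates the proposed conversion rules at run time.
public static class Utf8ConstantCheck
{
    // Strict encoder: no byte order mark, and invalid input (such as an
    // unmatched surrogate) raises an exception instead of being replaced.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns the UTF-8 bytes of the string, or null for a null input,
    // mirroring the null-to-null rule of the proposed conversion.
    public static byte[] ToUtf8(string value)
    {
        if (value == null) return null;
        return Strict.GetBytes(value); // throws EncoderFallbackException on bad input
    }

    // True if the string can be encoded as valid UTF-8.
    public static bool IsRepresentable(string value)
    {
        try
        {
            ToUtf8(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For instance, IsRepresentable("a\uD800b") is false because \uD800 is a high surrogate with no matching low surrogate, which is the kind of constant the proposal would reject at compile time. EncoderFallbackException derives from System.ArgumentException, the exception type named in the concatenation rules.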
Open Issues
Concatenation
String concatenation may be somewhat problematic. A + operation between a Utf8String and a string would be ambiguous due to the presence of the following two operators:
Utf8String operator +(Utf8String x, object y);
string operator +(object x, string y);
It isn't clear what semantics are desired. Do we need concatenation for Utf8String values?
Interpolation
There is no easy way to use interpolation to get a Utf8String value. One approach would be to define a new interpolated string conversion from an interpolated string to the type Utf8String. That would permit us to issue a compile-time error if the format string contains unmatched surrogates.