dotnet / csharplang

The official repo for the design of the C# programming language

UTF8 String Literals - Draft Specification #2911

Open gafter opened 5 years ago

gafter commented 5 years ago

See also https://github.com/dotnet/csharplang/issues/184

We add language support for a new platform type, Utf8String (https://github.com/dotnet/corefxlab/issues/2350). This name is tentative and subject to a decision by the corert team. For now we use the name Utf8String as a placeholder for whatever name it ends up being.

Section numbers below refer to the ECMA version of the specification.

The following sections are proposed to be added to the specification:

9.2.N Utf8String (in section Types)

The type System.Utf8String is a sealed class type that inherits directly from object. In the remainder of the spec, we use the name Utf8String to refer to this specific type. Instances of Utf8String represent Unicode strings stored internally using the Unicode UTF-8 encoding (https://en.wikipedia.org/wiki/UTF-8).

11.2.N Utf8String conversion (in section Implicit Conversions)

An implicit conversion exists from a constant expression of type string to the type Utf8String. This conversion produces a null value if the expression's value is null. Otherwise the conversion produces an instance of Utf8String that represents the same sequence of Unicode codepoints. It is a compile-time error if the characters of the string constant cannot be represented as a valid Unicode UTF-8 sequence. This would occur, for example, if the input string constant contains unmatched surrogates. The result of the conversion is a constant expression of type Utf8String.
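As a rough illustration of the "unmatched surrogates" case, the sketch below performs the same well-formedness check at run time with the existing System.Text.UTF8Encoding class in strict mode. The draft would have the compiler report this at compile time; the helper name IsRepresentableAsUtf8 is hypothetical and only stands in for that check.

```csharp
using System;
using System.Text;

static class SurrogateCheck
{
    // Strict encoder: throws on invalid input instead of substituting U+FFFD.
    static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: true when the UTF-16 string maps to well-formed UTF-8.
    public static bool IsRepresentableAsUtf8(string s)
    {
        try
        {
            StrictUtf8.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // Raised for a high or low surrogate that has no partner.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsRepresentableAsUtf8("caf\u00E9"));      // ordinary text: True
        Console.WriteLine(IsRepresentableAsUtf8("\uD83D\uDE00"));   // matched surrogate pair: True
        Console.WriteLine(IsRepresentableAsUtf8("bad \uD800 end")); // lone high surrogate: False
    }
}
```

A string constant for which this check fails is the kind of input the proposed conversion would reject.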

Concatenation

The following addition (no pun intended) is made to 12.9.5 Addition operator:

These overloads of the binary + operator perform Utf8String concatenation. If an operand of Utf8String concatenation is null, an empty Utf8String is substituted. Otherwise, any non-string operand is converted to its Utf8String representation by invoking the virtual ToString method inherited from type object and then encoding the result as a Utf8String. If ToString returns null, an empty Utf8String is substituted. If the string returned by ToString is not representable as a Utf8String, a System.ArgumentException is thrown.

The result of the Utf8String concatenation operator is a Utf8String that consists of the characters of the left operand followed by the characters of the right operand. The Utf8String concatenation operator never returns a null value. A System.OutOfMemoryException may be thrown if there is not enough memory available to allocate the resulting Utf8String.
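The operand rules above can be sketched with a stand-in type. Utf8Text, Normalize and Concat are hypothetical names used only for illustration; they are not the proposed platform API, and a real implementation would presumably avoid the intermediate string.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the proposed Utf8String; not a real platform type.
sealed class Utf8Text
{
    // Strict encoder: throws EncoderFallbackException (an ArgumentException) on lone surrogates.
    static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    public static readonly Utf8Text Empty = new Utf8Text(Array.Empty<byte>());

    readonly byte[] _bytes;

    Utf8Text(byte[] bytes) => _bytes = bytes;

    public static Utf8Text FromString(string s) => new Utf8Text(Strict.GetBytes(s));

    // Draft rules: null becomes empty; other operands go through ToString() and are then encoded.
    static Utf8Text Normalize(object operand)
    {
        if (operand is null) return Empty;
        if (operand is Utf8Text u) return u;
        string text = operand.ToString();
        return text is null ? Empty : FromString(text);
    }

    // Never returns null: the result is the left operand's bytes followed by the right operand's.
    public static Utf8Text Concat(object left, object right)
    {
        byte[] a = Normalize(left)._bytes;
        byte[] b = Normalize(right)._bytes;
        var result = new byte[a.Length + b.Length];
        Buffer.BlockCopy(a, 0, result, 0, a.Length);
        Buffer.BlockCopy(b, 0, result, a.Length, b.Length);
        return new Utf8Text(result);
    }

    public int ByteLength => _bytes.Length;

    public override string ToString() => Encoding.UTF8.GetString(_bytes);
}
```

Under these assumptions, Utf8Text.Concat(Utf8Text.FromString("id="), 42) yields the bytes for "id=42", and Utf8Text.Concat(null, null) yields an empty value rather than null.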

Constant Expressions

The following changes are made to 12.20 Constant expressions:

Change this sentence

If a constant expression is a reference type, it must be the string type, a default value expression (§12.7.15) for some reference type, or the value of the expression must be null.

to this

If a constant expression is a reference type, it must be the string type, the Utf8String type, a default value expression (§12.7.15) for some reference type, or the value of the expression must be null.

We add the Utf8String conversion to the set of conversions permitted in a constant expression.

Open Issues

Concatenation

String concatenation may be somewhat problematic. A + operation between a Utf8String and a string would be ambiguous due to the presence of the following two operators:

Utf8String operator +(Utf8String x, object y);
string operator +(object x, string y);

It isn't clear what semantics are desired. Do we need concatenation for Utf8String values?

Interpolation

There is no easy way to use interpolation to get a Utf8String value. One approach would be to define a new interpolated string conversion from an interpolated string to the type Utf8String. That would permit us to issue a compile-time error if the format string contains unmatched surrogates.

TonyValenti commented 5 years ago

I’m really interested in learning more about the challenge that has sparked the need for this new type. Where would I go to learn more?

CyrusNajmabadi commented 5 years ago

> I’m really interested in learning more about the challenge that has sparked the need for this new type. Where would I go to learn more?

@TonyValenti there are likely many resources on the web about this. One thing to look into is what UTF-8 is and how it generally stores text in memory versus the 16-bit encoding that C#/Java have used since the start.

gafter commented 5 years ago

I suggest seeing https://github.com/dotnet/corefxlab/issues/2350 for discussion of Utf8String.

john-h-k commented 5 years ago

> String concatenation may be somewhat problematic. A + operation between a Utf8String and a string would be ambiguous due to the presence of the following two operators:
>
> Utf8String operator +(Utf8String x, object y);
> string operator +(object x, string y);
>
> It isn't clear what semantics are desired. Do we need concatenation for Utf8String values?

If we don't want the compiler to simply special-case this, adding a Utf8String operator +(Utf8String x, string y) and the equivalent reverse overload could fix that, surely?
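A minimal sketch of that idea, using ordinary static methods because predefined operators are chosen by the same overload resolution rules. Utf8Text and Operators.Add are hypothetical stand-ins, and each method returns a label naming the operator it represents, so the winning overload is visible.

```csharp
using System;

// Hypothetical stand-in for the proposed Utf8String; not a real platform type.
sealed class Utf8Text { }

static class Operators
{
    // Same parameter lists as the two operators quoted above.
    public static string Add(Utf8Text x, object y) => "Utf8String +(Utf8String, object)";
    public static string Add(object x, string y) => "string +(object, string)";

    // The extra overload suggested here. Without it, Add(u, "text") fails with
    // CS0121 (ambiguous call): the first overload is better on argument 1 and
    // the second is better on argument 2, so neither is better overall.
    public static string Add(Utf8Text x, string y) => "Utf8String +(Utf8String, string)";
}

static class Program
{
    static void Main()
    {
        var u = new Utf8Text();
        // Both arguments match the third overload exactly, so it is the unique best candidate.
        Console.WriteLine(Operators.Add(u, "text"));
    }
}
```

The mirrored case (string on the left, Utf8String on the right) would need the reverse overload in the same way; it is left out here to keep the sketch short.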