cplusplus / CWG

Core Working Group

CWG2779 [lex.ccon] What are the types of single code units of character-literal and string-literal? #285

Open xmh0511 opened 1 year ago

xmh0511 commented 1 year ago

Full name of submitter (unless configured in github; will be published with the issue): Jim X

The original issue is https://github.com/cplusplus/draft/issues/5247

[lex.ccon] p1 says

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

[lex.charset] p8 only says

A code unit is an integer value of character type ([basic.fundamental]).

So, what is the type of a single code unit for an ordinary character literal, a wide character literal, a UTF-8 character literal, a UTF-16 character literal, or a UTF-32 character literal? We didn't clearly specify them. The column Type in the table only specifies the type of the character-literal itself.

The kind of a character-literal, its type, and its associated character encoding ([lex.charset]) are determined by its encoding-prefix and its c-char-sequence as defined by Table 9.

We didn't specify what the type of a single code unit for the corresponding literal is, so we cannot determine the range of representable values of a single code unit; that is, we cannot determine whether a character can be encoded in a single code unit for that character-literal.

Similarly, we didn't specify what the type of a single code unit for a string-literal is. At best, [lex.string] p10 can barely imply that the type of a single code unit is related to the element type of the string-literal.

String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence...

Because the string literal object is of type "array of n T", and the array is initialized by the sequence of code units (each element corresponds to one code unit), this may imply that the element type is related to the type of a single code unit in the string-literal. There is an exception for ordinary string literals; assume the literal encoding is UTF-8:

const char* ptr = "Õ";

The sequence of code unit values will be {195, 149, 0}. Neither 195 nor 149 can be represented in the type char if we assume char is signed with a width of 8 bits. If the array object is initialized with this sequence, there are narrowing conversions, which would make the program ill-formed.

In any case, we don't have even such implied wording for character-literals.

Suggested Resolution

Presumably, the unsigned version of the Type column in the table should also be the type of the single code unit for that character-literal, and the unsigned version of the element type of a string-literal should also be the type of a single code unit for that string-literal. We should clearly specify the type of a single code unit for the character-literal and the string-literal.

frederick-vs-ja commented 1 year ago

I think the key point should be that a code unit (at compilation time) should have an unsigned type.

In practice, signed char and wchar_t are often used to represent UTF code units (at runtime). But the term "code unit" in the standard seems only used for translation, so IMO we don't need to cover such usages.

Alternative suggested resolution:

Change [lex.charset] p8 to

A code unit of a character-literal or a string-literal is an integer value of the unsigned version of the underlying type of the character type ([basic.fundamental]) determined by its kind ([lex.ccon], [lex.string]), except that the character type is always char for a character-literal without encoding prefix. [...]

Change [lex.ccon] p3.1 to

[...] is the code unit value of the specified character as encoded in the literal's associated character encoding, converted to the character-literal's type.

Change [lex.string] p10 to

String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's sequence of s-char (originally from non-raw string literals) and r-chars (originally from raw string literals), plus a terminating U+0000 NULL character, each converted to the string-literal's array element type, in order as follows: [...]

xmh0511 commented 1 year ago

except that the character type is always char for a character-literal without encoding prefix.

What does this wording intend to mean?

char c = '@';

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

frederick-vs-ja commented 1 year ago

except that the character type is always char for a character-literal without encoding prefix.

What does this wording intend to mean?

char c = '@';

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

'@' will be added to the basic literal character set by P2558 in C++26 (see also https://wg21.link/p2558/github). Perhaps no CWG issue is needed for this.

xmh0511 commented 1 year ago

except that the character type is always char for a character-literal without encoding prefix.

But the intent of this wording is still unclear. A character-literal without encoding prefix is already specified to have type char. The type of the code unit does not modify the character-literal's type, I think. Even though it is a character-literal without an encoding prefix, we can still say the code unit has the unsigned version of that type.

tahonermann commented 1 year ago

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

For characters that are not in the basic literal character set, how such characters are encoded for ordinary character literals and wide character literals is implementation-defined because the character encoding used is implementation-defined in those cases. Thus, there is no requirement that the value be positive.

xmh0511 commented 1 year ago

the character encoding used is implementation-defined in those cases.

Could you please supply a link to where either GCC or Clang documents this part? I wonder how they place the UTF-8 code units (commonly known, this is the encoding they adopt, and the values are positive) into char for ordinary literals.

frederick-vs-ja commented 1 year ago

The code point of @ is \u0040, which is not within the basic literal character set, but its value is positive, I think.

For characters that are not in the basic literal character set, how such characters are encoded for ordinary character literals and wide character literals is implementation-defined because the character encoding used is implementation-defined in those cases. Thus, there is no requirement that the value be positive.

Oh... do you mean that when char is signed and used for representing UTF-8 code units in character literals, the code unit values are first considered to be converted into [-128, 127] before further processing in translation?

jensmaurer commented 1 year ago

I don't think the "code unit type" is observable, so why would the standard need to specify it? As Tom pointed out, encodings of ordinary and wide character/string literals are implementation-defined.

If you feel there are gaps in the documentation for implementation-defined behavior for some compilers, feel free to post bug reports to them.

xmh0511 commented 1 year ago

I don't think the "code unit type" is observable

Consider this example, which is specified by the standard to use UTF-8 encoding:

auto* ptr = u8"牛逼";

The code points of these two characters are \u725B and \u903C, and the sequence of code unit values in UTF-8 encoding is {0xE7, 0x89, 0x9B, 0xE9, 0x80, 0xBC}. We say

String literal objects are initialized with the sequence of code unit values...

So, what is the type of each code unit value in the sequence {0xE7, 0x89, 0x9B, 0xE9, 0x80, 0xBC}? If we didn't specify the type, how do we check the initialization of the array object?

tahonermann commented 1 year ago

Oh... do you mean that when char is signed and used for representing UTF-8 code units in character literals, the code unit values are first considered to be converted into [-128, 127] before further process in translation?

Since the literal encoding is implementation-defined, yes, that would be a conforming implementation.

Consider this example, which is specified by the standard to have utf-8 encoding:

In that example, auto will be deduced as const char8_t (which has unsigned char as its underlying type; [basic.fundamental]p9).

So, what is the type of each code unit value in the sequence {0xE7, 0x89, 0x9B, 0xE9, 0x80, 0xBC }? If we didn't specify the type, how do we check the initialization for array object?

I don't see why a type is relevant. As Jens stated, these intermediate values are not observable. For the encodings specified by the standard (UTF-8, UTF-16, and UTF-32), the code unit values are guaranteed to be in the range of the element type of the character or string literal.

jensmaurer commented 1 year ago

Per [lex.string] table 12, the string literal u8"牛逼" has type "array of const char8_t", which we initialize with the code unit values you gave. We know the underlying type, so we know how the values end up.

What kind of checking do you envision for the array object? Note that we're not doing brace-initialization here, but initializing each element individually, so we don't apply narrowing checks.

xmh0511 commented 1 year ago

which we initialize with the code unit values you gave. We know the underlying type, so we know how the values end up.

So, why is it necessary to say

A code unit is an integer value of character type ([basic.fundamental]).

We could instead say that a code unit is an integer value of an implementation-chosen character type. In other words, is it the intent that an implementation can use code units of type char to initialize character objects of type char8_t?

For example:

u8"牛逼";

Can the array object of type "array of N char8_t" be initialized by the sequence (char)0xE7, (char)0x89, (char)0x9B, (char)0xE9, (char)0x80, (char)0xBC? Is that a conforming implementation?

jensmaurer commented 1 year ago

Yes, I think that's conforming.

You couldn't tell the difference even if char is signed (because conversion from/to signed/unsigned integer variants is simply a modulo 2^N operation).

xmh0511 commented 1 year ago

I think we could change [lex.charset] p8 to

A code unit is an integer value of the implementation-chosen character type.

This makes sense here.

frederick-vs-ja commented 1 year ago

Perhaps we don't need to assign types (in the C++ type system) to code units, since mathematical integer values seem sufficient.

However, [lex.ccon] p3.1 currently says (emphasis mine):

A character-literal with a c-char-sequence consisting of a single basic-c-char, simple-escape-sequence, or universal-character-name is the code unit value of the specified character as encoded in the literal's associated character encoding.

which possibly implies that the code unit value in such case has the same type as the character literal.

tahonermann commented 1 year ago

I think we could change [lex.charset] p8 to

A code unit is an integer value of the implementation-chosen character type.

This makes sense here.

I disagree. The term "code unit" has broader applicability; it isn't intended to be used solely for the initialization of character and string literals. The initialized array continues to hold code unit values following the initialization.

xmh0511 commented 1 year ago

The initialized array continues to hold code unit values following the initialization.

The initialized array just holds the integer value of the code units. Is there any conflict between the integer value of a code unit and its type? It is similar to how we can use an int object to hold the value of a char that represents a character in the basic character set. It does not change anything; we just use that object to hold the value, as long as the whole value representation can be held.

tahonermann commented 1 year ago

The initialized array just holds the integer value of the code units.

Agreed.

Is there any conflict between the integer value of a code unit and its type?

I don't believe so.

It is similar to how we can use an int object to hold the value of a char that represents a character in the basic character set. It does not change anything; we just use that object to hold the value, as long as the whole value representation can be held.

Agreed, but there are associated semantics. If an int holds a distance, it is important to know if that distance is specified in SI or English units. Likewise, when considering code unit values, it is important to know for which encoding. 0xFF is a valid code unit value for ISO-8859-1 but not for UTF-8. Since the encoding is implementation-defined in the case of ordinary and wide character and string literals, the standard must defer to the implementation with regard to code unit values (or other encoding specific properties).

xmh0511 commented 1 year ago

Since the encoding is implementation-defined in the case of ordinary and wide character and string literals, the standard must defer to the implementation with regard to code unit values (or other encoding specific properties).

I just suggest that the type of the code unit be an implementation-chosen character type. Both its value and its type would then be left unspecified by the standard.

tahonermann commented 1 year ago

I just suggest that the type of the code unit be an implementation-chosen character type. Both its value and its type would then be left unspecified by the standard.

I don't see how additional discussion of an unobservable type would make the specification more clear.

I scrolled back up to re-read the discussion so far. This issue started by quoting this wording:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

I wonder if there is a misunderstanding here. The intent of "cannot be encoded as a single code unit" has nothing to do with the range of the code unit type; it has to do with whether the encoding specifies that multiple code units are encoded for a given character (e.g., UTF-8 specifies that U+00E9 (é) is encoded as two code units, 0xC3 0xA9). Perhaps the following change would make this more clear.

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or is encoded with multiple code units.

xmh0511 commented 1 year ago

I wonder if there is a misunderstanding here. The intent of the "cannot be encoded as a single code unit" has nothing to do with the range of the code unit type;

I do think that whether a character can be encoded in a single code unit is sensitive to the type. For example, U+00E9 (é) cannot be encoded as a single code unit in UTF-8, but it can in UTF-16 or UTF-32; that is why we specify the types of character literals for those two encodings as char16_t and char32_t, because we have to guarantee that the value of the code unit is representable in an object of that type.

tahonermann commented 1 year ago

It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present.

I'm not sure what would be helpful though; there is no observable distinction between encoding the (unsigned) sequence {0xC3, 0xA9} of UTF-8 code units in a sequence of char when char is an 8-bit signed type, and encoding {-0x3D, -0x57} while claiming an implementation-defined encoding that works just like UTF-8 except that code unit values above 0x7F are encoded in two's complement.

xmh0511 commented 1 year ago

It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present.

Yes, that is just what I meant, as well as what you suggested in the above comment (https://github.com/cplusplus/CWG/issues/285#issuecomment-1493699830).

frederick-vs-ja commented 1 year ago

It sounds like you are arguing for adding additional stipulations regarding the properties an encoding must have to be considered valid for use as the ordinary or wide literal encoding. If so, I agree that wording in that regard is likely deficient at present.

I guess the type-related issue might be that we haven't forbidden pathological choices that would make a non-null character literal equal to '\0'.

Perhaps it would be better to specify that the range of valid code unit values of the ordinary or wide literal encoding is not wider than the range of char or wchar_t, respectively.

jensmaurer commented 1 year ago

CWG2779