cplusplus / draft

C++ standards drafts
http://www.open-std.org/jtc1/sc22/wg21/

[lex.ccon] What is the single code unit for an ordinary character literal or wide character literal? CWG2779 #4517

Open xmh0511 opened 3 years ago

xmh0511 commented 3 years ago

[lex.ccon]#1 specifies the following special rule:

A non-encodable character literal is a character-literal whose c-char-sequence consists of a single c-char that is not a numeric-escape-sequence and that specifies a character that either lacks representation in the literal's associated character encoding or that cannot be encoded as a single code unit.

The Unicode Standard specifies how large a code unit is for UTF-8, UTF-16, and UTF-32 respectively, which matches the meaning given in the Wikipedia article Character_encoding. However, the C++ standard does not state how large a code unit is for the encoding of the execution (wide-)character set. So, in this case, how do we determine whether the code point value of a character in an ordinary or wide character literal can be encoded as a single code unit for the corresponding kind of character literal?
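To make the contrast concrete, here is a minimal sketch of the "single code unit" test for the three Unicode encoding forms, where the answer is fully determined by the Unicode Standard. The function names are hypothetical and chosen only for illustration; nothing equivalent can be written for the ordinary or wide literal encoding without knowing what the implementation chose.

```cpp
// Hypothetical helpers (names are illustrative, not from the standard).
// A Unicode scalar value is any code point except the surrogates.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// UTF-8: only U+0000..U+007F are encoded as one 8-bit code unit.
constexpr bool fits_one_utf8_unit(char32_t cp) { return cp <= 0x7F; }

// UTF-16: scalar values up to U+FFFF are encoded as one 16-bit code unit.
constexpr bool fits_one_utf16_unit(char32_t cp) {
    return is_scalar_value(cp) && cp <= 0xFFFF;
}

// UTF-32: every scalar value is encoded as one 32-bit code unit.
constexpr bool fits_one_utf32_unit(char32_t cp) { return is_scalar_value(cp); }

static_assert(fits_one_utf8_unit(U'a'), "ASCII is one UTF-8 code unit");
static_assert(!fits_one_utf8_unit(U'\u00E9'), "U+00E9 needs two UTF-8 code units");
static_assert(fits_one_utf16_unit(U'\u00E9'), "U+00E9 is one UTF-16 code unit");
static_assert(!fits_one_utf16_unit(U'\U0001F600'), "U+1F600 needs a surrogate pair");
static_assert(fits_one_utf32_unit(U'\U0001F600'), "U+1F600 is one UTF-32 code unit");
```

For u8, u, and U character literals these checks say which literals are non-encodable. The open question is what the corresponding predicate would be for '...' and L'...' literals.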

Would it be a good idea to change the wording "cannot be encoded as a single code unit" to "cannot be represented by an object with the type of the corresponding kind of character-literal"?
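One possible reading of the suggested wording can be sketched as a check on the literal's type rather than on the encoding. This is only an illustration under the assumption that "represented by an object" means "the encoded value fits in the value bits of the type"; the name code_unit_fits is hypothetical.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical check: does an encoded value fit in one object of type CharT?
// The unsigned counterpart of CharT is used so that a plain char that happens
// to be signed is still treated as holding CHAR_BIT bits of code unit.
template <class CharT>
constexpr bool code_unit_fits(std::uintmax_t value) {
    return value <= static_cast<std::uintmax_t>(
        std::numeric_limits<typename std::make_unsigned<CharT>::type>::max());
}

static_assert(code_unit_fits<char>(0x41), "'A' fits in char");
static_assert(code_unit_fits<char16_t>(0xFFFF), "char16_t has at least 16 bits");
static_assert(code_unit_fits<char32_t>(0x1F600), "char32_t has at least 32 bits");
```

Note that this answers a different question from the encoding-based one: it needs only the type, but it still says nothing about which value the implementation's encoding assigns to a given character.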

jensmaurer commented 3 years ago

I think [basic.fundamental] p7 and p8 try to establish the relationship between the type and code unit, but this could certainly be clearer.

xmh0511 commented 3 years ago

> I think [basic.fundamental] p7 and p8 try to establish the relationship between the type and code unit, but this could certainly be clearer.

p7 states:

The values of type char can represent distinct codes for all members of the implementation's basic character set.

However, it is unclear whether the wording "implementation's basic character set" refers to the "basic source character set" or the "basic execution character set". Presumably it refers to the latter. But, as stated in [lex.charset]#3, the execution character set is a superset of the basic execution character set.

Let S be the execution character set and A be the basic execution character set, where A ⊆ S.

As [lex.ccon]#tab:lex.ccon.literal indicates, we don't know whether an element of the absolute complement (∁_U A) of the basic execution character set, that is, a character outside A, can be encoded in a char object. After all, the standard does not specify how the execution character set is encoded, except that it specifies the value 0 for the null character.
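The part of p7 that is guaranteed can be checked at compile time. The following is a minimal sketch, assuming C++14 or later; basic_sample is a hypothetical name and holds only a sample of the basic character set, not the complete set.

```cpp
#include <cstddef>

// A sample of basic character set members (letters, digits, punctuation).
constexpr char basic_sample[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

// True if no two characters in s (excluding the terminating null) have the
// same char value.
template <std::size_t N>
constexpr bool all_distinct(const char (&s)[N]) {
    for (std::size_t i = 0; i + 1 < N; ++i)
        for (std::size_t j = i + 1; j + 1 < N; ++j)
            if (s[i] == s[j]) return false;
    return true;
}

// Each sampled member of A gets its own char value, as p7 requires.
static_assert(all_distinct(basic_sample), "basic characters must be distinct");
```

No comparable assertion can be written portably for a character in S but outside A: whether it fits in one char, and with which value, depends on an encoding the standard leaves unspecified.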

jensmaurer commented 3 years ago

This is being addressed by P2314, "Character sets and encodings" (cplusplus/papers#998).

frederick-vs-ja commented 1 month ago

This seems covered by CWG2779.