cplusplus / CWG

Core Working Group

[basic.fundamental] Decide on a unified definition of byte-like types #403

Closed Eisenwave closed 1 year ago

Eisenwave commented 1 year ago

Full name of submitter: Jan Schultke

Reference (section label): [basic.fundamental]

Issue Description

The Status Quo in the C++ Standard

Numerous sections in the C++ standard have an implicit notion of a byte-like type. This always includes unsigned char and std::byte, sometimes includes unsigned ordinary character types (presumably including char when its underlying type is unsigned char), and sometimes includes char. signed char is never considered to be a byte-like type:

Confusing Effects of the Status Quo

The problem with the status quo is that every section has its own concept of a byte-like type, and none of them formalizes it. This leads to surprising and illogical inconsistencies, such as:

  1. A char* can alias any object, but a char[] doesn't provide storage for an object.
  2. A char[] allocated with a new-expression is maximally aligned just like unsigned char[] (presumably so that it can provide storage), but, unlike unsigned char[], a char[] cannot provide storage.
  3. Copying the bytes of a trivially copyable type into a char[] is possible, and implicitly creates objects if done through std::memcpy, but beginning the lifetime of a char[] does not itself implicitly create objects (only unsigned char[] and std::byte[] have that property).
  4. Copying indeterminate bytes of a trivially copyable type into an unsigned char[] may be safe due to the relaxations in [basic.indet]; copying them into a char[] isn't. This is particularly dangerous because trivially copyable types allow transfer of values through char[], but this is seemingly not allowed if any bytes are indeterminate.
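Items 1 and 3 can be made concrete with a minimal sketch. The function names below are hypothetical and only illustrate the wording cited in this issue: the first function relies on an unsigned char array providing storage ([intro.object] p3), and the second relies on the byte round trip through char[] that [basic.types.general] p2 permits.

```cpp
#include <cassert>
#include <cstring>
#include <new>

// Hypothetical helper: constructs an int inside an unsigned char buffer.
// Under the current wording the unsigned char array "provides storage" for
// the int, so the array's own lifetime is not ended by the placement new.
// The same code with a plain char buffer would not get that guarantee.
inline int store_in_unsigned_char_buffer(int value) {
    alignas(int) unsigned char buffer[sizeof(int)];
    int* p = ::new (static_cast<void*>(buffer)) int(value);
    // Non-array placement new constructs the object at the given address.
    assert(static_cast<void*>(p) == static_cast<void*>(buffer));
    return *p;
}

// Hypothetical helper: copies the bytes of an int into a char[] and back
// into the same object, which then holds its original value again.
inline int round_trip_through_char_array(int value) {
    char bytes[sizeof(int)];
    std::memcpy(bytes, &value, sizeof(int));
    value = 0;                                // clobber the original
    std::memcpy(&value, bytes, sizeof(int));  // restore from the char[]
    return value;
}
```

The asymmetry is that char[] is accepted as a byte container in the second function, while only unsigned char[] (or std::byte[]) is accepted as storage in the first.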

Why char Should Be Universally Byte-Like

It is obvious that some definition of a byte-like type would be useful. However, it is not obvious whether char should be included in such a definition, and when. The advantages of universally considering char to be byte-like are:

Minor Negative Consequences of a Byte-Like char

The main downside to making char byte-like is that char[] can provide storage and implicitly create objects. When char[] is used as a string, and not as a byte-like type, this behavior can be surprisingly permissive. However, this is not a safety issue, and the opportunities for compilers to exploit the current restriction for optimization are rare, especially considering that char* can already alias any other pointer.

Suggested Resolution

Unify the definition of a byte-like type. To [basic.fundamental] p7 (or to a separate paragraph) add:

+The types char, unsigned char, and std::byte (from <cstddef>) are collectively called byte-like types.
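As a rough illustration of the proposed definition, the set of byte-like types could be mirrored in user code by a type trait. The name is_byte_like_v is hypothetical and not part of the standard library or of this proposal, and stripping cv-qualifiers is an assumption made here for convenience.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical trait mirroring the proposed wording: char, unsigned char,
// and std::byte are byte-like; signed char is deliberately excluded.
// Assumption: cv-qualified versions are treated as byte-like as well.
template <class T>
inline constexpr bool is_byte_like_v =
    std::is_same_v<std::remove_cv_t<T>, char> ||
    std::is_same_v<std::remove_cv_t<T>, unsigned char> ||
    std::is_same_v<std::remove_cv_t<T>, std::byte>;

static_assert(is_byte_like_v<char>);
static_assert(is_byte_like_v<const unsigned char>);
static_assert(is_byte_like_v<std::byte>);
static_assert(!is_byte_like_v<signed char>);
static_assert(!is_byte_like_v<int>);
```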

In [intro.object] p3:

If a complete object is created in storage associated with another object e of type
-"array of N unsigned char" or of type "array of N std::byte"
+"array of byte-like type"
[...]

In [intro.object] p13:

An operation that begins the lifetime of an array of
-unsigned char or std::byte implicitly
+byte-like type
creates objects within the region of storage occupied by the array.
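A minimal sketch of what this paragraph permits today for unsigned char, and what the change would extend to char. Point and sum_via_implicit_creation are hypothetical names, and the use of std::launder to obtain a pointer to the implicitly created object is an assumption based on one reading of the current rules, not something this issue states.

```cpp
#include <cassert>
#include <new>

// Hypothetical implicit-lifetime type used only for illustration.
struct Point {
    int x;
    int y;
};

inline int sum_via_implicit_creation() {
    // Beginning the lifetime of this unsigned char array implicitly creates
    // objects in its storage; here that is taken to be a Point.
    // With the current wording, a plain char array would not do so.
    alignas(Point) unsigned char storage[sizeof(Point)];

    // Assumption: std::launder is used to get a pointer to the implicitly
    // created Point rather than to the first array element.
    Point* p = std::launder(reinterpret_cast<Point*>(storage));
    assert(static_cast<void*>(p) == static_cast<void*>(storage));

    p->x = 1;
    p->y = 2;
    return p->x + p->y;
}
```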

In [basic.indet] (multiple paragraphs), as well as in [bit.cast] p2:

-unsigned ordinary character type or std::byte
+byte-like type

Editorial Changes

The following sections already consider char to be byte-like, and are only affected editorially.

In [basic.life] p6.4:

[...] except when the conversion is to pointer to cv void, or to pointer to cv void and subsequently to
-pointer to cv char, cv unsigned char, or cv std::byte
+pointer to cv byte-like type

In [basic.types.general] p2:

the underlying bytes making up the object can be copied into an array of
-char, unsigned char, or std::byte
+byte-like type

In [basic.lval] p11.3:

-a char, unsigned char, or std::byte type.
+a byte-like type.

In [expr.new] p16:

For arrays of
-char, unsigned char, and std::byte,
+byte-like type,
the difference between the result of the new-expression and the address returned by the allocation function [...]
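The alignment guarantee that [expr.new] p16 already gives to char arrays can be observed with a small check. This is only a sketch: the function name is hypothetical, and it assumes std::uintptr_t is available and that converting a pointer to it preserves alignment in the usual way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical check: a char array obtained from a new-expression that is
// large enough to hold a std::max_align_t should be aligned for it.
inline bool new_char_array_is_max_aligned() {
    char* p = new char[sizeof(std::max_align_t)];
    assert(p != nullptr);  // plain new throws instead of returning null
    const auto address = reinterpret_cast<std::uintptr_t>(p);
    const bool aligned = address % alignof(std::max_align_t) == 0;
    delete[] p;
    return aligned;
}
```

This is the same guarantee unsigned char[] receives, which is why item 2 in the list of inconsistencies reads oddly: the alignment is there, but the ability to provide storage is not.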

Eisenwave commented 1 year ago

Related: https://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#350. That issue suggested defining char, signed char, and unsigned char as byte-character types, and followed a similar strategy in relaxing requirements. However, it was rejected with the following rationale:

The CWG was not convinced that there was a need to change the existing specification at this time. Some were concerned that there might be implementation difficulties with giving signed char the requisite semantics; implementations for which that is true can currently make char equivalent to unsigned char and avoid those problems, but the suggested change would undermine that strategy.

This new issue does not attempt to give signed char the same semantics as unsigned char. Instead, char can be made unsigned by the implementation, and all the proposed changes become implementable.

jensmaurer commented 1 year ago

The inconsistencies are intentional. Historic CWG discussions were clear that new facilities should not consider char special, and, in hindsight, it was a mistake to e.g. extend the special aliasing allowances to char.

That's why std::byte was introduced, and there was the hope we can eventually remove char from any of the special exception lists it is currently on. Maybe in a decade or two.

Any change of direction in this area needs a paper to EWG; it is out of scope for a core issue.