ThePhD / future_cxx

Work done today for the glory of tomorrow - or, C and C++ systems programming papers.
https://thephd.dev/portfolio/standard

N2900 - Consistent, Warningless, and Intuitive Initialization with {} #37

Closed by ThePhD 2 years ago

ThePhD commented 3 years ago

Because I hate myself.

{} deserves to be in C. It's stupid I have to write a paper for something so obvious, but here we are. 12 countries 17 companies and we need a paper full of motivational wank to say "hey maybe do that thing we've been doing for 15 summat years, eh?".

Extremely Functional And Not Way Too Bureaucratic System Of Governance Here, Chief.

ThePhD commented 3 years ago

Lots more wording needs to be fixed for this paper. We'll see how it goes. I should probably finish this one over the next week, after working on the Clang #embed implementation.

https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Consistent,%20Warningless,%20and%20Intuitive%20Initialization%20with%20%7B%7D.html

nikic commented 2 years ago

Not sure if this is the right place to ask... The current draft modifies 6.7.9.10 to add:

If an object that has automatic storage duration is initialized with an empty initializer, its value is the same as the initialization of a static storage duration object.

Why is this addition necessary? Doesn't this contradict other parts of the proposal, especially with regards to unions? This would imply that union U u = {} zeros the whole union, not just the first member.

ThePhD commented 2 years ago

This defers the rules for initialization to the static storage duration rules when = {} is involved. That's intentional, and necessary. You must explicitly state what = {} is going to do.

Note that the change that "zeros the whole union" is an OPTIONAL change:

4.5. OPTIONAL CHANGE 0: Largest-Then-First Initialization - Modify §6.7.9 paragraph 10, last bullet point

It must be voted on by the Committee, IN ADDITION to a vote for the rest of the paper. If the vote for this optional change does not go through, then the zeros-the-whole-union change does not go in, but the rest of the changes do.

I expect it to fail.

ThePhD commented 2 years ago

(Note that without that change, the rules for static init are "first object, and the rest are left unspecified, and padding bits are zero", which is how it's always worked.)
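As a rough illustration of the rule above (not taken from the paper), the sketch below initializes the same union three ways and checks only what is guaranteed in every case: the first named member reads back as zero. The function name is made up for this example, and the empty initializer needs C23 or a compiler that accepts it as an extension (GCC and Clang do).

```c
#include <assert.h>

union U {
    int x;   /* first named member: the only one default initialization targets */
    long y;  /* not initialized by any of the forms below */
};

/* A union is at least as large as its largest member. */
static_assert(sizeof(union U) >= sizeof(long), "union must hold its largest member");

/* Static storage duration, no initializer: paragraph 10 applies directly. */
static union U static_u;

/* Hypothetical helper: returns 1 if the first member is zero in all three cases. */
int first_member_is_zero(void) {
    union U empty_u = {};   /* empty initializer, the subject of the paper */
    union U zero_u = {0};   /* explicit zero for the first member */
    return static_u.x == 0 && empty_u.x == 0 && zero_u.x == 0;
}
```

Calling `first_member_is_zero()` from `main` should return 1 on a conforming implementation. The sketch deliberately never reads `y` or any padding, since that is exactly the part under discussion.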

nikic commented 2 years ago

This defers the rules for initialization to the static storage duration rules when = {} is involved. That's intentional, and necessary. You must explicitly state what = {} is going to do.

The static storage duration rules also zero the padding though, while explicit initialization does not -- it only initializes all subobjects that have not been explicitly initialized using static duration initialization (6.7.9.19). I would have expected that phrasing to also cover the case where zero subobjects are explicitly initialized. That is, in that case all subobjects will receive static duration initialization (aka completely zeroed) while the padding between them remains uninitialized.

Or maybe I have hopelessly confused myself here?

ThePhD commented 2 years ago

You are reading too far. The empty initialization case is handled before we ever reach Paragraph 19, which is describing initialization from a sequence of provided initializers. This is why we provide wording that says "if the initializer is the empty initializer, we perform static initialization". That applies to the whole object, which in this case would be the whole union. That means we stop at §6.7.9¶10, and go no farther in the init rules except to recursively perform initialization of the first member, which again is init as if by static storage duration, while the rest are left unspecified and the padding bits (bits not part of ANY member of a union) are set to 0.
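A minimal sketch of the "recursively" part, assuming a hypothetical union whose first member is itself a struct: with the empty initializer, every member of that first struct is zero, while nothing is promised about the other union member. The type and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct Point {
    int x;
    int y;
};

union Shape {
    struct Point origin;  /* first named member: initialized recursively */
    double radius;        /* not the first member: no value is promised */
};

/* Every union member starts at offset 0. */
static_assert(offsetof(union Shape, origin) == 0, "union members start at offset 0");

/* Hypothetical helper: returns 1 if both fields of the first member are zero. */
int nested_first_member_is_zero(void) {
    union Shape s = {};   /* needs C23, or GCC/Clang accepting it as an extension */
    return s.origin.x == 0 && s.origin.y == 0;
}
```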

nikic commented 2 years ago

To be clear my comment was a response to this:

That's intentional, and necessary. You must explicitly state what = {} is going to do.

That is, I believe that the behavior would still be well-defined (but subtly different) without the addition to paragraph 10, because we would follow the behavior of paragraph 19 with zero initializers. I do understand that with the addition to paragraph 10, paragraph 19 no longer applies.

The reason I'm asking about this is that clang currently implements = {} the way it would behave without the special-case addition in paragraph 10, i.e. the same as = {0} (see https://c.godbolt.org/z/EPT5eMTr8) and I wanted to double check that the difference was intentional.

Based on your response, it seems that the difference is intentional, so clang should switch to using zeroinitializer initialization for the empty initializer list case, to satisfy the new wording.

ThePhD commented 2 years ago

I'm not sure I'm following here, so I think it's me that has the wrong idea. Let's start from the beginning.

We've got a union with 2 members, union U as you defined above:

union U {
    int x;
    long y;
} __attribute__((aligned(16)));

The wording for paragraph 10 says:

— if it has arithmetic type, it is initialized to (positive or unsigned) zero; … — if it is a union, the first named member is initialized (recursively) according to these rules, and any padding is initialized to zero bits;

So by having union U u = {0}, it starts by first initializing the first member of the union (x) to zero. It then literally cannot initialize any more objects because it's a union. This is important because there are footnotes (non-normative) and normative text in paragraph 17 to let us know that we do not treat each member in the union as a "list of things to initialize". It is only the first member which matters:

… Each brace-enclosed initializer list has an associated current object. When no designations are present, subobjects of the current object are initialized in order according to the type of the current object: array elements in increasing subscript order, structure members in declaration order, and the first named member of a union.160) In contrast, a designation causes the following initializer to begin initialization of the subobject described by the designator. Initialization then continues forward in order, beginning with the next subobject after that described by the designator.161)

Two parts are emphasis mine, because they describe what happens to the rest of the union. More accurately, footnote 161 describes why we only deal with the first object, and why the initialization of the rest of the object is unspecified:

161) After a union member is initialized, the next object is not the next member of the union; instead, it is the next subobject of an object containing the union.

= { 0 } and = { } for a union both thus have identical behavior, as the 0 initializes the first member (and 0 has exactly the right behavior that directly matches how static storage duration works for C, according to paragraph 10). Doing = { } means simply doing static initialization for the first member, and leaving the rest of the members of the union (anything that is not part of the first) in an undefined/unspecified state. The generated LLVM bitcode is entirely correct here to mark everything after the first member as undef.

Changing to zeroinitializer is fine because it's standards-compliant (the standard doesn't specify what happens to the rest), but it's wrong to state the change is mandatory. Only the first object (in this case, int x) is initialized to "positive or unsigned zero" (e.g., what {0} would do for that integer) and the rest is left undefined. Padding bits (outside of both the int x and long y) should be set to 0 as per the specification, which is identical behavior.

(Note that I don't know why LLVM isn't setting the bits beyond long y to zeroinitializer and is instead using undef. That sounds like a question for LLVM folk, and my knowledge of LLVM is already at its limit for this conversation; I'm mostly guessing what that bitcode is supposed to mean.)

ThePhD commented 2 years ago

(Upon second reading, I see the LLVM bitcode just literally doesn't include any annotation for the bits beyond the named objects stored inside. So I guess by omission LLVM is zero-padding the structure where necessary, or maybe that's handled in some code generation pass or whatever.)
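To make "bits beyond the named objects" concrete, here is a small sketch using the union from this thread. It only measures the layout; it does not claim anything about what those bytes contain. The `aligned` attribute is a GCC/Clang extension, and the helper name is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Same union as in the thread; the attribute rounds its size up to 16. */
union U {
    int x;
    long y;
} __attribute__((aligned(16)));

/* With 16-byte alignment the size must be a multiple of 16. */
static_assert(sizeof(union U) % 16 == 0, "size is a multiple of the alignment");

/* Hypothetical helper: bytes of the union that lie past its largest member. */
size_t trailing_padding_bytes(void) {
    /* long is at least as large as int, so it is the largest member here */
    return sizeof(union U) - sizeof(long);
}
```

On a typical 64-bit Linux target, where long is 8 bytes, this leaves 8 trailing bytes that belong to neither x nor y. Those are the bytes whose initialization is being debated.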

nikic commented 2 years ago

I'm not sure I'm following here, so I think it's me that has the wrong idea. Let's start from the beginning.

We've got a union with 2 members, union U as you defined above:

union U {
    int x;
    long y;
} __attribute__((aligned(16)));

The wording for paragraph 10 says:

— if it has arithmetic type, it is initialized to (positive or unsigned) zero; … — if it is a union, the first named member is initialized (recursively) according to these rules, and any padding is initialized to zero bits;

So by having union U u = {0}, it starts by first initializing the first member of the union (x) to zero. It then literally cannot initialize any more objects because it's a union. This is important because there are footnotes (non-normative) and normative text in paragraph 17 to let us know that we do not treat each member in the union as a "list of things to initialize". It is only the first member which matters:

I think the important missing bit is that paragraph 10 starts with:

If an object that has static or thread storage duration is not initialized explicitly, then [...]

That is, paragraph 10 does not apply to any cases that use an explicit initializer. If there is an initializer list, then paragraph 10 only comes into play if later parts refer back to static storage duration initialization, like paragraph 19 does.

… Each brace-enclosed initializer list has an associated current object. When no designations are present, subobjects of the current object are initialized in order according to the type of the current object: array elements in increasing subscript order, structure members in declaration order, and the first named member of a union.160) In contrast, a designation causes the following initializer to begin initialization of the subobject described by the designator. Initialization then continues forward in order, beginning with the next subobject after that described by the designator.161)

Two parts are emphasis mine, because they describe what happens to the rest of the union. More accurately, footnote 161 describes why we only deal with the first object, and why the initialization of the rest of the object is unspecified:

161) After a union member is initialized, the next object is not the next member of the union; instead, it is the next subobject of an object containing the union.

= { 0 } and = { } for a union both thus have identical behavior, as the 0 initializes the first member (and 0 has exactly the right behavior that directly matches how static storage duration works for C, according to paragraph 10). Doing = { } means simply doing static initialization for the first member, and leaving the rest of the members of the union (anything that is not part of the first) in an undefined/unspecified state. The generated LLVM bitcode is entirely correct here to mark everything after the first member as undef.

Changing to zeroinitializer is fine because it's standards-compliant (the standard doesn't specify what happens to the rest), but it's wrong to state the change is mandatory. Only the first object (in this case, int x) is initialized to "positive or unsigned zero" (e.g., what {0} would do for that integer) and the rest is left undefined. Padding bits (outside of both the int x and long y) should be set to 0 as per the specification, which is identical behavior.

(Note that I don't know why LLVM isn't setting the bits beyond long y to zeroinitializer and is instead using undef. That sounds like a question for LLVM folk, and my knowledge of LLVM is already at its limit for this conversation; I'm mostly guessing what that bitcode is supposed to mean.)

This is actually exactly what I wanted to highlight here: That clang sets not only the part of the union that belongs to y to undef, but also the padding bytes after y. And the same for the struct case, the padding bytes are left as uninitialized rather than zero.

So I think the core question here ends up being independent of your proposal: Does the standard really require that = {0} initializes padding that is not part of subobjects? Based on my reading (and clang's implementation) the answer is "no", because paragraph 10 does not apply, and other parts don't require padding initialization. But if the answer is "yes", then = {0} would indeed be equivalent to static duration storage initialization, and my original question would be moot.
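For anyone who wants to see where the padding in question lives, the sketch below uses a hypothetical struct with a likely gap between its members. It checks only two things that hold regardless of how the question above is answered: = {0} zeroes the named members, and memset zeroes every byte, padding included. All names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct S {
    char c;  /* usually followed by padding so that i is properly aligned */
    int i;
};

static_assert(sizeof(struct S) >= sizeof(char) + sizeof(int), "members must fit");

/* Bytes of struct S that belong to no member (can be 0 on some ABIs). */
size_t padding_bytes(void) {
    return sizeof(struct S) - sizeof(char) - sizeof(int);
}

/* Returns 1 if both members are zero after = {0}; says nothing about padding. */
int members_are_zero(void) {
    struct S s = {0};
    return s.c == 0 && s.i == 0;
}

/* Returns 1 if every byte of the object is zero right after memset. */
int memset_zeroes_every_byte(void) {
    struct S s;
    unsigned char zeros[sizeof(struct S)] = {0};
    memset(&s, 0, sizeof s);
    return memcmp(&s, zeros, sizeof s) == 0;
}
```

Code that depends on the padding bytes being zero (hashing or comparing whole objects with memcmp, for instance) is on firmer ground with memset than with either initializer form.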

ThePhD commented 2 years ago

It might not require it, but honestly that's beyond this proposal. It was mentioned at the last C Standard Committee meeting that someone (haha, "someone" is going to end up being me, isn't it?) should go through the initialization paragraph, actually name each kind of initialization, and then, rather than have in-line prose like this with conditions and "if"s that might confuse people, just refer to things directly so we stop this at the root.

I was planning to do that after this paper passed, though, as an editorial reorganization.

nikic commented 2 years ago

Yeah, that would be great :) The current phrasing is really not easy to follow.

I think the bit that is relevant to this paper is that if = {0} initialization and static storage duration initialization are not the same (due to different padding initialization requirements), then should = {} have the same semantics as = {0} or the same semantics as static storage duration initialization? If they are the same, then it doesn't matter.

ThePhD commented 2 years ago

Added to C23, with a fix needed in #51.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2912.pdf