Closed avery-laird closed 1 month ago
Can you give a link to https://alive2.llvm.org with the src and tgt that demonstrates the problem?
Here is an example: https://alive2.llvm.org/ce/z/wM9hg4
Plus, the C->IR: https://godbolt.org/z/E6v54Keox
Alive2 is correct. Source is doing operations with i32s, while target does operations with i64. Thus tgt spreads poison between neighboring vector lanes.
Thanks for the quick response; so to clarify, is then a bug with how Clang is emitting IR for this instrinsic?
No, the lowering is fine. the intrinsic copies a 256-bit number.
Thanks again for your response and feel free not to respond if this is too in the weeds; but I am still confused by the issue here.
At the C level, it seems there is no way to write equivalent scalar and vector versions of the program, where a
and b
are still arrays of 32 bit ints, because of alive2's encoding (ie spreading poison between vector lanes seems to disagree with the LLVM language ref, which just treats it as a concatenated bitstring). But, if Clang's lowering was to <8 x i32> instead, it would be possible, which points the finger at Clang instead of alive2. But, as you pointed out, the 4x64 lowering is still correct. So this seems paradoxical to me. Anyway, I am mainly trying to pinpoint my own confusion here.
see this: https://godbolt.org/z/j44nd64Kq
That example will generate <8x32> loads and stores, but (1) without intrinsics (here I am mainly interested in Clang's intrinsics lowering), and (2) alive2 marks this as an incorrect transformation as well.
Specifically, with my original example, it seems the intrinsic lowering and alive2 encoding are incompatible.
There's a difference in alignment, that's all. See that adjusting alignment to 4 makes Alive2 happy: https://alive2.llvm.org/ce/z/oKxBk6
There's no bug in Alive2. You are either using the wrong intrinsic or not understanding the subtle semantics differences.
The following C snippet:
is represented as this IR:
Which loads 4 x 64bit integers (even though
a
andb
hold 32 bit integers). However, this seems to respect both the semantics of the intrinsic, and be consistent with LLVM vector semantics:_mm256_<load/store>u_si256
: "Load/store 256-bits of integer data from memory"So, I expect that copying 8 elements from
b
toa
(eg, 8 elements of 32 bits) must be equivalent to the above code, based on the explanation from (2), like this:This does not verify. However, if I change
<4 x i64>
to<8 x 32>
like below, the two codes will verify:Even though, 4x64 and 8x32 should copy the same number of bits, and the LLVM instruction ref seems to suggest the layouts should be the same. Is this an inconsistency of the vector model between alive2 and LLVM? Or, have I misunderstood the LLVM ref, and is it another issue related to how Clang treats the intrinsic, etc. Thanks for any clarification.
Below, I've included the full C example: