possible vector instrinsics inconsistency?

avery-laird commented 1 month ago

The following C snippet:

_mm256_storeu_si256((__m256i*)&a[i], _mm256_loadu_si256((__m256i*)&b[i]));

is represented as this IR:

@b = external local_unnamed_addr global [0 x i32], align 4
@a = external local_unnamed_addr global [0 x i32], align 4

define dso_local void @tgt(i32 noundef %0) local_unnamed_addr #1 {
  %2 = sext i32 %0 to i64
  %3 = getelementptr inbounds [0 x i64], ptr @a, i64 0, i64 %2
  %4 = getelementptr inbounds [0 x i64], ptr @b, i64 0, i64 %2
  %5 = load <4 x i64>, ptr %4, align 8
  store <4 x i64> %5, ptr %3, align 8
  ret void
}

Which loads 4 x 64bit integers (even though a and b hold 32 bit integers). However, this seems to respect both the semantics of the intrinsic, and be consistent with LLVM vector semantics:

_mm256_<load/store>u_si256: "Load/store 256-bits of integer data from memory"
"One way to describe the layout is by describing what happens when a vector such as is bitcasted to an integer type with N*M bits, and then following the rules for storing such an integer to memory." (LLVM instruction ref)

So, I expect that copying 8 elements from b to a (eg, 8 elements of 32 bits) must be equivalent to the above code, based on the explanation from (2), like this:

a[i] = b[i];
a[i+1] = b[i+1];
a[i+2] = b[i+2];
a[i+3] = b[i+3];
a[i+4] = b[i+4];
a[i+5] = b[i+5];
a[i+6] = b[i+6];
a[i+7] = b[i+7];

This does not verify. However, if I change <4 x i64> to <8 x 32> like below, the two codes will verify:

define dso_local void @tgt(i32 noundef %0) local_unnamed_addr #1 {
  %2 = sext i32 %0 to i64
  %3 = getelementptr inbounds [0 x i64], ptr @a, i64 0, i64 %2
  %4 = getelementptr inbounds [0 x i64], ptr @b, i64 0, i64 %2
  %5 = load <8 x i32>, ptr %4, align 8
  store <8 x i32> %5, ptr %3, align 8
  ret void
}

Even though, 4x64 and 8x32 should copy the same number of bits, and the LLVM instruction ref seems to suggest the layouts should be the same. Is this an inconsistency of the vector model between alive2 and LLVM? Or, have I misunderstood the LLVM ref, and is it another issue related to how Clang treats the intrinsic, etc. Thanks for any clarification.

Below, I've included the full C example:

#include <stdint.h>
#include <immintrin.h>

extern int a[];
extern int b[];

void src(int i) {
   a[i] = b[i];
   a[i+1] = b[i+1];
   a[i+2] = b[i+2];
   a[i+3] = b[i+3];
   a[i+4] = b[i+4];
   a[i+5] = b[i+5];
   a[i+6] = b[i+6];
   a[i+7] = b[i+7];
}

void tgt(int i) {
   _mm256_storeu_si256((__m256i*)&a[i], _mm256_loadu_si256((__m256i*)&b[i]));
}

nunoplopes commented 1 month ago

Can you give a link to https://alive2.llvm.org with the src and tgt that demonstrates the problem?

avery-laird commented 1 month ago

Here is an example: https://alive2.llvm.org/ce/z/wM9hg4

Plus, the C->IR: https://godbolt.org/z/E6v54Keox

nunoplopes commented 1 month ago

Alive2 is correct. Source is doing operations with i32s, while target does operations with i64. Thus tgt spreads poison between neighboring vector lanes.

avery-laird commented 1 month ago

Thanks for the quick response; so to clarify, is then a bug with how Clang is emitting IR for this instrinsic?

nunoplopes commented 1 month ago

No, the lowering is fine. the intrinsic copies a 256-bit number.

avery-laird commented 1 month ago

Thanks again for your response and feel free not to respond if this is too in the weeds; but I am still confused by the issue here.

At the C level, it seems there is no way to write equivalent scalar and vector versions of the program, where a and b are still arrays of 32 bit ints, because of alive2's encoding (ie spreading poison between vector lanes seems to disagree with the LLVM language ref, which just treats it as a concatenated bitstring). But, if Clang's lowering was to <8 x i32> instead, it would be possible, which points the finger at Clang instead of alive2. But, as you pointed out, the 4x64 lowering is still correct. So this seems paradoxical to me. Anyway, I am mainly trying to pinpoint my own confusion here.

nunoplopes commented 1 month ago

see this: https://godbolt.org/z/j44nd64Kq

avery-laird commented 1 month ago

That example will generate <8x32> loads and stores, but (1) without intrinsics (here I am mainly interested in Clang's intrinsics lowering), and (2) alive2 marks this as an incorrect transformation as well.

Specifically, with my original example, it seems the intrinsic lowering and alive2 encoding are incompatible.

nunoplopes commented 1 month ago

There's a difference in alignment, that's all. See that adjusting alignment to 4 makes Alive2 happy: https://alive2.llvm.org/ce/z/oKxBk6

There's no bug in Alive2. You are either using the wrong intrinsic or not understanding the subtle semantics differences.

AliveToolkit / alive2

possible vector instrinsics inconsistency? #1045