Quuxplusone / LLVMBugzillaTest


_mm_load_si128() not generating 128-bit atomic loads. #46395

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR47426
Status NEW
Importance P enhancement
Reported by Anmol P. Paralkar (anmparal@cisco.com)
Reported on 2020-09-04 19:18:18 -0700
Last modified on 2020-09-04 20:58:47 -0700
Version trunk
Hardware PC All
CC anmparal@cisco.com, craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Given the test:

  1 #include <stdint.h>
  2 #include <xmmintrin.h>
  3 #include <x86intrin.h>
  4
  5 uint32_t read_128b(__m128i *ptr)
  6 {
  7   __m128i val = _mm_load_si128(ptr);
  8   return ((uint32_t *) &val)[0]|
  9          ((uint32_t *) &val)[1]|
 10          ((uint32_t *) &val)[2]|
 11          ((uint32_t *) &val)[3];
 12 }

 With clang version 12.0.0 (https://github.com/llvm/llvm-project.git
 4eef14f9780d9fc9a88096a3cabd669bcfa02bbc 09/04/2020) the _mm_load_si128()
 is translated at '-O2 -msse2' to:

        movq    (%rdi), %rcx
        movq    8(%rdi), %rdx

 This is not in accordance with Ref. [0], which specifies:

 Synopsis
 __m128i _mm_load_si128 (__m128i const* mem_addr)
 #include <emmintrin.h>
 Instruction: movdqa xmm, m128
 CPUID Flags: SSE2

 (Note: gcc-10.1.0 and icc-16.0.5.027b both generate a movdqa as expected).

 The 32-bit accesses at lines 8 through 11 cause the problematic 64-bit loads;
 modifying the code (see the '<<<' markers) so that:

  1 #include <stdint.h>
  2 #include <xmmintrin.h>
  3 #include <x86intrin.h>
  4
  5 uint32_t read_128b(__m128i *ptr, uint8_t index) <<<
  6 {
  7   __m128i val = _mm_load_si128(ptr);
  8   return ((uint32_t *) &val)[index];            <<<
  9 }

 - we see that the _mm_load_si128() is translated to: movaps  (%rdi), %xmm0
 as expected. (Note: Per Ref. [1], movaps and movdqa are interchangeable).
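
 Similarly, ORing the four lanes with vector operations instead of 32-bit
 scalar accesses keeps the value in an XMM register, so the full-width load
 survives. The following is only a sketch; the helper name and the exact
 instruction selection are assumptions, not part of the reported test case:

   #include <stdint.h>
   #include <emmintrin.h>

   /* Sketch: OR all four 32-bit lanes without leaving the vector domain. */
   uint32_t read_128b_or(__m128i *ptr)
   {
     __m128i val = _mm_load_si128(ptr);              /* full 128-bit load   */
     __m128i t   = _mm_or_si128(val,
                     _mm_shuffle_epi32(val, _MM_SHUFFLE(1, 0, 3, 2)));
     t           = _mm_or_si128(t,
                     _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1)));
     return (uint32_t)_mm_cvtsi128_si32(t);          /* low lane holds the OR */
   }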

 The _mm_load_si128() builtin is defined in: clang/lib/Headers/emmintrin.h
 with attribute: __min_vector_width__(128)

 #define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, \
                                           __target__("sse2"),             \
                                           __min_vector_width__(128)))
 ...
 /// ...
 /// This intrinsic corresponds to the <c> VMOVDQA / MOVDQA </c> instruction.
 /// ...
 static __inline__ __m128i __DEFAULT_FN_ATTRS
 _mm_load_si128(__m128i const *__p)
 {
   return *__p;
 }

 Per Ref. [2], "This attribute may be attached to a function and informs the
 backend that this function desires vectors of at least this width to be
 generated. ... This attribute is meant to be a hint to control target
 heuristics that may generate narrower vectors than what the target hardware
 supports." So, it is reasonable to expect that the vector pointed to by '__p'
 is always treated in its 128-bit entirety.
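
 For reference, the documented spelling from Ref. [2] can also be attached to
 ordinary user code; a minimal sketch follows (the function is illustrative
 and not taken from the report):

   #include <stdint.h>
   #include <emmintrin.h>

   /* Sketch: ask the backend for vectors of at least 128 bits here. */
   __attribute__((min_vector_width(128)))
   uint32_t low_lane(__m128i const *p)
   {
     __m128i v = _mm_load_si128(p);
     return (uint32_t)_mm_cvtsi128_si32(v);  /* read only the low 32-bit lane */
   }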

 _mm_load_si128() is converted to the following optimal LLVM IR:

 ; Function Attrs: alwaysinline norecurse nounwind readonly uwtable
 define internal fastcc <2 x i64> @_mm_load_si128(<2 x i64>* nocapture readonly
                                                        %__p) unnamed_addr #2 {
 entry:
   %0 = load <2 x i64>, <2 x i64>* %__p, align 16, !tbaa !2
   ret <2 x i64> %0
 }

 The Function Integration/Inlining pass inlines this _mm_load_si128() body into
 read_128b():

 %0 = load <2 x i64>, <2 x i64>* %ptr, align 16, !tbaa !2

 However, owing to the 32-bit accesses in the subsequent |-expression,
 the 'Combine redundant instructions' pass (InstCombine) converts this load to:

 %1 = load i128, i128* %0, align 16, !tbaa !2

 - which the X86 DAG->DAG Instruction Selection pass then converts to:

  %1:gr64 = MOV64rm %0:gr64, 1, $noreg, 0, $noreg :: \
            (load 8 from %ir.0, align 16, !tbaa !2)
  %2:gr64 = MOV64rm %0:gr64, 1, $noreg, 8, $noreg :: \
            (load 8 from %ir.0 + 8, align 16, !tbaa !2)

 - the problematic 64-bit loads.
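
 The same split can be observed without intrinsics: once the access is an i128
 load, the backend has no legal 128-bit integer type and lowers it as two
 64-bit loads. A sketch (the function name and the exact codegen are
 assumptions):

   #include <stdint.h>

   /* Sketch: a plain 128-bit integer load, the shape InstCombine rewrites
      the vector load into; on x86-64 this typically becomes two movq loads. */
   uint64_t read_u128(const unsigned __int128 *p)
   {
     unsigned __int128 v = *p;                  /* 'load i128' in the IR */
     return (uint64_t)v | (uint64_t)(v >> 64);  /* combine both halves   */
   }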

 Per Ref. [3]/Rationale: "Platforms may rely on volatile loads and stores of
 natively supported data width to be executed as single instruction. For
 example, in C this holds for an l-value of volatile primitive type with native
 hardware support, but not necessarily for aggregate types. The frontend upholds
 these expectations, which are intentionally unspecified in the IR. The rules
 above ensure that IR transformations do not violate the frontend’s contract
 with the language."

 Thus, the LLVM IR generated for loads and stores that operate on 'width'-wide
 vectors in a function carrying __attribute__((min_vector_width(width)))
 should satisfy the following properties:

  a. be at least 'width' bits wide
  b. be marked 'volatile' (to prevent subsequent passes from splitting them up)

 Assuming that property (a) is correctly maintained by the front end, the
 problem reduces to ensuring that property (b) holds.

 Hand-modifying the generated LLVM IR from:

   define internal <2 x i64> @_mm_load_si128(<2 x i64>* %__p) #2 {
   entry:
     %0 = load <2 x i64>, <2 x i64>* %__p, align 16
     ret <2 x i64> %0
   }

 to:

   define internal <2 x i64> @_mm_load_si128(<2 x i64>* %__p) #2 {
   entry:
     %0 = load volatile <2 x i64>, <2 x i64>* %__p, align 16
     ret <2 x i64> %0
   }

 - we see that the 'load volatile <2 x i64>' does get converted to a 'movdqa',
 as expected.
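
 A corresponding source-level experiment is to force the volatile qualifier at
 the access itself. This is only a sketch of a possible workaround, not a
 proposed fix; the cast-based approach and its codegen are assumptions:

   #include <stdint.h>
   #include <emmintrin.h>

   /* Sketch: a volatile 128-bit access that later passes must not split;
      intended to be emitted as a 'load volatile <2 x i64>' in the IR. */
   uint32_t read_128b_workaround(__m128i *ptr)
   {
     __m128i val = *(volatile __m128i *)ptr;
     return ((uint32_t *) &val)[0] |
            ((uint32_t *) &val)[1] |
            ((uint32_t *) &val)[2] |
            ((uint32_t *) &val)[3];
   }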

 PS: the same issue is also seen with __m256i, __m512i and with
     _mm_store_si128(), ...

 I need your input on which of the following directions to take to fix this issue:

 * Marking the load-stores in the intrinsic as volatile during LLVM IR
   generation.

 * (Under an option) prohibiting the Combine redundant instructions pass from
   modifying the vector-load load <2 x i64> into load i128.

 * Making X86 DAG->DAG Instruction Selection generate VMOVDQArm instead of two
   MOV64rm’s on load i128 for SSE.

 References:

 0. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE2
 1. Difference between MOVDQA and MOVAPS x86 instructions?
    https://stackoverflow.com/questions/6678073/difference-between-movdqa-and-movaps-x86-instructions
 2. Clang supports the __attribute__((min_vector_width(width))) attribute.
    https://clang.llvm.org/docs/AttributeReference.html#min-vector-width
 3. Volatile Memory Accesses
    https://llvm.org/docs/LangRef.html#id1277
Quuxplusone commented 4 years ago

The min_vector_width attribute is currently only used to indicate that the compiler should honor 512-bit intrinsics. There's a drop in CPU frequency on some CPUs when using those instructions, so the default behavior is for the auto-vectorizers to avoid them on those CPUs. The attribute disables this behavior.

Can you clarify why splitting these loads is problematic, beyond not matching the documentation? You used the word "atomic" in the bug title, but neither Intel nor AMD guarantees atomic memory access for anything larger than 8 bytes, except via cmpxchg16b.