Big endian vector intrinsics are not compatible with GCC - Githubissues

Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Big endian vector intrinsics are not compatible with GCC #19761

Open Quuxplusone opened 10 years ago

Quuxplusone commented 10 years ago


Bugzilla Link	PR19762
Status	NEW
Importance	P enhancement
Reported by	James Molloy (james@jamesmolloy.co.uk)
Reported on	2014-05-16 05:03:42 -0700
Last modified on	2019-05-01 03:35:14 -0700
Version	trunk
Hardware	PC Linux
CC	diogo.sampaio@arm.com, hfinkel@anl.gov, james@jamesmolloy.co.uk, kanheim@a-bix.com, llvm-bugs@lists.llvm.org, t.p.northover@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

During a discussion with the GCC folks, two faults in Clang (big endian) were
identified:

   # The lane index to the lane-based vector intrinsics (such as vget_lane) is being treated as the logical lane, not the architectural lane. Richard Earnshaw has confirmed that it should be the architectural lane "as if" loaded by LDR.
   # The LD1 intrinsic is a user override and the compiler should not undo the LD1. The LD1 intrinsic is lowered to a normal LOAD node, so the compiler treats it like any load and ensures it acts as if the load had been performed by LDR. But LD1 should override this behaviour, and the load should be performed as if it were loaded with LD1, not LDR.

The following should be done to fix this:

   # Invert the LLVM-IR lane index created for all v*_lane functions.
   # Perform a reversal on the outcome of a vld1_ intrinsic. With this reversal, the compiler will do the right thing.

Bug 19392 (http://llvm.org/bugs/show_bug.cgi?id=19392) has been reopened for
ARM64. This bug is for AArch32.

Quuxplusone commented 10 years ago

Hi,

Would you be able to provide a specific example (test case) that demonstrates
the desired behavior? As by now, LLVM (for AArch32) generates vld1 machine
instructions from the vld1 instrinsic.

Cheers,
Conny

Quuxplusone commented 5 years ago

hi, I do have an example of this bug:
For the code:
--
#include <arm_neon.h>

int foo(int32x4_t a) {
  return vgetq_lane_s32(a, 0);
}
--
Clang command:
clang --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o -
-S -O3 -mbig-endian

We obtain:
        vrev64.32       q8, q0
        vmov.32 r0, d17[1]
        bx

Where with GCC we obtain:
        vmov.32 r0, d0[0]
        bx      lr
---
That seems an intrinsic problem, as compiling the code:
--
#include <arm_neon.h>

int foo(int32x4_t a) {
  return a[0];
}
--
Clang gives the same result as gcc.

Quuxplusone commented 5 years ago

Looking at the LLVM-IR generated with the command:

clang -emit-llvm --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c
test.c -o - -S -O0 -mbig-endian

---
For the intrinsic we obtain:
define dso_local arm_aapcs_vfpcc i32 @foo(<4 x i32> %a) #0 {
entry:
  %a.addr = alloca <4 x i32>, align 8
  %__s0 = alloca <4 x i32>, align 8
  %__rev0 = alloca <4 x i32>, align 8
  %__ret = alloca i32, align 4
  %tmp = alloca i32, align 4
  store <4 x i32> %a, <4 x i32>* %a.addr, align 8
  %0 = load <4 x i32>, <4 x i32>* %a.addr, align 8
  store <4 x i32> %0, <4 x i32>* %__s0, align 8
  %1 = load <4 x i32>, <4 x i32>* %__s0, align 8
  %2 = load <4 x i32>, <4 x i32>* %__s0, align 8
  %shuffle = shufflevector <4 x i32> %1, <4 x i32> %2, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  store <4 x i32> %shuffle, <4 x i32>* %__rev0, align 8
  %3 = load <4 x i32>, <4 x i32>* %__rev0, align 8
  %4 = bitcast <4 x i32> %3 to <16 x i8>
  %5 = bitcast <16 x i8> %4 to <4 x i32>
  %vget_lane = extractelement <4 x i32> %5, i32 0
  store i32 %vget_lane, i32* %__ret, align 4
  %6 = load i32, i32* %__ret, align 4
  store i32 %6, i32* %tmp, align 4
  %7 = load i32, i32* %tmp, align 4
  ret i32 %7
}
---
With an incorrect shufflevector there.

Where for returning the value by a[0], we obtain the code:
---
define dso_local arm_aapcs_vfpcc i32 @foo2(<4 x i32> %a) #0 {
entry:
  %a.addr = alloca <4 x i32>, align 8
  store <4 x i32> %a, <4 x i32>* %a.addr, align 8
  %0 = load <4 x i32>, <4 x i32>* %a.addr, align 8
  %vecext = extractelement <4 x i32> %0, i32 0
  ret i32 %vecext
}
---
As gcc.

Quuxplusone commented 5 years ago

Hi,

"""
We obtain:
        vrev64.32       q8, q0
        vmov.32 r0, d17[1]
        bx

Where with GCC we obtain:
        vmov.32 r0, d0[0]
        bx      lr
"""

These two sequences are equivalent. Clang reverses then reads the 3rd lane, GCC
does not reverse then reads the 0th lane.

This is due to the way we represent lane indices in LLVM, and not caring enough
to implement obvious fixup patterns (like rev/extract_elt(i) -> extract_elt(n-i-
1)).

The rationale and design is documented here:
http://llvm.org/docs/BigEndianNEON.html

Cheers,

James

Quuxplusone commented 5 years ago

"""
#include <arm_neon.h>

int foo(int32x4_t a) {
  return a[0];
}
"""

Note that, unlike NEON intrinsics, the semantics of square bracket notation
isn't defined anywhere for ARM which is why you end up with different code
generated.

Quuxplusone commented 5 years ago

Indeed true James, thanks. I was just confused that vrev was also reverting the bytes inside each element, as if it was converting memory to register representation. Reading the definition again I see I misread it.

Quuxplusone commented 5 years ago

Coming back to the same:

so let the 128bit vector being passed be i32 elements : {a, b, c, d}

What gcc does is return {a}

Clang does:

vrev64.32 q8 10: That generates: {b, a, d, c}
                                | d16 | d17 |
And returns the last element (d17[1]) that is {c}, where it should return the
second element, (d16[1]) that is {a}.