Open Quuxplusone opened 10 years ago
Bugzilla Link | PR19762 |
Status | NEW |
Importance | P enhancement |
Reported by | James Molloy (james@jamesmolloy.co.uk) |
Reported on | 2014-05-16 05:03:42 -0700 |
Last modified on | 2019-05-01 03:35:14 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | diogo.sampaio@arm.com, hfinkel@anl.gov, james@jamesmolloy.co.uk, kanheim@a-bix.com, llvm-bugs@lists.llvm.org, t.p.northover@gmail.com |
Hi,
Would you be able to provide a specific example (test case) that demonstrates
the desired behavior? As of now, LLVM (for AArch32) generates vld1 machine
instructions from the vld1 intrinsic.
Cheers,
Conny
Hi, I do have an example of this bug.
For the code:
--
#include <arm_neon.h>
int foo(int32x4_t a) {
return vgetq_lane_s32(a, 0);
}
--
Clang command:
clang --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o - -S -O3 -mbig-endian
We obtain:
vrev64.32 q8, q0
vmov.32 r0, d17[1]
bx
Where with GCC we obtain:
vmov.32 r0, d0[0]
bx lr
---
This seems to be a problem with the intrinsic, because when compiling the code:
--
#include <arm_neon.h>
int foo(int32x4_t a) {
return a[0];
}
--
Clang gives the same result as gcc.
Looking at the LLVM-IR generated with the command:
clang -emit-llvm --target=arm-arm-none-eabi -march=armv8-a -mfloat-abi=hard -c test.c -o - -S -O0 -mbig-endian
---
For the intrinsic we obtain:
define dso_local arm_aapcs_vfpcc i32 @foo(<4 x i32> %a) #0 {
entry:
%a.addr = alloca <4 x i32>, align 8
%__s0 = alloca <4 x i32>, align 8
%__rev0 = alloca <4 x i32>, align 8
%__ret = alloca i32, align 4
%tmp = alloca i32, align 4
store <4 x i32> %a, <4 x i32>* %a.addr, align 8
%0 = load <4 x i32>, <4 x i32>* %a.addr, align 8
store <4 x i32> %0, <4 x i32>* %__s0, align 8
%1 = load <4 x i32>, <4 x i32>* %__s0, align 8
%2 = load <4 x i32>, <4 x i32>* %__s0, align 8
%shuffle = shufflevector <4 x i32> %1, <4 x i32> %2, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
store <4 x i32> %shuffle, <4 x i32>* %__rev0, align 8
%3 = load <4 x i32>, <4 x i32>* %__rev0, align 8
%4 = bitcast <4 x i32> %3 to <16 x i8>
%5 = bitcast <16 x i8> %4 to <4 x i32>
%vget_lane = extractelement <4 x i32> %5, i32 0
store i32 %vget_lane, i32* %__ret, align 4
%6 = load i32, i32* %__ret, align 4
store i32 %6, i32* %tmp, align 4
%7 = load i32, i32* %tmp, align 4
ret i32 %7
}
---
Note the shufflevector there, which looks incorrect.
Whereas when returning the value via a[0], we obtain the code:
---
define dso_local arm_aapcs_vfpcc i32 @foo2(<4 x i32> %a) #0 {
entry:
%a.addr = alloca <4 x i32>, align 8
store <4 x i32> %a, <4 x i32>* %a.addr, align 8
%0 = load <4 x i32>, <4 x i32>* %a.addr, align 8
%vecext = extractelement <4 x i32> %0, i32 0
ret i32 %vecext
}
---
This matches what GCC generates.
Hi,
"""
We obtain:
vrev64.32 q8, q0
vmov.32 r0, d17[1]
bx
Where with GCC we obtain:
vmov.32 r0, d0[0]
bx lr
"""
These two sequences are equivalent. Clang reverses then reads the 3rd lane, GCC
does not reverse then reads the 0th lane.
This is due to the way we represent lane indices in LLVM, and to not caring enough
to implement obvious fixup patterns (like rev/extract_elt(i) -> extract_elt(n-i-1)).
The rationale and design is documented here:
http://llvm.org/docs/BigEndianNEON.html
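
As a rough illustration of the fixup pattern mentioned above, here is a minimal portable C sketch. It models a `<4 x i32>` vector as a plain array and checks that extracting lane `i` after a full lane reversal gives the same value as extracting lane `n-i-1` directly. All names in it (`reverse_lanes`, `extract_after_reverse`, `extract_folded`, `fixup_is_equivalent`) are hypothetical; this is not LLVM code and not how a real DAG combine would be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* models a <4 x i32> vector */

/* Hypothetical model of a full lane reversal, like the
   shufflevector <3, 2, 1, 0> seen in the -O0 IR. */
static void reverse_lanes(const int32_t in[LANES], int32_t out[LANES]) {
    for (size_t i = 0; i < LANES; ++i)
        out[i] = in[LANES - 1 - i];
}

/* extract_elt(rev(v), i): the form emitted without the fixup. */
static int32_t extract_after_reverse(const int32_t v[LANES], size_t i) {
    int32_t tmp[LANES];
    reverse_lanes(v, tmp);
    return tmp[i];
}

/* extract_elt(v, n-i-1): the folded form, with no reversal. */
static int32_t extract_folded(const int32_t v[LANES], size_t i) {
    return v[LANES - 1 - i];
}

/* Returns 1 if both forms agree for every lane index, 0 otherwise. */
static int fixup_is_equivalent(const int32_t v[LANES]) {
    for (size_t i = 0; i < LANES; ++i)
        if (extract_after_reverse(v, i) != extract_folded(v, i))
            return 0;
    return 1;
}
```

Calling `fixup_is_equivalent` on any four-element array should return 1, which is the sense in which the reversal followed by an extract can be folded into a single extract with a mirrored index.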
Cheers,
James
"""
#include <arm_neon.h>
int foo(int32x4_t a) {
return a[0];
}
"""
Note that, unlike NEON intrinsics, the semantics of square bracket notation
aren't defined anywhere for ARM, which is why you end up with different code
generated.
Indeed, that's true, James. Thanks. I was confused because I thought vrev was also reversing the bytes inside each element, as if it were converting from memory to register representation. Reading the definition again, I see I misread it.
Coming back to the same example:
Let the 128-bit vector being passed hold the i32 elements {a, b, c, d}.
What GCC does is return {a}.
Clang does:
vrev64.32 q8, q0, which generates: {b, a, d, c}
| d16 | d17 |
and returns the last element (d17[1]), which is {c}, whereas it should return the
second element (d16[1]), which is {a}.
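
The lane arithmetic in this comment can be checked with a small portable C model. It is only a sketch under stated assumptions: the Q register is modelled as four positions listed in the same order as {a, b, c, d} above, with positions 0-1 standing for the low D register and 2-3 for the high one. The names `qreg_model`, `vrev64_32_model` and `read_d_lane` are hypothetical. The model does not settle which position actually holds `a[0]` under the big-endian ABI, which is the point the earlier reply about lane numbering addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a Q register as four 32-bit positions,
   positions 0-1 forming the low D register and 2-3 the high one. */
typedef struct { int32_t lane[4]; } qreg_model;

/* Models vrev64.32: swap the two 32-bit elements inside each 64-bit half. */
static qreg_model vrev64_32_model(qreg_model in) {
    qreg_model out;
    out.lane[0] = in.lane[1];
    out.lane[1] = in.lane[0];
    out.lane[2] = in.lane[3];
    out.lane[3] = in.lane[2];
    return out;
}

/* Models reading dN[idx], where d_half is 0 for the low D register
   (d16 in the example) and 1 for the high one (d17). */
static int32_t read_d_lane(qreg_model q, int d_half, int idx) {
    return q.lane[2 * d_half + idx];
}
```

With {a, b, c, d} set to {1, 2, 3, 4}, the model gives {2, 1, 4, 3} after the reversal, so `read_d_lane(r, 1, 1)` yields c and `read_d_lane(r, 0, 1)` yields a, matching the arithmetic described in the comment under its own assumption about element order.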