| | |
|----|----|
| Bugzilla Link | PR36702 |
| Status | RESOLVED FIXED |
| Importance | P enhancement |
| Reported by | Gonzalo BG (gonzalo.gadeschi@gmail.com) |
| Reported on | 2018-03-13 08:38:32 -0700 |
| Last modified on | 2019-05-03 10:20:41 -0700 |
| Version | trunk |
| Hardware | PC All |
| CC | aemerson@apple.com, chandlerc@gmail.com, craig.topper@gmail.com, froydnj@gmail.com, hfinkel@anl.gov, hsivonen@hsivonen.fi, josh@joshmatthews.net, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
| Fixed by commit(s) | rL358286 |
| Attachments | |
| Blocks | PR37087 |
| Blocked by | |
| See also | PR31600, PR15391, PR36796, PR41634, PR41635, PR41636 |
It would be great if there were a canonical way for front-ends to tell LLVM that a vector is a vector of booleans, that is, that the bits in each lane are either all set or all cleared. This is what the following sequence is trying to do:

```llvm
%t = trunc <16 x i8> %v to <16 x i1>
%e1 = sext <16 x i1> %t to <16 x i8>
%e2 = icmp eq <16 x i8> %v, %e1
%e3 = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %e2)
call void @llvm.assume(i1 %e3)
```
Whether a front-end should then insert something like this sequence before each operation on a boolean vector, I don't know; it seems likely to bloat the amount of LLVM IR generated (see the sketch below).
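To make the bloat concern concrete, here is a minimal sketch (an illustration by the editor, not from the original report) of asserting the invariant on both operands of a single mask `and`; the function name `@mask_and` and the choice of a 16 x i8 mask are assumptions for illustration only:

```llvm
; Illustrative sketch: the five-instruction assume sequence is repeated
; once per boolean-vector operand, here for a single mask 'and'.
declare i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1>)
declare void @llvm.assume(i1)

define <16 x i8> @mask_and(<16 x i8> %a, <16 x i8> %b) {
  ; assert that %a is a boolean vector (each lane all-ones or all-zeros)
  %ta = trunc <16 x i8> %a to <16 x i1>
  %ea = sext <16 x i1> %ta to <16 x i8>
  %ca = icmp eq <16 x i8> %a, %ea
  %ra = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %ca)
  call void @llvm.assume(i1 %ra)
  ; ...and the same sequence again for %b
  %tb = trunc <16 x i8> %b to <16 x i1>
  %eb = sext <16 x i1> %tb to <16 x i8>
  %cb = icmp eq <16 x i8> %b, %eb
  %rb = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %cb)
  call void @llvm.assume(i1 %rb)
  %r = and <16 x i8> %a, %b
  ret <16 x i8> %r
}
```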
I'll take a look at this - I added the any_of/all_of reduction support to x86 a while ago; it's probably just not liking how llvm.experimental.vector.reduce.and is expanded.

I'm sorry to say it doesn't make use of the assume calls at all. There might be something we can do to insert AssertSext ops, but I doubt it'll be straightforward.
This does seem similar to [Bug #15391] and [Bug #31600], which both discuss adding signext/zeroext attributes to boolean vector types.
If you're mostly interested in making use of pmovmskb, then you can do that reasonably easily with something like the IR below; the codegen could still be improved a bit in places. I use v16i8 reductions, mainly because v16i1 will probably legalize to them anyhow.
```llvm
declare i8 @llvm.experimental.vector.reduce.or.i8.v16i8(<16 x i8>)
declare i8 @llvm.experimental.vector.reduce.and.i8.v16i8(<16 x i8>)
declare void @llvm.assume(i1)
declare void @a()
declare void @b()

define void @stdsimd_all_8x16(<16 x i8>* noalias nocapture dereferenceable(16) %ptr) unnamed_addr #0 {
  %v = load <16 x i8>, <16 x i8>* %ptr, align 16
  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  %e2 = call i8 @llvm.experimental.vector.reduce.and.i8.v16i8(<16 x i8> %e1)
  %x = icmp ne i8 %e2, 0
  br i1 %x, label %a, label %b
a:
  call void @a()
  ret void
b:
  call void @b()
  ret void
}
```
```asm
.LCPI0_0:
        .zero 16,128
stdsimd_all_8x16: # @stdsimd_all_8x16
        pushq %rax
        vmovdqa (%rdi), %xmm0
        xorl %ecx, %ecx
        vpsllw $7, %xmm0, %xmm0
        vpand .LCPI0_0(%rip), %xmm0, %xmm0
        vpmovmskb %xmm0, %eax
        cmpl $65535, %eax # imm = 0xFFFF
        movl $-1, %eax
        cmovnel %ecx, %eax
        testb %al, %al
        je .LBB0_2
        callq a
        popq %rax
        retq
.LBB0_2: # %b
        callq b
        popq %rax
        retq
```
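For comparison, an `any` counterpart would use the `or` reduction declared above. This is a sketch by the editor, not part of the original comment; one would expect, though it isn't guaranteed, that it selects pmovmskb the same way:

```llvm
declare i8 @llvm.experimental.vector.reduce.or.i8.v16i8(<16 x i8>)
declare void @a()
declare void @b()

define void @stdsimd_any_8x16(<16 x i8>* noalias nocapture dereferenceable(16) %ptr) unnamed_addr #0 {
  %v = load <16 x i8>, <16 x i8>* %ptr, align 16
  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  ; 'or' reduction: non-zero iff any lane is non-zero
  %e2 = call i8 @llvm.experimental.vector.reduce.or.i8.v16i8(<16 x i8> %e1)
  %x = icmp ne i8 %e2, 0
  br i1 %x, label %a, label %b
a:
  call void @a()
  ret void
b:
  call void @b()
  ret void
}
```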
Simon, thank you for looking into this.

What we currently do in the Rust front-end is:

* the work-around you propose, and
* working around the sub-optimal code generation by pattern-matching, in our front-end, every case that we currently care a lot about (all 128-bit and 256-bit vector types that can be used as a mask, for x86/x86_64 with sse2 and avx, and for arm+neon and aarch64+neon).

However, these workarounds currently only work if the whole program, including the std library, is compiled with the appropriate features enabled, because in many cases we pass features directly to LLVM, and whether some feature is enabled in a given context is not something the Rust front-end actually knows or can know, in particular when inlining is required to discover it.

So our current solution is far from satisfactory.
@Simon, you can peek at our current solution here:
https://github.com/rust-lang-nursery/stdsimd/pull/425
Things are particularly bad for ARM and AArch64, where scalar code is currently generated. For example:
## AArch64
LLVM 6 generates:

```asm
all_8x8:
        ldr d0, [x0]
        umov w8, v0.b[0]
        umov w9, v0.b[1]
        tst w8, #0xff
        umov w10, v0.b[2]
        cset w8, ne
        tst w9, #0xff
        cset w9, ne
        tst w10, #0xff
        umov w10, v0.b[3]
        and w8, w8, w9
        cset w9, ne
        tst w10, #0xff
        umov w10, v0.b[4]
        and w8, w9, w8
        cset w9, ne
        tst w10, #0xff
        umov w10, v0.b[5]
        and w8, w9, w8
        cset w9, ne
        tst w10, #0xff
        umov w10, v0.b[6]
        and w8, w9, w8
        cset w9, ne
        tst w10, #0xff
        umov w10, v0.b[7]
        and w8, w9, w8
        cset w9, ne
        tst w10, #0xff
        and w8, w9, w8
        cset w9, ne
        and w0, w9, w8
        ret

any_8x8:
        ldr d0, [x0]
        umov w8, v0.b[0]
        umov w9, v0.b[1]
        orr w8, w8, w9
        umov w9, v0.b[2]
        orr w8, w8, w9
        umov w9, v0.b[3]
        orr w8, w8, w9
        umov w9, v0.b[4]
        orr w8, w8, w9
        umov w9, v0.b[5]
        orr w8, w8, w9
        umov w9, v0.b[6]
        orr w8, w8, w9
        umov w9, v0.b[7]
        orr w8, w8, w9
        tst w8, #0xff
        cset w0, ne
        ret
```
Manually generated:

```asm
all_8x8:
        ldr d0, [x0]
        mov v0.d[1], v0.d[0]
        uminv b0, v0.16b
        fmov w8, s0
        tst w8, #0xff
        cset w0, ne
        ret

any_8x8:
        ldr d0, [x0]
        mov v0.d[1], v0.d[0]
        umaxv b0, v0.16b
        fmov w8, s0
        tst w8, #0xff
        cset w0, ne
        ret
```
## ARMv7+NEON
LLVM 6 generates:

```asm
all_8x8:
        vmov.i8 d0, #0x1
        vldr d1, [r0]
        vtst.8 d0, d1, d0
        vext.8 d1, d0, d0, #4
        vand d0, d0, d1
        vext.8 d1, d0, d0, #2
        vand d0, d0, d1
        vdup.8 d1, d0[1]
        vand d0, d0, d1
        vmov.u8 r0, d0[0]
        and r0, r0, #1
        bx lr

any_8x8:
        vmov.i8 d0, #0x1
        vldr d1, [r0]
        vtst.8 d0, d1, d0
        vext.8 d1, d0, d0, #4
        vorr d0, d0, d1
        vext.8 d1, d0, d0, #2
        vorr d0, d0, d1
        vdup.8 d1, d0[1]
        vorr d0, d0, d1
        vmov.u8 r0, d0[0]
        and r0, r0, #1
        bx lr
```
Manually generated:

```asm
all_8x8:
        vldr d0, [r0]
        vpmin.u8 d16, d0, d16
        vpmin.u8 d16, d16, d16
        vpmin.u8 d0, d16, d16
        vmov.u8 r0, d0[0]
        bx lr

any_8x8:
        vldr d0, [r0]
        vpmax.u8 d16, d0, d16
        vpmax.u8 d16, d16, d16
        vpmax.u8 d0, d16, d16
        vmov.u8 r0, d0[0]
        bx lr
```
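Both hand-written sequences above are unsigned min/max reductions (uminv/umaxv on AArch64, vpmin/vpmax chains on ARMv7). The sketch below, an editor's illustration rather than IR from the thread, expresses the same tests through the experimental umin/umax reduction intrinsics; whether the ARM backends would actually select the horizontal instructions from this form is an assumption, and is essentially what this report asks for. It also assumes mask lanes are all-ones or all-zeros:

```llvm
declare i8 @llvm.experimental.vector.reduce.umin.i8.v8i8(<8 x i8>)
declare i8 @llvm.experimental.vector.reduce.umax.i8.v8i8(<8 x i8>)

; all_8x8: every lane is non-zero iff the unsigned minimum is non-zero
define i1 @all_8x8(<8 x i8>* %ptr) {
  %v = load <8 x i8>, <8 x i8>* %ptr, align 8
  %m = call i8 @llvm.experimental.vector.reduce.umin.i8.v8i8(<8 x i8> %v)
  %r = icmp ne i8 %m, 0
  ret i1 %r
}

; any_8x8: some lane is non-zero iff the unsigned maximum is non-zero
define i1 @any_8x8(<8 x i8>* %ptr) {
  %v = load <8 x i8>, <8 x i8>* %ptr, align 8
  %m = call i8 @llvm.experimental.vector.reduce.umax.i8.v8i8(<8 x i8> %v)
  %r = icmp ne i8 %m, 0
  ret i1 %r
}
```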
On ppc64le with altivec and vsx (https://godbolt.org/g/pKsjVA) it generates:
```asm
stdsimd_all_8x16: # @stdsimd_all_8x16
.Lfunc_gep0:
        addis 2, 12, .TOC.-.Lfunc_gep0@ha
        addi 2, 2, .TOC.-.Lfunc_gep0@l
        mflr 0
        std 0, 16(1)
        stdu 1, -32(1)
        lvx 2, 0, 3
        xxswapd 35, 34
        xxland 0, 34, 35
        xxspltw 1, 0, 2
        xxland 34, 0, 1
        vsplth 19, 2, 6
        xxland 34, 34, 51
        vspltb 3, 2, 14
        xxland 13, 34, 35
        xxswapd 0, 13
        mfvsrd 12, 0
        clrldi 3, 12, 56
        andi. 3, 3, 1
        bc 4, 1, .LBB0_2
        bl a
        nop
        b .LBB0_3
.LBB0_2: # %b
        bl b
        nop
.LBB0_3: # %a
        addi 1, 1, 32
        ld 0, 16(1)
        mtlr 0
        blr
        .long 0
        .quad 0
```
but IIUC it should just do a single vcmpequb and read the result from CR6.
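For reference, IR already has a target intrinsic that yields the CR6 predicate from a single vcmpequb. The sketch below is an editor's illustration, not code from the thread; the function name and the CR6 selector value 2 (which altivec.h calls __CR6_LT and uses for the vec_all_* predicates) are assumptions:

```llvm
declare i32 @llvm.ppc.altivec.vcmpequb.p(i32, <16 x i8>, <16 x i8>)

; Sketch: compare all 16 lanes against all-ones and read the
; "all elements compared equal" predicate (selector 2) out of CR6.
define i1 @all_ones_8x16(<16 x i8> %v) {
  %p = call i32 @llvm.ppc.altivec.vcmpequb.p(i32 2, <16 x i8> %v, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
  %r = icmp ne i32 %p, 0
  ret i1 %r
}
```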
CC'ing Amara, as he may be able to advise on the ARM/AArch64 reduction codegen.
X86 improvement: https://reviews.llvm.org/D60610
(In reply to Simon Pilgrim from comment #8)
> X86 improvement: https://reviews.llvm.org/D60610
This landed at rL358286 - I'd recommend splitting off the ARM/AArch64/PowerPC issues into their own bugs, as these are all target-specific.

Resolving, as this bug was originally about x86 (which now behaves).
I've split off bugs for powerpc/arm/aarch64.