Quuxplusone / LLVMBugzillaTest


[x86] llvm.experimental.reduce.{and, any} don't lower properly for boolean vectors #35675

Closed Quuxplusone closed 5 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR36702
Status RESOLVED FIXED
Importance P enhancement
Reported by Gonzalo BG (gonzalo.gadeschi@gmail.com)
Reported on 2018-03-13 08:38:32 -0700
Last modified on 2019-05-03 10:20:41 -0700
Version trunk
Hardware PC All
CC aemerson@apple.com, chandlerc@gmail.com, craig.topper@gmail.com, froydnj@gmail.com, hfinkel@anl.gov, hsivonen@hsivonen.fi, josh@joshmatthews.net, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s) rL358286
Attachments
Blocks PR37087
Blocked by
See also PR31600, PR15391, PR36796, PR41634, PR41635, PR41636
I'd expect there to be a way to lower llvm.experimental.vector.reduce.and to
@llvm.x86.sse2.pmovmskb.128 on x86_64 (and to similar instructions on other
architectures) when the input vectors are boolean vectors, that is, when each
lane has either all bits set or all bits cleared. The same applies to
llvm.experimental.vector.reduce.any.

For example (thanks Chandler for coming up with this):

declare i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1>)
declare void @llvm.assume(i1)
declare void @a()
declare void @b()
define void @stdsimd_all_8x16(<16 x i8>* noalias nocapture dereferenceable(16)
%ptr) unnamed_addr #0 {
  %v = load <16 x i8>, <16 x i8>* %ptr, align 16
  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  %e2 = icmp eq <16 x i8> %v, %e1
  %e3 = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %e2)
  call void @llvm.assume(i1 %e3)
  %x = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %t)
  br i1 %x, label %a, label %b
a:
  call void @a()
  ret void
b:
  call void @b()
  ret void
}

should emit:

 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword ptr [rdi]
 pmovmskb eax, xmm0
 cmp     eax, 65535
 sete    al
 pop     rbp
 ret

and replacing "reduce.and" with "reduce.any" should emit:

 push    rbp
 mov     rbp, rsp
 movdqa  xmm0, xmmword ptr [rdi]
 pmovmskb eax, xmm0
 test    eax, eax
 setne   al
 pop     rbp
 ret
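
For reference, the desired x86 sequences above are the standard SSE2 movemask
idiom. A minimal C sketch using _mm_movemask_epi8 from <emmintrin.h> (the
function names here are illustrative, not part of the bug):

#include <emmintrin.h>

/* m is a boolean vector: every byte is 0x00 or 0xFF. */
int all_8x16(__m128i m) {
  /* pmovmskb gathers the sign bit of each byte into a 16-bit mask */
  return _mm_movemask_epi8(m) == 0xFFFF;
}

int any_8x16(__m128i m) {
  return _mm_movemask_epi8(m) != 0;
}

These compile to pmovmskb followed by a compare/test, which is essentially the
codegen requested above.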
Quuxplusone commented 6 years ago

I'll take a look at this - I added the any_of/all_of reduction support to x86 a while ago; it's probably just not liking how llvm.experimental.vector.reduce.and is expanded.

Quuxplusone commented 6 years ago
It would be great if there were a canonical way for front-ends to tell LLVM that
a vector is a vector of booleans, that is, that each lane has either all bits
set or all bits cleared. This is what the sequence:

  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  %e2 = icmp eq <16 x i8> %v, %e1
  %e3 = call i1 @llvm.experimental.vector.reduce.and.i1.v16i1(<16 x i1> %e2)
  call void @llvm.assume(i1 %e3)

is trying to do.

Whether a front-end should then insert a sequence like this before each
operation on a boolean vector, I don't know; it might bloat the amount of
LLVM IR generated.
Quuxplusone commented 6 years ago
I'm sorry to say it doesn't make use of the assume calls at all. There might be
something we can do to insert AssertSext ops, but I doubt it'll be
straightforward.

This does seem similar to [Bug #15391] and [Bug #31600], which both discuss
adding signext/zeroext attributes to boolean vector types.

If you're mostly interested in making use of pmovmskb, you can do that
reasonably easily with something like the IR below; the codegen could still be
improved a bit in places.

I use v16i8 reductions, mainly because v16i1 will probably legalize to them anyhow.

declare i8 @llvm.experimental.vector.reduce.or.i8.v16i8(<16 x i8>)
declare i8 @llvm.experimental.vector.reduce.and.i8.v16i8(<16 x i8>)
declare void @llvm.assume(i1)

declare void @a()
declare void @b()

define void @stdsimd_all_8x16(<16 x i8>* noalias nocapture dereferenceable(16)
%ptr) unnamed_addr #0 {
  %v = load <16 x i8>, <16 x i8>* %ptr, align 16
  %t = trunc <16 x i8> %v to <16 x i1>
  %e1 = sext <16 x i1> %t to <16 x i8>
  %e2 = call i8 @llvm.experimental.vector.reduce.and.i8.v16i8(<16 x i8> %e1)
  %x = icmp ne i8 %e2, 0
  br i1 %x, label %a, label %b
a:
  call void @a()
  ret void
b:
  call void @b()
  ret void
}

.LCPI0_0:
  .zero 16,128
stdsimd_all_8x16: # @stdsimd_all_8x16
  pushq %rax
  vmovdqa (%rdi), %xmm0
  xorl %ecx, %ecx
  vpsllw $7, %xmm0, %xmm0
  vpand .LCPI0_0(%rip), %xmm0, %xmm0
  vpmovmskb %xmm0, %eax
  cmpl $65535, %eax # imm = 0xFFFF
  movl $-1, %eax
  cmovnel %ecx, %eax
  testb %al, %al
  je .LBB0_2
  callq a
  popq %rax
  retq
.LBB0_2: # %b
  callq b
  popq %rax
  retq
Quuxplusone commented 6 years ago
Simon, thank you for looking into this.

What we currently do in the Rust front-end is:

* use the work-around you propose
* work around the sub-optimal code generation by pattern-matching in our front-
end every case we currently care about (all 128-bit and 256-bit vector types
that can be used as a mask, for x86/x86_64 with SSE2 and AVX, and for ARM+NEON
and AArch64+NEON)

However, these workarounds only work if the whole program, including the std
library, is compiled with the appropriate features enabled, because in many
cases we pass features directly to LLVM, and whether a feature is enabled in a
given context is not something the Rust front-end knows or can know, in
particular when inlining is required to discover it.

So our current solution is far from satisfactory.
Quuxplusone commented 6 years ago
@Simon you can peek at our current solution here:
https://github.com/rust-lang-nursery/stdsimd/pull/425

Things are particularly bad for ARM and AArch64, where scalar code is currently
generated. For example:

## AArch64

LLVM6:

all_8x8:
 ldr     d0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 tst     w8, #0xff
 umov    w10, v0.b[2]
 cset    w8, ne
 tst     w9, #0xff
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[3]
 and     w8, w8, w9
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[4]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[5]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[6]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 umov    w10, v0.b[7]
 and     w8, w9, w8
 cset    w9, ne
 tst     w10, #0xff
 and     w8, w9, w8
 cset    w9, ne
 and     w0, w9, w8
 ret
any_8x8:
 ldr     d0, [x0]
 umov    w8, v0.b[0]
 umov    w9, v0.b[1]
 orr     w8, w8, w9
 umov    w9, v0.b[2]
 orr     w8, w8, w9
 umov    w9, v0.b[3]
 orr     w8, w8, w9
 umov    w9, v0.b[4]
 orr     w8, w8, w9
 umov    w9, v0.b[5]
 orr     w8, w8, w9
 umov    w9, v0.b[6]
 orr     w8, w8, w9
 umov    w9, v0.b[7]
 orr     w8, w8, w9
 tst     w8, #0xff
 cset    w0, ne
 ret

Manually generated:

all_8x8:
 ldr     d0, [x0]
 mov     v0.d[1], v0.d[0]
 uminv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
any_8x8:
 ldr     d0, [x0]
 mov     v0.d[1], v0.d[0]
 umaxv   b0, v0.16b
 fmov    w8, s0
 tst     w8, #0xff
 cset    w0, ne
 ret
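
For what it's worth, the manually generated AArch64 code corresponds to the
ACLE horizontal min/max reductions. A hedged C sketch with <arm_neon.h>
(AArch64 only; function names are illustrative):

#include <arm_neon.h>

/* m is a boolean vector: every lane is 0x00 or 0xFF. */
int all_8x8(uint8x8_t m) {
  /* uminv: the unsigned min across lanes is non-zero only if every lane is set */
  return vminv_u8(m) != 0;
}

int any_8x8(uint8x8_t m) {
  /* umaxv: the unsigned max across lanes is non-zero if at least one lane is set */
  return vmaxv_u8(m) != 0;
}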

## ARMv7+NEON

LLVM6:

all_8x8:
 vmov.i8 d0, #0x1
 vldr    d1, [r0]
 vtst.8  d0, d1, d0
 vext.8  d1, d0, d0, #4
 vand    d0, d0, d1
 vext.8  d1, d0, d0, #2
 vand    d0, d0, d1
 vdup.8  d1, d0[1]
 vand    d0, d0, d1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr
any_8x8:
 vmov.i8 d0, #0x1
 vldr    d1, [r0]
 vtst.8  d0, d1, d0
 vext.8  d1, d0, d0, #4
 vorr    d0, d0, d1
 vext.8  d1, d0, d0, #2
 vorr    d0, d0, d1
 vdup.8  d1, d0[1]
 vorr    d0, d0, d1
 vmov.u8 r0, d0[0]
 and     r0, r0, #1
 bx      lr

Manually generated:

all_8x8:
 vldr    d0, [r0]
 vpmin.u8 d16, d0, d16
 vpmin.u8 d16, d16, d16
 vpmin.u8 d0, d16, d16
 vmov.u8 r0, d0[0]
 bx      lr

any_8x8:
 vldr    d0, [r0]
 vpmax.u8 d16, d0, d16
 vpmax.u8 d16, d16, d16
 vpmax.u8 d0, d16, d16
 vmov.u8 r0, d0[0]
 bx      lr
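
The hand-written ARMv7 sequences above are the usual pairwise-reduction trick;
roughly, in C with <arm_neon.h> (function names are illustrative):

#include <arm_neon.h>

/* m is a boolean vector: every lane is 0x00 or 0xFF. */
int all_8x8_neon(uint8x8_t m) {
  /* three pairwise mins fold the 8 lanes down to one */
  uint8x8_t t = vpmin_u8(m, m);
  t = vpmin_u8(t, t);
  t = vpmin_u8(t, t);
  return vget_lane_u8(t, 0) != 0;
}

int any_8x8_neon(uint8x8_t m) {
  /* three pairwise maxes fold the 8 lanes down to one */
  uint8x8_t t = vpmax_u8(m, m);
  t = vpmax_u8(t, t);
  t = vpmax_u8(t, t);
  return vget_lane_u8(t, 0) != 0;
}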
Quuxplusone commented 6 years ago
On ppc64le with AltiVec and VSX (https://godbolt.org/g/pKsjVA) it generates:

stdsimd_all_8x16: # @stdsimd_all_8x16
.Lfunc_gep0:
  addis 2, 12, .TOC.-.Lfunc_gep0@ha
  addi 2, 2, .TOC.-.Lfunc_gep0@l
  mflr 0
  std 0, 16(1)
  stdu 1, -32(1)
  lvx 2, 0, 3
  xxswapd 35, 34
  xxland 0, 34, 35
  xxspltw 1, 0, 2
  xxland 34, 0, 1
  vsplth 19, 2, 6
  xxland 34, 34, 51
  vspltb 3, 2, 14
  xxland 13, 34, 35
  xxswapd 0, 13
  mfvsrd 12, 0
  clrldi 3, 12, 56
  andi. 3, 3, 1
  bc 4, 1, .LBB0_2
  bl a
  nop
  b .LBB0_3
.LBB0_2: # %b
  bl b
  nop
.LBB0_3: # %a
  addi 1, 1, 32
  ld 0, 16(1)
  mtlr 0
  blr
  .long 0
  .quad 0

but IIUC it should just do a single vcmpequb and read the result from CR6.
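
The single-compare-plus-CR6 pattern described above is what the AltiVec
predicate intrinsics express. A hedged C sketch with <altivec.h> (requires
-maltivec; function names are illustrative):

#include <altivec.h>

/* m is a boolean vector: every byte is 0x00 or 0xFF. */
int all_8x16_vmx(vector unsigned char m) {
  /* vec_all_ne compiles to a single vcmpequb. plus a read of CR6 */
  return vec_all_ne(m, vec_splat_u8(0));
}

int any_8x16_vmx(vector unsigned char m) {
  return vec_any_ne(m, vec_splat_u8(0));
}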
Quuxplusone commented 6 years ago

CC'ing Amara as he may be able to advise on the ARM/AArch64 reduction codegen.

Quuxplusone commented 5 years ago

X86 improvement: https://reviews.llvm.org/D60610

Quuxplusone commented 5 years ago
(In reply to Simon Pilgrim from comment #8)
> X86 improvement: https://reviews.llvm.org/D60610

This landed at rL358286 - I'd recommend splitting off the ARM/AArch64/PowerPC
issues into their own bugs, as these are all target-specific.
Quuxplusone commented 5 years ago

Resolving as this bug was originally about x86 (which now behaves).

I've split off bugs for powerpc/arm/aarch64.