| Bugzilla Link | PR39859 |
| --- | --- |
| Status | RESOLVED FIXED |
| Importance | P enhancement |
| Reported by | Nikita Popov (nikita.ppv@gmail.com) |
| Reported on | 2018-12-02 02:39:28 -0800 |
| Last modified on | 2019-04-23 08:34:33 -0700 |
| Version | trunk |
| Hardware | PC Windows NT |
| CC | andrea.dibiagio@gmail.com, craig.topper@gmail.com, danielwatson311@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
| Fixed by commit(s) | rL349304, rL352283, rL358999 |
| See also | PR14613, PR34621, PR40053 |
The compiler could also generate saturating arithmetic for this particular case:
vpsubusw .LCPI0_0(%rip), %xmm0, %xmm2
vpxor %xmm3, %xmm3, %xmm3
vpcmpeqw %xmm2, %xmm3, %xmm2
vpblendvb %xmm2, %xmm0, %xmm1, %xmm0
Performance counter stats for './test.elf' (average of 5 runs):
134,678,425 cycles:u ( +- 0.02% ) (75.77%)
135,127,545 instructions:u # 1.00 insn per cycle ( +- 0.12% ) (75.71%)
The latency/throughput of vpsubusw is even better than pmin/pmax on Skylake: according to Agner Fog's instruction tables, its reciprocal throughput is 0.33, versus 0.5 for min/max. On all other x86 processors it should be at least as good as pmin/pmax.
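For reference, here is a minimal C sketch of that saturating-subtract lowering using SSE intrinsics (the function name and argument layout are mine, not from the original test case):

```c
#include <immintrin.h>

/* Per u16 lane: keep x where x <= 255, otherwise take y.
   subs(x, 255) is nonzero exactly when x > 255, so comparing the
   saturating difference against zero yields the blend mask directly. */
static __m128i select_le255(__m128i x, __m128i y) {
    __m128i diff = _mm_subs_epu16(x, _mm_set1_epi16(255));      /* vpsubusw */
    __m128i le   = _mm_cmpeq_epi16(diff, _mm_setzero_si128());  /* lanes where x <= 255 */
    return _mm_blendv_epi8(y, x, le);                           /* vpblendvb */
}
```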
On Jaguar only, the presence of a variable blend is problematic because it is microcoded (it decodes into 3 COPs; total latency 2cy; reciprocal throughput 2). So this variant (for Jaguar only) would be better:
vpsubusw .LCPI0_0(%rip), %xmm0, %xmm2
vpxor %xmm3, %xmm3, %xmm3
vpcmpeqw %xmm2, %xmm3, %xmm2
vpand %xmm0, %xmm2, %xmm3
vpandn %xmm1, %xmm2, %xmm2
vpor %xmm3, %xmm2, %xmm0
Performance counter stats for './test.elf' (average of 5 runs):
100,932,846 cycles:u ( +- 0.01% ) (75.08%)
203,128,065 instructions:u # 2.01 insn per cycle ( +- 0.21% ) (74.81%)
We generate more instructions, but the code has a better mix of FP0/FP1 instructions, and the out-of-order backend does a much better job of distributing uOps.
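The same select written with the bitwise and/andn/or expansion above would look like this in intrinsics (again a sketch; names are mine):

```c
#include <immintrin.h>

/* Jaguar-friendly variant: the variable blend is replaced by
   (mask & x) | (~mask & y), i.e. vpand/vpandn/vpor. */
static __m128i select_le255_bitwise(__m128i x, __m128i y) {
    __m128i diff = _mm_subs_epu16(x, _mm_set1_epi16(255));
    __m128i le   = _mm_cmpeq_epi16(diff, _mm_setzero_si128());  /* x <= 255 mask */
    __m128i keep = _mm_and_si128(le, x);                        /* vpand  */
    __m128i repl = _mm_andnot_si128(le, y);                     /* vpandn */
    return _mm_or_si128(keep, repl);                            /* vpor   */
}
```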
%3 = icmp ugt <8 x i16> %2, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
%4 = sext <8 x i1> %3 to <8 x i16>
%8 = bitcast <8 x i16> %4 to <16 x i8>
%9 = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %7, <16 x i8> %6, <16 x i8> %8)
@spatel Is anything stopping us from converting this into a select?
As a side note:
vpminuw .LCPI0_0(%rip), %xmm0, %xmm1 # .LCPI0_0 is a vector of 255s
In some cases, it may be better to materialize a splat constant like .LCPI0_0 using an all-ones idiom followed by a vector shift.
Something like this:
vpcmpeqd %xmm3, %xmm3, %xmm3 ## All-ones idiom.
vpsrlw $8, %xmm3, %xmm3
All-ones idioms are dependency-breaking instructions (though unfortunately not zero-latency). Vector shifts are low latency and normally have good throughput on most targets.
If register pressure is not problematic, then expanding that vector constant might be an option to consider.
In general, mask-like splat constants are easy to rematerialize that way.
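In intrinsics, that rematerialization might look like this (a sketch; `_mm_undefined_si128` just stands in for "any register"):

```c
#include <immintrin.h>

/* Build a splat of 0x00FF without a constant-pool load:
   pcmpeq of a register with itself gives all-ones, then a
   16-bit logical right shift by 8 leaves 0x00FF in each lane. */
static __m128i splat_0x00ff(void) {
    __m128i t    = _mm_undefined_si128();
    __m128i ones = _mm_cmpeq_epi32(t, t);   /* vpcmpeqd: all-ones idiom */
    return _mm_srli_epi16(ones, 8);         /* vpsrlw $8 */
}
```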
Just an idea...
(In reply to Simon Pilgrim from comment #2)
> %3 = icmp ugt <8 x i16> %2, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
> %4 = sext <8 x i1> %3 to <8 x i16>
> %8 = bitcast <8 x i16> %4 to <16 x i8>
> %9 = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %7, <16 x i8> %6, <16 x i8> %8)
>
> @spatel Is anything stopping us from converting this into a select?
I was going to say 'the bitcast', but we fixed that already with:
https://reviews.llvm.org/rL342806
And that works with the given IR. There's no more blendv intrinsic here:
$ opt -instcombine 39859.ll -S
%1 = load <2 x i64>, <2 x i64>* %x, align 16
%2 = bitcast <2 x i64> %1 to <8 x i16>
%3 = icmp ugt <8 x i16> %2, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
%4 = bitcast <2 x i64>* %y to <8 x i16>*
%5 = load <8 x i16>, <8 x i16>* %4, align 16
%6 = bitcast <2 x i64> %1 to <8 x i16>
%7 = select <8 x i1> %3, <8 x i16> %5, <8 x i16> %6
%8 = bitcast <2 x i64>* %0 to <8 x i16>*
store <8 x i16> %7, <8 x i16>* %8, align 16
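For readers following along, that IR is just a per-lane select; a scalar C equivalent (names are mine) is:

```c
#include <stdint.h>

/* Per u16 lane: take y when x > 255, otherwise keep x. */
void select_gt255(uint16_t out[8], const uint16_t x[8], const uint16_t y[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = (x[i] > 255) ? y[i] : x[i];
}
```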
So, to be clear: the IR is optimal and generic. We need a backend fix as suggested in the description, or the improved version suggested in comment 1.
The min/max improvement would be something like this:
https://rise4fun.com/Alive/zsf
Name: x > C1 --> !(x == min(x,C1))
%ugt = icmp ugt i8 %x, C1
=>
%cmp = icmp ult i8 %x, C1
%min = select i1 %cmp, i8 %x, C1
%ule = icmp eq i8 %min, %x
%ugt = xor i1 %ule, -1
Name: x > C1 --> x == max(x, C1+1)
Pre: C1 != 255
%ugt = icmp ugt i8 %x, C1
=>
%cmp = icmp ugt i8 %x, C1+1
%max = select i1 %cmp, i8 %x, C1+1
%ugt = icmp eq i8 %max, %x
I.e., if we are comparing against a constant, then we should never need the invert that we're currently using in LowerVSETCC().
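Applied to the vector case above, the max form would lower `x > 255` on u16 lanes without any inversion; a sketch in intrinsics (function name mine):

```c
#include <immintrin.h>

/* x > 255 (unsigned, per u16 lane) as x == max(x, 256):
   max(x, 256) == x holds exactly when x >= 256, i.e. x > 255.
   No final xor/invert is needed. */
static __m128i ugt255_mask(__m128i x) {
    __m128i m = _mm_max_epu16(x, _mm_set1_epi16(256));  /* vpmaxuw (SSE4.1) */
    return _mm_cmpeq_epi16(m, x);                       /* vpcmpeqw */
}
```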
The 'SETULT' case should be handled like this:
https://rise4fun.com/Alive/Qrl
Name: x < C1 --> x == min(x,C1-1)
Pre: C1 != 0
%ult = icmp ult i8 %x, C1
=>
%cmp = icmp ult i8 %x, C1-1
%min = select i1 %cmp, i8 %x, C1-1
%ult = icmp eq i8 %min, %x
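The analogous vector form for, e.g., `x < 256` on u16 lanes (again a sketch; name mine):

```c
#include <immintrin.h>

/* x < 256 (unsigned, per u16 lane) as x == min(x, 255):
   min(x, 255) == x holds exactly when x <= 255, i.e. x < 256. */
static __m128i ult256_mask(__m128i x) {
    __m128i m = _mm_min_epu16(x, _mm_set1_epi16(255));  /* vpminuw (SSE4.1) */
    return _mm_cmpeq_epi16(m, x);                       /* vpcmpeqw */
}
```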
I suspect we'll need both the min/max and psubus changes since the ISA is never quite complete.
I have a draft of the min/max code changes. I'll clean it up, add tests, and post it for review soon-ish if nobody else has already started working on this.
Patch for review:
https://reviews.llvm.org/D55515
Step 1:
https://reviews.llvm.org/rL349304
I improved handling of non-splat cases here: rL352283
Proposal for more psubus:
https://reviews.llvm.org/D60838
https://reviews.llvm.org/rL358999
That's the last step here AFAIK. The blendv vs. bitwise select issue for Jaguar is an independent bug.