| Bugzilla Link | PR39859 |
| --- | --- |
| Status | RESOLVED FIXED |
| Importance | P enhancement |
| Reported by | Nikita Popov (nikita.ppv@gmail.com) |
| Reported on | 2018-12-02 02:39:28 -0800 |
| Last modified on | 2019-04-23 08:34:33 -0700 |
| Version | trunk |
| Hardware | PC Windows NT |
| CC | andrea.dibiagio@gmail.com, craig.topper@gmail.com, danielwatson311@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
| Fixed by commit(s) | rL349304, rL352283, rL358999 |
| See also | PR14613, PR34621, PR40053 |
The compiler could also generate saturating arithmetic for this particular case:
vpsubusw .LCPI0_0(%rip), %xmm0, %xmm2
vpxor %xmm3, %xmm3, %xmm3
vpcmpeqw %xmm2, %xmm3, %xmm2
vpblendvb %xmm2, %xmm0, %xmm1, %xmm0
Performance counter stats for './test.elf' (average of 5 runs):
134,678,425 cycles:u ( +- 0.02% ) (75.77%)
135,127,545 instructions:u # 1.00 insn per cycle ( +- 0.12% ) (75.71%)
The latency/throughput of vpsubusw is even better than pmin/pmax on Skylake: according to Agner Fog's instruction tables, its reciprocal throughput is 0.33, versus 0.5 for min/max. On all other x86 processors it should be at least as good as pmin/pmax.
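For reference, here is a minimal C sketch of that saturating-subtract lowering using SSE intrinsics (the function name and argument layout are mine, not from the original test case):

```c
#include <immintrin.h>

/* Per u16 lane: keep x where x <= 255, otherwise take y.
   subs(x, 255) is nonzero exactly when x > 255, so comparing the
   saturating difference against zero yields the blend mask directly. */
static __m128i select_le255(__m128i x, __m128i y) {
    __m128i diff = _mm_subs_epu16(x, _mm_set1_epi16(255));      /* vpsubusw */
    __m128i le   = _mm_cmpeq_epi16(diff, _mm_setzero_si128());  /* lanes where x <= 255 */
    return _mm_blendv_epi8(y, x, le);                           /* vpblendvb */
}
```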
On Jaguar only, the presence of a variable blend is problematic because it is microcoded (it decodes into 3 COPs; total latency 2cy; reciprocal throughput 2). So this variant (for Jaguar only) would be better:
vpsubusw .LCPI0_0(%rip), %xmm0, %xmm2
vpxor %xmm3, %xmm3, %xmm3
vpcmpeqw %xmm2, %xmm3, %xmm2
vpand %xmm0, %xmm2, %xmm3
vpandn %xmm1, %xmm2, %xmm2
vpor %xmm3, %xmm2, %xmm0
Performance counter stats for './test.elf' (average of 5 runs):
100,932,846 cycles:u ( +- 0.01% ) (75.08%)
203,128,065 instructions:u # 2.01 insn per cycle ( +- 0.21% ) (74.81%)
We generate more instructions, but the code has a better mix of FP0/FP1 instructions, and the out-of-order backend does a much better job of distributing uOps.
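The same select written with the bitwise and/andn/or expansion above would look like this in intrinsics (again a sketch; names are mine):

```c
#include <immintrin.h>

/* Jaguar-friendly variant: the variable blend is replaced by
   (mask & x) | (~mask & y), i.e. vpand/vpandn/vpor. */
static __m128i select_le255_bitwise(__m128i x, __m128i y) {
    __m128i diff = _mm_subs_epu16(x, _mm_set1_epi16(255));
    __m128i le   = _mm_cmpeq_epi16(diff, _mm_setzero_si128());  /* x <= 255 mask */
    __m128i keep = _mm_and_si128(le, x);                        /* vpand  */
    __m128i repl = _mm_andnot_si128(le, y);                     /* vpandn */
    return _mm_or_si128(keep, repl);                            /* vpor   */
}
```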
%3 = icmp ugt <8 x i16> %2, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
%4 = sext <8 x i1> %3 to <8 x i16>
%8 = bitcast <8 x i16> %4 to <16 x i8>
%9 = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %7, <16 x i8> %6, <16 x i8> %8)
@spatel Is anything stopping us from converting this into a select?
As a side note:
vpminuw .LCPI0_0(%rip), %xmm0, %xmm1 # .LCPI0_0 is a vector of 255s
In some cases, it may be better to materialize a splat constant like .LCPI0_0 using an all-ones idiom followed by a vector shift.
Something like this:
vpcmpeqd %xmm3, %xmm3, %xmm3 ## All-ones idiom.
vpsrlw $8, %xmm3, %xmm3
All-ones idioms are dependency-breaking instructions (though unfortunately not zero-latency). Vector shifts are low latency and normally have good throughput on most targets.
If register pressure is not problematic, then expanding that vector constant might be an option to consider.
In general, mask-like splat constants are easy to rematerialize that way.
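In intrinsics, that rematerialization might look like this (a sketch; `_mm_undefined_si128` just stands in for "any register"):

```c
#include <immintrin.h>

/* Build a splat of 0x00FF without a constant-pool load:
   pcmpeq of a register with itself gives all-ones, then a
   16-bit logical right shift by 8 leaves 0x00FF in each lane. */
static __m128i splat_0x00ff(void) {
    __m128i t    = _mm_undefined_si128();
    __m128i ones = _mm_cmpeq_epi32(t, t);   /* vpcmpeqd: all-ones idiom */
    return _mm_srli_epi16(ones, 8);         /* vpsrlw $8 */
}
```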
Just an idea...
(In reply to Simon Pilgrim from comment #2)
> %3 = icmp ugt <8 x i16> %2, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
> %4 = sext <8 x i1> %3 to <8 x i16>
> %8 = bitcast <8 x i16> %4 to <16 x i8>
> %9 = tail call <16 x i8> @llvm.x86.sse41.pblendvb(<16 x i8> %7, <16 x i8> %6, <16 x i8> %8)
>
> @spatel Is anything stopping us from converting this into a select?
I was going to say 'the bitcast', but we fixed that already with:
https://reviews.llvm.org/rL342806
And that works with the given IR. There's no more blendv intrinsic here:
$ opt -instcombine 39859.ll -S
%1 = load <2 x i64>, <2 x i64>* %x, align 16
%2 = bitcast <2 x i64> %1 to <8 x i16>
%3 = icmp ugt <8 x i16> %2, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
%4 = bitcast <2 x i64>* %y to <8 x i16>*
%5 = load <8 x i16>, <8 x i16>* %4, align 16
%6 = bitcast <2 x i64> %1 to <8 x i16>
%7 = select <8 x i1> %3, <8 x i16> %5, <8 x i16> %6
%8 = bitcast <2 x i64>* %0 to <8 x i16>*
store <8 x i16> %7, <8 x i16>* %8, align 16
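For readers following along, that IR is just a per-lane select; a scalar C equivalent (names are mine) is:

```c
#include <stdint.h>

/* Per u16 lane: take y when x > 255, otherwise keep x. */
void select_gt255(uint16_t out[8], const uint16_t x[8], const uint16_t y[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = (x[i] > 255) ? y[i] : x[i];
}
```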
So, to be clear: the IR is optimal and generic. We need a backend fix as suggested in the description, or the improved version suggested in comment 1.
The min/max improvement would be something like this:
https://rise4fun.com/Alive/zsf
Name: x > C1 --> !(x == min(x,C1))
%ugt = icmp ugt i8 %x, C1
=>
%cmp = icmp ult i8 %x, C1
%min = select i1 %cmp, i8 %x, C1
%ule = icmp eq i8 %min, %x
%ugt = xor i1 %ule, -1
Name: x > C1 --> x == max(x, C1+1)
Pre: C1 != 255
%ugt = icmp ugt i8 %x, C1
=>
%cmp = icmp ugt i8 %x, C1+1
%max = select i1 %cmp, i8 %x, C1+1
%ugt = icmp eq i8 %max, %x
I.e., if we are comparing against a constant, then we should never need the invert that we're currently using in LowerVSETCC().
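Applied to the vector case above, the max form would lower `x > 255` on u16 lanes without any inversion; a sketch in intrinsics (function name mine):

```c
#include <immintrin.h>

/* x > 255 (unsigned, per u16 lane) as x == max(x, 256):
   max(x, 256) == x holds exactly when x >= 256, i.e. x > 255.
   No final xor/invert is needed. */
static __m128i ugt255_mask(__m128i x) {
    __m128i m = _mm_max_epu16(x, _mm_set1_epi16(256));  /* vpmaxuw (SSE4.1) */
    return _mm_cmpeq_epi16(m, x);                       /* vpcmpeqw */
}
```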
The 'SETULT' case should be handled like this:
https://rise4fun.com/Alive/Qrl
Name: x < C1 --> x == min(x,C1-1)
Pre: C1 != 0
%ult = icmp ult i8 %x, C1
=>
%cmp = icmp ult i8 %x, C1-1
%min = select i1 %cmp, i8 %x, C1-1
%ult = icmp eq i8 %min, %x
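The analogous vector form for, e.g., `x < 256` on u16 lanes (again a sketch; name mine):

```c
#include <immintrin.h>

/* x < 256 (unsigned, per u16 lane) as x == min(x, 255):
   min(x, 255) == x holds exactly when x <= 255, i.e. x < 256. */
static __m128i ult256_mask(__m128i x) {
    __m128i m = _mm_min_epu16(x, _mm_set1_epi16(255));  /* vpminuw (SSE4.1) */
    return _mm_cmpeq_epi16(m, x);                       /* vpcmpeqw */
}
```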
I suspect we'll need both the min/max and psubus changes since the ISA is never quite complete.
I have a draft of the min/max code changes. I'll clean it up, add tests, and post it for review soon-ish if nobody else has already started working on this.
Patch for review:
https://reviews.llvm.org/D55515
Step 1:
https://reviews.llvm.org/rL349304
I improved handling of non-splat cases here: rL352283
Proposal for more psubus:
https://reviews.llvm.org/D60838
https://reviews.llvm.org/rL358999
That's the last step here AFAIK. The blendv vs. bitwise select issue for Jaguar is an independent bug.