Open Quuxplusone opened 4 years ago
Bugzilla Link | PR45808 |
Status | NEW |
Importance | P enhancement |
Reported by | Nemo Publius (nemo@self-evident.org) |
Reported on | 2020-05-05 13:03:37 -0700 |
Last modified on | 2020-07-01 14:18:55 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | craig.topper@gmail.com, florian_hahn@apple.com, htmldeveloper@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
Fixed by commit(s) | rGfe6f5ba0bffd, rGb8a725274c22, rG3521ecf1f8a3 |
See also | PR42653 |
define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
%3 = icmp sgt <4 x i64> %0, %1
%4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
%5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
ret <4 x i64> %5
}
(In reply to Simon Pilgrim from comment #1)
> define <4 x i64> @_Z6minmaxDv4_xS_(<4 x i64> %0, <4 x i64> %1) {
> %3 = icmp sgt <4 x i64> %0, %1
> %4 = xor <4 x i1> %3, <i1 true, i1 true, i1 false, i1 false>
So basically the code that handles materializing all-ones constants
as pcmpeq needs to be taught that if the lower portion is all-ones and the
rest is zeros, it might still be profitable, I'm guessing?
> %5 = select <4 x i1> %4, <4 x i64> %0, <4 x i64> %1
> ret <4 x i64> %5
> }
We might need to improve PromoteMaskArithmetic to better handle selects.
CC'ing Florian, who made a load of improvements in D72524.
(In reply to Roman Lebedev from comment #2)
> So basically the code that handles materializing all-ones constants
> as pcmpeq needs to be taught that if the lower portion is all-ones and the
> rest is zeros, it might still be profitable, I'm guessing?
[Bug #42653] discusses something similar for rematerializable lower 'all-ones'
subvector masks, once we avoid the unnecessary packss/pmovsx.
rGfe6f5ba0bffd - added test case
rGb8a725274c22 - fixed PACKSS promotion issues
Current AVX2 Codegen:
.LCPI0_0:
.quad 1 # 0x1
.quad 1 # 0x1
.quad 0 # 0x0
.quad 0 # 0x0
_Z6minmaxDv4_xS_: # @_Z6minmaxDv4_xS_
vpcmpgtq %ymm1, %ymm0, %ymm2
vpxor .LCPI0_0(%rip), %ymm2, %ymm2
vpsllq $63, %ymm2, %ymm2
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
retq
(In reply to Simon Pilgrim from comment #5)
> rGfe6f5ba0bffd - added test case
> rGb8a725274c22 - fixed PACKSS promotion issues
>
> Current AVX2 Codegen:
>
> .LCPI0_0:
> .quad 1 # 0x1
> .quad 1 # 0x1
> .quad 0 # 0x0
> .quad 0 # 0x0
> _Z6minmaxDv4_xS_: # @_Z6minmaxDv4_xS_
> vpcmpgtq %ymm1, %ymm0, %ymm2
Produces either all-ones (-1) or all-zeros per lane.
> vpxor .LCPI0_0(%rip), %ymm2, %ymm2
Inverts the lowest bit only.
> vpsllq $63, %ymm2, %ymm2
Moves the lowest bit into the highest bit.
> vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
Uses the highest bit to control the blend.
Can't we get rid of the vpsllq by using -1 instead of 1 in the xor?
> retq
That's the next step - there is plenty of code that tries to do that kind of thing - and nearly all of it ignores vectors :-)
Initial patch: https://reviews.llvm.org/D82257
This makes sure we're using -1/0 sign masks, but it doesn't yet materialize the constant using VPCMPEQ xmm (with implicit zeroing of the upper elements).
Current AVX2 Codegen:
.LCPI0_0:
.quad -1 # 0xffffffffffffffff
.quad -1 # 0xffffffffffffffff
.quad 0 # 0x0
.quad 0 # 0x0
_Z6minmaxDv4_xS_: # @_Z6minmaxDv4_xS_
vpcmpgtq %ymm1, %ymm0, %ymm2
vpxor .LCPI0_0(%rip), %ymm2, %ymm2
vblendvpd %ymm2, %ymm0, %ymm1, %ymm0
retq