Quuxplusone opened 5 years ago
| Field | Value |
| --- | --- |
| Bugzilla Link | PR40686 |
| Status | NEW |
| Importance | P enhancement |
| Reported by | Peter Cordes (peter@cordes.ca) |
| Reported on | 2019-02-10 23:26:04 -0800 |
| Last modified on | 2020-04-01 03:44:55 -0700 |
| Version | trunk |
| Hardware | PC Linux |
| CC | craig.topper@gmail.com, htmldeveloper@gmail.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | |
| Blocked by | |
| See also | PR40685, PR42674 |
PR40685 is fixed now (https://godbolt.org/z/ern9kc), leaving us with
```asm
count(bool const*, int):
        pxor    xmm0, xmm0
        xor     eax, eax
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = <1,1,1,1,u,u,u,u,u,u,u,u,u,u,u,u>
.LBB0_1:
        movd    xmm2, dword ptr [rdi + rax]        # xmm2 = mem[0],zero,zero,zero
        pxor    xmm2, xmm1
        pmovzxbd xmm2, xmm2                        # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
        paddd   xmm0, xmm2
        add     rax, 4
        cmp     rax, 100
        jne     .LBB0_1
        ... efficient pshufd hsum of dwords
        ret
```
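For reference, a source function along these lines reproduces codegen of this shape (my reconstruction from the asm; the report doesn't include the testcase, so the exact source is an assumption):

```cpp
#include <cstddef>

// Plausible testcase (assumption: reconstructed from the asm above,
// not taken from the report): count the false entries in 100 bools.
// The byte-wise pxor against a vector of 1s in the loop corresponds
// to computing !p[i] before widening and accumulating with paddd.
int count(bool const* p, int /*unused*/) {
    int total = 0;
    for (size_t i = 0; i < 100; ++i)
        total += !p[i];
    return total;
}
```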
As discussed previously, special-casing small loop counts where paddb can't overflow could be a ~4x speedup, and would allow a more efficient cleanup with psadbw for unsigned byte hsums. Various other tricks are possible for general-case lengths.
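The paddb idea can be sketched with intrinsics (an illustrative hypothetical function, not clang output; it assumes n is a multiple of 16 and that bools are stored as 0/1 bytes per the usual ABI):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of the paddb strategy (hypothetical; assumes n is a multiple
// of 16 and each bool is a 0/1 byte): run up to 255 vectors through
// byte accumulators -- each lane grows by at most 1 per paddb, so a
// u8 lane can't overflow within a 255-vector block -- then let psadbw
// do the widening horizontal sum once per block.
int count_paddb(bool const* p, size_t n) {
    const __m128i ones = _mm_set1_epi8(1);
    __m128i total = _mm_setzero_si128();           // qword totals
    for (size_t i = 0; i < n; ) {
        size_t block = (n - i < 255 * 16) ? n - i : 255 * 16;
        __m128i bytes = _mm_setzero_si128();       // byte accumulators
        for (size_t j = 0; j < block; j += 16) {
            __m128i v = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(p + i + j));
            // flip 0/1 bytes so we count the false entries
            bytes = _mm_add_epi8(bytes, _mm_xor_si128(v, ones));
        }
        // psadbw against zero hsums 8 bytes into each qword half
        total = _mm_add_epi64(total,
                              _mm_sad_epu8(bytes, _mm_setzero_si128()));
        i += block;
    }
    return _mm_cvtsi128_si32(total) +
           _mm_cvtsi128_si32(_mm_srli_si128(total, 8));
}
```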
This introduces a new missed optimization vs. clang4.0.1, which used to use pmovzx with a memory source.
```asm
pmovzxbd xmm3, dword ptr [rdi + rax]
```
PMOVZX with a memory source can micro-fuse, saving a front-end uop on Sandybridge-family vs. a movd load plus a separate pmovzxbd xmm,xmm. https://www.uops.info/html-tp/SKL/PMOVZXBD_XMM_M32-Measurements.html shows that its 2 uops can micro-fuse (RETIRE_SLOTS: 1.0) on Skylake, for example. The YMM-destination version can't for some reason (https://www.uops.info/html-instr/VPMOVZXBD_YMM_M64.html), but we should take advantage of the XMM version when we can (even with an indexed addressing mode for the non-VEX version, or for VEX 128-bit only with a non-indexed addressing mode).
We fail to compress the XOR constant, still padding the .rodata vector with 12 bytes of zeros and loading it with movdqa instead of movd. So we get the worst of both worlds: extra uops in the loop and a larger constant.
Loading 8 or 16 bytes at a time with movq or movdqu, flipping them all with one pxor, and then unpacking with punpcklbw / punpckhbw might even be a win. But probably not if we're widening all the way to int: that would take 6 shuffle uops per 4 vectors of PADDD input vs. just 4. 4-byte loads are cheap, micro-fusion can get them through the front-end for free as part of a pmovzx, and pxor is also cheap.
So low/high unpacking against zero is probably only a win if we notice we can use paddw. And if we're going to do that kind of analysis at all, even better to notice we can do psadbw -> paddd, effectively using psadbw as the unpack and hsum.
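The psadbw-as-unpack-and-hsum idea can be sketched directly (a hypothetical function, not clang output; it does one psadbw per input vector rather than the blocked paddb loop above, trading a psadbw per vector for zero overflow bookkeeping; assumes n is a multiple of 16 and 0/1 bool bytes):

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch (hypothetical; assumes n is a multiple of 16 and 0/1 bool
// bytes): psadbw each flipped vector against zero, which "unpacks"
// and hsums 8 bytes into each qword half in one instruction, then
// accumulate with paddd. The high dword of each qword half stays
// zero (each psadbw result is at most 8 * 255), so paddd is safe.
int count_psadbw(bool const* p, size_t n) {
    const __m128i ones = _mm_set1_epi8(1);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(p + i));
        acc = _mm_add_epi32(acc, _mm_sad_epu8(_mm_xor_si128(v, ones),
                                              _mm_setzero_si128()));
    }
    // add the high qword's count into the low dword and extract it
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, 0xEE));
    return _mm_cvtsi128_si32(acc);
}
```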