Quuxplusone opened 6 years ago
Attached himeno.c (6459 bytes, text/x-csrc): himeno.c source file
My first guess was that code alignment was a factor, but I tried aligning the AVX loop all the way up to 128 bytes, and I don't see any perf difference.
Any other explanations/experiments/ideas?
Here's the inner loop dump as SSE and AVX (built today via clang r323872):
SSE:
0000000100001750 mulss 0x41830(%r13,%rax,4), %xmm4
000000010000175a mulss 0x41834(%r13,%rax,4), %xmm1
0000000100001764 movaps %xmm5, %xmm9
0000000100001768 movaps %xmm11, %xmm8
000000010000176c movaps %xmm10, %xmm13
0000000100001770 addss %xmm4, %xmm1
0000000100001774 movss 0x10610(%rax,%r15), %xmm10
000000010000177e movss 0x41838(%r13,%rax,4), %xmm4
0000000100001788 mulss %xmm10, %xmm4
000000010000178d movss 0x20c14(%rax,%r15), %xmm5
0000000100001797 subss 0x2080c(%rax,%r15), %xmm5
00000001000017a1 subss 0x40c(%rax,%r15), %xmm5
00000001000017ab addss 0x4(%rax,%r15), %xmm5
00000001000017b2 addss %xmm1, %xmm4
00000001000017b6 mulss -0x8(%rdx,%r9), %xmm5
00000001000017bd movss 0x10814(%rax,%r15), %xmm1
00000001000017c7 movss 0x1040c(%rax,%r15), %xmm11
00000001000017d1 movaps %xmm1, %xmm3
00000001000017d4 subss %xmm11, %xmm3
00000001000017d9 subss 0x1080c(%rax,%r15), %xmm3
00000001000017e3 incq %r11
00000001000017e6 addss %xmm6, %xmm3
00000001000017ea mulss -0x4(%rdx,%r9), %xmm3
00000001000017f1 addss %xmm4, %xmm5
00000001000017f5 addss %xmm5, %xmm3
00000001000017f9 movss 0x20a14(%rax,%r15), %xmm4
0000000100001803 movss 0x20c(%rax,%r15), %xmm5
000000010000180d movaps %xmm4, %xmm6
0000000100001810 subss %xmm5, %xmm6
0000000100001814 subss 0x20a0c(%rax,%r15), %xmm6
000000010000181e addss %xmm7, %xmm6
0000000100001822 mulss (%rdx,%r9), %xmm6
0000000100001828 addss %xmm3, %xmm6
000000010000182c movss -0x8(%rcx,%r9), %xmm3
0000000100001833 mulss %xmm9, %xmm3
0000000100001838 addss %xmm6, %xmm3
000000010000183c movss -0x4(%rcx,%r9), %xmm6
0000000100001843 mulss %xmm8, %xmm6
0000000100001848 addss %xmm3, %xmm6
000000010000184c mulss (%rcx,%r9), %xmm2
0000000100001852 addss %xmm6, %xmm2
0000000100001856 addss 0x1060c(%rax,%rdi), %xmm2
000000010000185f mulss 0x4183c(%r13,%rax,4), %xmm2
0000000100001869 subss %xmm13, %xmm2
000000010000186e mulss 0x1060c(%rax,%r14), %xmm2
0000000100001878 movaps %xmm12, %xmm3
000000010000187c mulss %xmm2, %xmm3
0000000100001880 mulss %xmm2, %xmm2
0000000100001884 addss %xmm13, %xmm3
0000000100001889 movss %xmm3, 0x1060c(%rax,%r12)
0000000100001893 addss %xmm2, %xmm0
0000000100001897 addq $0x4, %rax
000000010000189b addq $0xc, %r9
000000010000189f movaps %xmm13, %xmm2
00000001000018a3 movaps %xmm9, %xmm7
00000001000018a7 movaps %xmm8, %xmm6
00000001000018ab cmpq %rbx, %r11
00000001000018ae jl 0x100001750
AVX:
0000000100001770 vmulss 0x41830(%r13,%rax,4), %xmm7, %xmm7
000000010000177a vmulss (%rcx,%r9), %xmm6, %xmm6
0000000100001780 vmovaps %xmm4, %xmm3
0000000100001784 vmovaps %xmm11, %xmm1
0000000100001788 vmovaps %xmm12, %xmm2
000000010000178c vmulss 0x41834(%r13,%rax,4), %xmm10, %xmm0
0000000100001796 vaddss %xmm0, %xmm7, %xmm4
000000010000179a vmovss 0x10610(%rax,%r15), %xmm12
00000001000017a4 vmulss 0x41838(%r13,%rax,4), %xmm12, %xmm5
00000001000017ae vmovss 0x20c14(%rax,%r15), %xmm7
00000001000017b8 vsubss 0x2080c(%rax,%r15), %xmm7, %xmm7
00000001000017c2 vsubss 0x40c(%rax,%r15), %xmm7, %xmm7
00000001000017cc vaddss 0x4(%rax,%r15), %xmm7, %xmm7
00000001000017d3 vmulss -0x8(%rdx,%r9), %xmm7, %xmm7
00000001000017da vaddss %xmm5, %xmm4, %xmm4
00000001000017de vaddss %xmm7, %xmm4, %xmm4
00000001000017e2 vmovss 0x10814(%rax,%r15), %xmm10
00000001000017ec vmovss 0x1040c(%rax,%r15), %xmm11
00000001000017f6 vsubss %xmm11, %xmm10, %xmm7
00000001000017fb vsubss 0x1080c(%rax,%r15), %xmm7, %xmm7
0000000100001805 vaddss %xmm8, %xmm7, %xmm7
000000010000180a vmulss -0x4(%rdx,%r9), %xmm7, %xmm7
0000000100001811 vaddss %xmm7, %xmm4, %xmm5
0000000100001815 vmovss 0x20a14(%rax,%r15), %xmm7
000000010000181f vmovss 0x20c(%rax,%r15), %xmm4
0000000100001829 vsubss %xmm4, %xmm7, %xmm0
000000010000182d vsubss 0x20a0c(%rax,%r15), %xmm0, %xmm0
0000000100001837 incq %r11
000000010000183a vaddss %xmm9, %xmm0, %xmm0
000000010000183f vmulss (%rdx,%r9), %xmm0, %xmm0
0000000100001845 vaddss %xmm0, %xmm5, %xmm0
0000000100001849 vmulss -0x8(%rcx,%r9), %xmm3, %xmm5
0000000100001850 vaddss %xmm5, %xmm0, %xmm0
0000000100001854 vmulss -0x4(%rcx,%r9), %xmm1, %xmm5
000000010000185b vaddss %xmm5, %xmm0, %xmm0
000000010000185f vaddss %xmm6, %xmm0, %xmm0
0000000100001863 vaddss 0x1060c(%rax,%rdi), %xmm0, %xmm0
000000010000186c vmulss 0x4183c(%r13,%rax,4), %xmm0, %xmm0
0000000100001876 vsubss %xmm2, %xmm0, %xmm0
000000010000187a vmulss 0x1060c(%rax,%r14), %xmm0, %xmm0
0000000100001884 vmulss %xmm0, %xmm0, %xmm5
0000000100001888 vmulss %xmm0, %xmm13, %xmm0
000000010000188c vaddss %xmm0, %xmm2, %xmm0
0000000100001890 vmovss %xmm0, 0x1060c(%rax,%r12)
000000010000189a vaddss %xmm5, %xmm14, %xmm14
000000010000189e addq $0x4, %rax
00000001000018a2 addq $0xc, %r9
00000001000018a6 vmovaps %xmm2, %xmm6
00000001000018aa vmovaps %xmm3, %xmm9
00000001000018ae vmovaps %xmm1, %xmm8
00000001000018b2 cmpq %rbx, %r11
00000001000018b5 jl 0x100001770
From a quick look over the asm, I think the issue is that Haswell can micro-fuse an indexed addressing mode with SSE MULSS (2-operand destructive destination), but not with AVX VMULSS.
IACA doesn't know this, and applies the Sandybridge micro-fusion / un-lamination rules, so its output is wrong for Haswell.
See my answer on https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes for the full details (although I have an unfinished edit that I should really post after discovering more patterns in what fuses and what doesn't).
The TL;DR is that Haswell introduced support for 3-input uops so FMA can be single-uop. Micro-fusion takes advantage of this, allowing some micro-fused ALU + load uops with an indexed addressing mode to stay micro-fused instead of un-laminating at issue.
But the only instructions that can stay micro-fused are instructions like add, paddd, or mulps, which have 2 operands and a read-write destination.
2-operand instructions with a write-only destination, like sqrtps (but not sqrtss) or pabsb (but not paddb), will un-laminate, and so will 3-operand instructions, even if the dest is the same as the middle source operand.
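To make that concrete, here's a small hand-written illustration (arbitrary registers, not taken from the himeno loop above) of how the rule plays out with an indexed addressing mode on HSW/SKL:

  # stays micro-fused: 2 operands, read-write destination
  mulss  0x10(%rax,%r15), %xmm0          # xmm0 *= mem
  addps  (%rdx,%r9,4), %xmm1             # xmm1 += mem

  # un-laminates at issue into 2 separate fused-domain uops
  vmulss 0x10(%rax,%r15), %xmm0, %xmm0   # 3-operand VEX form, write-only dest
  sqrtps (%rdx,%r9,4), %xmm2             # write-only destination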
All of the above applies to Skylake as well, BTW. I haven't found any micro-fusion differences between SKL and HSW, just from SnB to HSW.
And BTW, nobody has documented this anywhere else, AFAIK. Intel's optimization manual only mentions the SnB rules for un-lamination, without mentioning the HSW improvements. Agner Fog's guides don't even mention un-lamination at all.
(In reply to Peter Cordes from comment #4)
> And BTW, nobody has documented this anywhere else, AFAIK. Intel's
> optimization manual only mentions the SnB rules for un-lamination, without
> mentioning the HSW improvements. Agner Fog's guides don't even mention
> un-lamination at all.
Thanks, Peter!
Solving this bug should start with Intel fixing the docs and IACA. I'm not
going to take uarch-based shots-in-the-dark to try to solve this with compiler
hacks. If someone else wants to take that on, feel free.
(In reply to Sanjay Patel from comment #5)
> Solving this bug should start with Intel fixing the docs and IACA.
Yes, that would be nice :)
> I'm not
> going to take uarch-based shots-in-the-dark to try to solve this with
> compiler hacks. If someone else wants to take that on, feel free.
It might be a good idea to treat indexed addressing modes for non-
mov/movzx/movsx instructions as more expensive than normal if tuning
specifically for snb/ivb. (Pure load uops include broadcast-loads, though, and
don't need to avoid indexed addressing modes.)
And on later members of the SnB family, indexed addressing modes could be treated as more expensive at least for VEX ALU instructions; that would be a reasonable approximation of when they cost more, even if it misses PABSB and so on.
Note that stores have separate store-data/store-address uops that can micro-
fuse (except in some cases like vextracti128, or some ALU+store instructions on
some CPUs). And only non-indexed stores can use port7, so there can be
advantages to non-indexed stores even on Haswell (which can keep them micro-
fused).
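For example (arbitrary registers, not from this benchmark):

  movss  %xmm0, 0x10(%rcx)          # simple addressing: store-address uop can use p2/p3/p7
  movss  %xmm0, 0x10(%rcx,%rax,4)   # indexed: store-address uop limited to p2/p3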
In a copy+modify loop that can't / shouldn't fold the loads into an ALU
instruction (because the same data is needed multiple times), it can make sense
to address the src data *relative to the dst*, so the loads use indexed
addressing modes with (src_base - dst_base) in one register and current_dst in
another register to produce current_src. And current_dst is used directly for
stores. Then the loop overhead is only one pointer-increment + cmp / jb
against an end-pointer.
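A minimal sketch of that layout (hypothetical registers and a trivial scale-and-store body, just to show the addressing):

  # rsi = src_base - dst_base  (computed once, outside the loop)
  # rdi = current_dst, rdx = end of dst
  .Lcopy_loop:
      movss   (%rdi,%rsi), %xmm0     # load src via current_dst + (src_base - dst_base)
      mulss   %xmm1, %xmm0           # ...modify (data could be reused several times)...
      movss   %xmm0, (%rdi)          # non-indexed store: stays micro-fused, can use port 7
      addq    $4, %rdi               # single pointer increment
      cmpq    %rdx, %rdi
      jb      .Lcopy_loop            # loop overhead: one increment + cmp/jb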
Anyway, indexed addressing modes are valuable tools for saving total uops,
especially when not unrolling, so we don't want to just increase their "cost".
To make optimal code, we're going to need a model that knows about which
instructions can use indexed addressing modes cheaply and which can't. If LLVM
can't currently do that, then we can start thinking about how to implement that
now, while waiting for Intel to document it better. (Or for me to clean up my
SO answer posting more of my test results for more instructions...)
The April 2018 version of the Intel Optimization Manual now has the information you are looking for in Section 2.3.5.
(In reply to Craig Topper from comment #7)
> The April 2018 version of the Intel Optimization Manual now has the
> information you are looking for in Section 2.3.5
Thanks! But is this saying that unlamination is only a factor on Haswell uarch?
That contradicts Peter's statement in comment 4 about Skylake uarch.
Yeah, I thought the placement in the manual was odd. I think SKL is the same as Haswell here. I think prior to Haswell, having an index always caused un-lamination. Now it's dependent on the number of sources needed.
(In reply to Craig Topper from comment #9)
> Yeah, I thought the placement in the manual was odd. I think SKL is the same
> as Haswell here. I think prior to Haswell, having an index always caused
> un-lamination. Now it's dependent on the number of sources needed.
Yes, HSW and SKL are the same in my testing, and it matches what Intel
has finally gotten around to documenting. :)
Except their terminology sucks: they could have said a 3 *operand*
max, instead of a 3 *source* max, because they're including a
write-only destination as a source! (Of course, if they were good at
terminology, they would have called it delamination. But apparently
using normal English words was un-possible.)
But anyway, total number of separate operands is a simple rule that
fits everything I tested.
I hadn't noticed the un-lamination for cmp reg,[base+idx] in my
testing, though. That's subtle. add reg,[base+idx] stays fused
because reg is a read-write operand, even though it does also write
flags. (I tested just now on SKL, and that's real: cmp un-laminates,
add stays fused.)
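Spelled out in the same AT&T syntax as the dumps above (registers are arbitrary):

  cmpl   0x10(%rdi,%rcx,4), %eax    # reads eax + base + index, writes only flags: un-laminates
  addl   0x10(%rdi,%rcx,4), %eax    # eax is read-write: stays micro-fused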
----
In the SKL/HSW/SnB section of Intel's optimization manual, stuff
mentioned for one uarch applies to later ones, too, unless overridden
by something in the section for a later uarch.
It's unfortunate that the decoders can't recognize the special case of
destination = first source as a single read-write operand to enable
micro-fusion for VEX/EVEX in cases like vmulps xmm1,xmm1,[b+idx].
But maybe that's also how the uop format differentiates legacy-SSE
from VEX/EVEX that zero-extend into the full vector reg.