Quuxplusone opened 6 years ago
Attached himeno.c (6459 bytes, text/x-csrc): himeno.c source file
My first guess was that code alignment was a factor, but I tried aligning the AVX loop all the way up to 128 bytes, and I don't see any perf difference.
Any other explanations/experiments/ideas?
Here's the inner loop dump as SSE and AVX (built today via clang r323872):
SSE:
0000000100001750 mulss 0x41830(%r13,%rax,4), %xmm4
000000010000175a mulss 0x41834(%r13,%rax,4), %xmm1
0000000100001764 movaps %xmm5, %xmm9
0000000100001768 movaps %xmm11, %xmm8
000000010000176c movaps %xmm10, %xmm13
0000000100001770 addss %xmm4, %xmm1
0000000100001774 movss 0x10610(%rax,%r15), %xmm10
000000010000177e movss 0x41838(%r13,%rax,4), %xmm4
0000000100001788 mulss %xmm10, %xmm4
000000010000178d movss 0x20c14(%rax,%r15), %xmm5
0000000100001797 subss 0x2080c(%rax,%r15), %xmm5
00000001000017a1 subss 0x40c(%rax,%r15), %xmm5
00000001000017ab addss 0x4(%rax,%r15), %xmm5
00000001000017b2 addss %xmm1, %xmm4
00000001000017b6 mulss -0x8(%rdx,%r9), %xmm5
00000001000017bd movss 0x10814(%rax,%r15), %xmm1
00000001000017c7 movss 0x1040c(%rax,%r15), %xmm11
00000001000017d1 movaps %xmm1, %xmm3
00000001000017d4 subss %xmm11, %xmm3
00000001000017d9 subss 0x1080c(%rax,%r15), %xmm3
00000001000017e3 incq %r11
00000001000017e6 addss %xmm6, %xmm3
00000001000017ea mulss -0x4(%rdx,%r9), %xmm3
00000001000017f1 addss %xmm4, %xmm5
00000001000017f5 addss %xmm5, %xmm3
00000001000017f9 movss 0x20a14(%rax,%r15), %xmm4
0000000100001803 movss 0x20c(%rax,%r15), %xmm5
000000010000180d movaps %xmm4, %xmm6
0000000100001810 subss %xmm5, %xmm6
0000000100001814 subss 0x20a0c(%rax,%r15), %xmm6
000000010000181e addss %xmm7, %xmm6
0000000100001822 mulss (%rdx,%r9), %xmm6
0000000100001828 addss %xmm3, %xmm6
000000010000182c movss -0x8(%rcx,%r9), %xmm3
0000000100001833 mulss %xmm9, %xmm3
0000000100001838 addss %xmm6, %xmm3
000000010000183c movss -0x4(%rcx,%r9), %xmm6
0000000100001843 mulss %xmm8, %xmm6
0000000100001848 addss %xmm3, %xmm6
000000010000184c mulss (%rcx,%r9), %xmm2
0000000100001852 addss %xmm6, %xmm2
0000000100001856 addss 0x1060c(%rax,%rdi), %xmm2
000000010000185f mulss 0x4183c(%r13,%rax,4), %xmm2
0000000100001869 subss %xmm13, %xmm2
000000010000186e mulss 0x1060c(%rax,%r14), %xmm2
0000000100001878 movaps %xmm12, %xmm3
000000010000187c mulss %xmm2, %xmm3
0000000100001880 mulss %xmm2, %xmm2
0000000100001884 addss %xmm13, %xmm3
0000000100001889 movss %xmm3, 0x1060c(%rax,%r12)
0000000100001893 addss %xmm2, %xmm0
0000000100001897 addq $0x4, %rax
000000010000189b addq $0xc, %r9
000000010000189f movaps %xmm13, %xmm2
00000001000018a3 movaps %xmm9, %xmm7
00000001000018a7 movaps %xmm8, %xmm6
00000001000018ab cmpq %rbx, %r11
00000001000018ae jl 0x100001750
AVX:
0000000100001770 vmulss 0x41830(%r13,%rax,4), %xmm7, %xmm7
000000010000177a vmulss (%rcx,%r9), %xmm6, %xmm6
0000000100001780 vmovaps %xmm4, %xmm3
0000000100001784 vmovaps %xmm11, %xmm1
0000000100001788 vmovaps %xmm12, %xmm2
000000010000178c vmulss 0x41834(%r13,%rax,4), %xmm10, %xmm0
0000000100001796 vaddss %xmm0, %xmm7, %xmm4
000000010000179a vmovss 0x10610(%rax,%r15), %xmm12
00000001000017a4 vmulss 0x41838(%r13,%rax,4), %xmm12, %xmm5
00000001000017ae vmovss 0x20c14(%rax,%r15), %xmm7
00000001000017b8 vsubss 0x2080c(%rax,%r15), %xmm7, %xmm7
00000001000017c2 vsubss 0x40c(%rax,%r15), %xmm7, %xmm7
00000001000017cc vaddss 0x4(%rax,%r15), %xmm7, %xmm7
00000001000017d3 vmulss -0x8(%rdx,%r9), %xmm7, %xmm7
00000001000017da vaddss %xmm5, %xmm4, %xmm4
00000001000017de vaddss %xmm7, %xmm4, %xmm4
00000001000017e2 vmovss 0x10814(%rax,%r15), %xmm10
00000001000017ec vmovss 0x1040c(%rax,%r15), %xmm11
00000001000017f6 vsubss %xmm11, %xmm10, %xmm7
00000001000017fb vsubss 0x1080c(%rax,%r15), %xmm7, %xmm7
0000000100001805 vaddss %xmm8, %xmm7, %xmm7
000000010000180a vmulss -0x4(%rdx,%r9), %xmm7, %xmm7
0000000100001811 vaddss %xmm7, %xmm4, %xmm5
0000000100001815 vmovss 0x20a14(%rax,%r15), %xmm7
000000010000181f vmovss 0x20c(%rax,%r15), %xmm4
0000000100001829 vsubss %xmm4, %xmm7, %xmm0
000000010000182d vsubss 0x20a0c(%rax,%r15), %xmm0, %xmm0
0000000100001837 incq %r11
000000010000183a vaddss %xmm9, %xmm0, %xmm0
000000010000183f vmulss (%rdx,%r9), %xmm0, %xmm0
0000000100001845 vaddss %xmm0, %xmm5, %xmm0
0000000100001849 vmulss -0x8(%rcx,%r9), %xmm3, %xmm5
0000000100001850 vaddss %xmm5, %xmm0, %xmm0
0000000100001854 vmulss -0x4(%rcx,%r9), %xmm1, %xmm5
000000010000185b vaddss %xmm5, %xmm0, %xmm0
000000010000185f vaddss %xmm6, %xmm0, %xmm0
0000000100001863 vaddss 0x1060c(%rax,%rdi), %xmm0, %xmm0
000000010000186c vmulss 0x4183c(%r13,%rax,4), %xmm0, %xmm0
0000000100001876 vsubss %xmm2, %xmm0, %xmm0
000000010000187a vmulss 0x1060c(%rax,%r14), %xmm0, %xmm0
0000000100001884 vmulss %xmm0, %xmm0, %xmm5
0000000100001888 vmulss %xmm0, %xmm13, %xmm0
000000010000188c vaddss %xmm0, %xmm2, %xmm0
0000000100001890 vmovss %xmm0, 0x1060c(%rax,%r12)
000000010000189a vaddss %xmm5, %xmm14, %xmm14
000000010000189e addq $0x4, %rax
00000001000018a2 addq $0xc, %r9
00000001000018a6 vmovaps %xmm2, %xmm6
00000001000018aa vmovaps %xmm3, %xmm9
00000001000018ae vmovaps %xmm1, %xmm8
00000001000018b2 cmpq %rbx, %r11
00000001000018b5 jl 0x100001770
From a quick look over the asm, I think the issue is that Haswell can micro-fuse an indexed addressing mode with SSE MULSS (2-operand destructive destination), but not with AVX VMULSS.
IACA doesn't know this, and applies the Sandybridge micro-fusion / un-lamination rules, so its output is wrong for Haswell.
See my answer on https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes for the full details (although I have an unfinished edit that I should really post after discovering more patterns in what fuses and what doesn't).
The TL;DR is that Haswell introduced support for 3-input uops so FMA can be single-uop. Micro-fusion takes advantage of this, allowing some micro-fused ALU + load uops with an indexed addressing mode to stay micro-fused instead of un-laminating at issue.
But the only instructions that can stay micro-fused are instructions like add, paddd, or mulps, which have 2 operands and a read-write destination.
2-operand instructions with a write-only destination, like sqrtps (but not sqrtss) or pabsb (but not paddb), will un-laminate, and so will 3-operand instructions, even if the dest is the same as the middle source operand.
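To make that concrete, here's a small hand-written illustration (arbitrary registers, not taken from the himeno loop above) of how the rule plays out with an indexed addressing mode on HSW/SKL:

  # stays micro-fused: 2 operands, read-write destination
  mulss  0x10(%rax,%r15), %xmm0          # xmm0 *= mem
  addps  (%rdx,%r9,4), %xmm1             # xmm1 += mem

  # un-laminates at issue into 2 separate fused-domain uops
  vmulss 0x10(%rax,%r15), %xmm0, %xmm0   # 3-operand VEX form, write-only dest
  sqrtps (%rdx,%r9,4), %xmm2             # write-only destination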
All of the above applies to Skylake as well, BTW. I haven't found any micro-fusion differences between SKL and HSW, just from SnB to HSW.
And BTW, nobody has documented this anywhere else, AFAIK. Intel's optimization manual only mentions the SnB rules for un-lamination, without mentioning the HSW improvements. Agner Fog's guides don't even mention un-lamination at all.
(In reply to Peter Cordes from comment #4)
> And BTW, nobody has documented this anywhere else, AFAIK. Intel's
> optimization manual only mentions the SnB rules for un-lamination, without
> mentioning the HSW improvements. Agner Fog's guides don't even mention
> un-lamination at all.
Thanks, Peter!
Solving this bug should start with Intel fixing the docs and IACA. I'm not
going to take uarch-based shots-in-the-dark to try to solve this with compiler
hacks. If someone else wants to take that on, feel free.
(In reply to Sanjay Patel from comment #5)
> Solving this bug should start with Intel fixing the docs and IACA.
Yes, that would be nice :)
> I'm not
> going to take uarch-based shots-in-the-dark to try to solve this with
> compiler hacks. If someone else wants to take that on, feel free.
It might be a good idea to treat indexed addressing modes for non-
mov/movzx/movsx instructions as more expensive than normal if tuning
specifically for snb/ivb. (Pure load uops include broadcast-loads, though, and
don't need to avoid indexed addressing modes.)
And on later members of the SnB family, indexed addressing modes could be treated as more expensive at least for VEX ALU instructions; that would be a reasonable approximation of when they cost more, even if it misses PABSB and so on.
Note that stores have separate store-data/store-address uops that can micro-
fuse (except in some cases like vextracti128, or some ALU+store instructions on
some CPUs). And only non-indexed stores can use port7, so there can be
advantages to non-indexed stores even on Haswell (which can keep them micro-
fused).
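For example (arbitrary registers, not from this benchmark):

  movss  %xmm0, 0x10(%rcx)          # simple addressing: store-address uop can use p2/p3/p7
  movss  %xmm0, 0x10(%rcx,%rax,4)   # indexed: store-address uop limited to p2/p3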
In a copy+modify loop that can't / shouldn't fold the loads into an ALU
instruction (because the same data is needed multiple times), it can make sense
to address the src data *relative to the dst*, so the loads use indexed
addressing modes with (src_base - dst_base) in one register and current_dst in
another register to produce current_src. And current_dst is used directly for
stores. Then the loop overhead is only one pointer-increment + cmp / jb
against an end-pointer.
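A minimal sketch of that layout (hypothetical registers and a trivial scale-and-store body, just to show the addressing):

  # rsi = src_base - dst_base  (computed once, outside the loop)
  # rdi = current_dst, rdx = end of dst
  .Lcopy_loop:
      movss   (%rdi,%rsi), %xmm0     # load src via current_dst + (src_base - dst_base)
      mulss   %xmm1, %xmm0           # ...modify (data could be reused several times)...
      movss   %xmm0, (%rdi)          # non-indexed store: stays micro-fused, can use port 7
      addq    $4, %rdi               # single pointer increment
      cmpq    %rdx, %rdi
      jb      .Lcopy_loop            # loop overhead: one increment + cmp/jb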
Anyway, indexed addressing modes are valuable tools for saving total uops,
especially when not unrolling, so we don't want to just increase their "cost".
To make optimal code, we're going to need a model that knows about which
instructions can use indexed addressing modes cheaply and which can't. If LLVM
can't currently do that, then we can start thinking about how to implement that
now, while waiting for Intel to document it better. (Or for me to clean up my
SO answer posting more of my test results for more instructions...)
The April 2018 version of the Intel Optimization Manual now has the information you are looking for in Section 2.3.5.
(In reply to Craig Topper from comment #7)
> The April 2018 version of the Intel Optimization Manual now has the
> information you are looking for in Section 2.3.5
Thanks! But is this saying that unlamination is only a factor on Haswell uarch?
That contradicts Peter's statement in comment 4 about Skylake uarch.
Yeah, I thought the placement in the manual was odd. I think SKL is the same as Haswell here. I think prior to Haswell, having an index always caused un-lamination. Now it's dependent on the number of sources needed.
(In reply to Craig Topper from comment #9)
> Yeah, I thought the placement in the manual was odd. I think SKL is the same
> as Haswell here. I think prior to Haswell, having an index always caused
> un-lamination. Now it's dependent on the number of sources needed.
Yes, HSW and SKL are the same in my testing, and it matches what Intel
has finally gotten around to documenting. :)
Except their terminology sucks: they could have said a 3 *operand*
max, instead of a 3 *source* max, because they're including a
write-only destination as a source! (Of course, if they were good at
terminology, they would have called it delamination. But apparently
using normal English words was un-possible.)
But anyway, total number of separate operands is a simple rule that
fits everything I tested.
I hadn't noticed the un-lamination for cmp reg,[base+idx] in my
testing, though. That's subtle. add reg,[base+idx] stays fused
because reg is a read-write operand, even though it does also write
flags. (I tested just now on SKL, and that's real: cmp un-laminates,
add stays fused.)
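Spelled out in the same AT&T syntax as the dumps above (registers are arbitrary):

  cmpl   0x10(%rdi,%rcx,4), %eax    # reads eax + base + index, writes only flags: un-laminates
  addl   0x10(%rdi,%rcx,4), %eax    # eax is read-write: stays micro-fused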
----
In the SKL/HSW/SnB section of Intel's optimization manual, stuff
mentioned for one uarch applies to later ones, too, unless overridden
by something in the section for a later uarch.
It's unfortunate that the decoders can't recognize the special case of
destination = first source as a single read-write operand to enable
micro-fusion for VEX/EVEX in cases like vmulps xmm1,xmm1,[b+idx].
But maybe that's also how the uop format differentiates legacy-SSE
from VEX/EVEX that zero-extend into the full vector reg.