Benchmark partialsums - worse perf with -march=haswell #43085

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR44115
Status NEW
Importance P enhancement
Reported by David Bolvansky (david.bolvansky@gmail.com)
Reported on 2019-11-22 06:08:51 -0800
Last modified on 2019-11-27 08:05:15 -0800
Version trunk
Hardware PC Linux
CC craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, peter@cordes.ca, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments 44115.c (1699 bytes, text/x-csrc)
Blocks
Blocked by
See also
Benchmark source: https://github.com/llvm/llvm-test-suite/blob/master/SingleSource/Benchmarks/BenchmarkGame/partialsums.c

clang cc.c -Ofast -lm
0m0,133s

clang cc.c -Ofast -march=haswell -lm
0m0,146s

Codegen: https://godbolt.org/z/sZpeCQ

clang cc.c -Ofast -fno-unroll-loops -lm
0m0,129s

So there are both a codegen issue and a loop-unrolling issue here.
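
For context, the benchmark's hot loop accumulates several independent partial
sums; a rough sketch of its shape (paraphrased, not a verbatim copy of the
test-suite file, and omitting the pow()/sin()/cos() terms) is:

/* Illustrative sketch only -- not the actual test-suite source. */
static void partial_sums(int n, double *out) {
    double harmonic = 0.0, zeta2 = 0.0, alt_harmonic = 0.0, sign = 1.0;
    for (int k = 1; k <= n; k++) {
        double dk = (double)k;
        harmonic     += 1.0 / dk;          /* sum of 1/k */
        zeta2        += 1.0 / (dk * dk);   /* sum of 1/k^2 */
        alt_harmonic += sign / dk;         /* sum of (-1)^(k+1)/k */
        sign = -sign;
    }
    out[0] = harmonic;
    out[1] = zeta2;
    out[2] = alt_harmonic;
}

Reduction loops of this shape are what the vectorizer, the unroller, and
(with -march=haswell) FMA contraction all act on.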
Quuxplusone commented 4 years ago
There are many potential suspects here.

My first guess was that auto-vectorization with -march=haswell uses 256-bit
ops, and that leads to frequency throttling or other problems. But that doesn't
seem to be the whole story: limiting the vectorizer to 128-bit ops with
-mprefer-vector-width=128 still shows a slowdown.

I haven't looked at the codegen yet, but an experiment suggests that FMA
codegen is to blame (testing on Haswell 4GHz, and I increased the loop count by
100x from the default code):

$ clang 44115.c -Ofast -mavx2 -mprefer-vector-width=128 && time ./a.out
user    0m7.676s

$ clang 44115.c -Ofast -mavx2 -mfma -mprefer-vector-width=128 && time ./a.out
user    0m8.389s
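
For reference, -mfma (implied by -march=haswell) lets clang contract a multiply
feeding an add into a single fused instruction. A minimal example of the
pattern (my own, not taken from the benchmark):

/* With -Ofast -mavx2 -mfma, clang emits fused multiply-adds (vfmadd*)
 * for this reduction; with FMA disabled it emits separate vmulpd/vaddpd
 * (or the scalar equivalents). */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* mul + add: a contraction candidate */
    return sum;
}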
Quuxplusone commented 4 years ago

I was staring at the code and measuring the FMA vs. no-FMA code for a long time and couldn't see how the FMA version could be slower.

So I used Intel Power Gadget to check CPU frequency... sure enough, the no-FMA version of the program is running at ~5% higher frequency (4.2GHz vs. 4.0GHz, i.e. 4.2/4.0 ≈ 1.05), so that explains most of the perf difference that I'm measuring. This is on an iMac running macOS 10.15.1.

Does anyone have a link to the current Intel docs/guidelines about vector/FP frequency throttling?
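
(Aside, for anyone reproducing on Linux rather than macOS: perf annotates the
cycle count with the average GHz over the run, which is another way to watch
for this. This is just a suggestion for reproducers, not what I used above:

$ perf stat -e task-clock,cycles ./a.out
)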

Quuxplusone commented 4 years ago

Just want to add some info:

gcc-trunk’s codegen for -march=haswell contains FMA too, but I did not observe a regression relative to plain -Ofast.

Quuxplusone commented 4 years ago

... well, for gcc there is quite a small regression after all; clang's gap is bigger.

Quuxplusone commented 4 years ago
i7 4720HQ

clang -Ofast -march=haswell -mno-fma cc.c  -lm
0m0,146s

clang -Ofast -march=haswell -mno-fma -mprefer-vector-width=128 cc.c  -lm
0m0,124s - matches -Ofast performance

clang -Ofast -march=haswell -mprefer-vector-width=128 cc.c  -lm
0m0,137s
Quuxplusone commented 4 years ago
(In reply to Sanjay Patel from comment #2)
> Does anyone have a link to the current Intel docs/guidelines about vector/FP
> frequency throttling?

I'm surprised mul/add would lead to less throttling than half that number of
FMA uops.

https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
has the best details I've seen for Xeon throttling, in @BeeOnRope's answer.
Travis Downs knows what he's doing with low-level CPU performance testing, and
the description of L0 vs. L1 and L2 frequency levels, and of high-throughput
vs. low-throughput work (e.g. latency-bound FMA), explains what I've observed
on Xeons.

I would *guess* that non-Xeon (aka "client") CPUs are at least similar.  My
desktop is an i7-6700k which never has to throttle for AVX2 instructions or
thermal limits, only based on how many cores are active.

An iMac might be dependent on thermal limits; be careful to control for that
in benchmarking, e.g. if you get it hot running one benchmark, it might be
close to running out of thermal headroom when you run the second.
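
To spell out the "latency-bound FMA" point with a sketch (my own example, not
the attached reducer): when the FMA sits on a loop-carried dependency chain and
the compiler doesn't unroll with enough independent accumulators, contracting
mul+add into one FMA can lengthen the critical path. On Haswell an FMA has
5-cycle latency, while a plain FP add has 3-cycle latency and the multiply can
run off the critical path:

/* Serial reduction: 'acc' is a loop-carried dependency.
 * With FMA:    acc feeds a vfmadd each iteration (~5 cycles/iter on HSW).
 * Without FMA: the vmulpd is independent of 'acc'; only the vaddpd
 *              (~3 cycles) stays on the critical path. */
double scaled_sum(const double *a, double x, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += a[i] * x;
    return acc;
}

That effect is separate from any frequency scaling; it would show up even at a
fixed clock.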
Quuxplusone commented 4 years ago
(In reply to Peter Cordes from comment #6)
> (In reply to Sanjay Patel from comment #2)
> > Does anyone have a link to the current Intel docs/guidelines about vector/FP
> > frequency throttling?
>
> I'm surprised mul/add would lead to less throttling than half that number of
> FMA uops.

Yes, I'm surprised too. I don't know why an FMA would trigger any frequency
limitations beyond a normal FMUL. But the results appear to be entirely
reproducible on my system. The system is relatively quiet except for the
process that I'm benchmarking, and I'm not hitting any thermal limits between
tests.

For reference, this is the Haswell model I'm testing with:
https://ark.intel.com/content/www/us/en/ark/products/80807/intel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html

> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

Thanks for the link!

I thought Intel had an official guideline for this at some point, but I'm not
finding it now...
Quuxplusone commented 4 years ago
(In reply to David Bolvansky from comment #5)
> i7 4720HQ
>
> clang -Ofast -march=haswell -mno-fma cc.c  -lm
> 0m0,146s
>
> clang -Ofast -march=haswell -mno-fma -mprefer-vector-width=128 cc.c  -lm
> 0m0,124s - matches -Ofast performance
>
> clang -Ofast -march=haswell -mprefer-vector-width=128 cc.c  -lm
> 0m0,137s

I can approximately repro these results.

Note: I managed to reduce the benchmark so that a *single* FMA instruction is
the only difference in the entire thing, and that still shows the perf
difference and frequency throttling.

This is the timing for that reduction:

$ clang 44115.c -Ofast -march=haswell -o hsw && time ./hsw
user    0m4.133s (running at 4GHz)

$ clang 44115.c -Ofast -march=haswell -mno-fma -o hswnofma && time ./hswnofma
user    0m4.133s  (no difference; still running at 4.0GHz)

$ clang 44115.c -Ofast -march=haswell -mprefer-vector-width=128 -o fma128 && time ./fma128
user    0m2.758s   (big difference; still running at 4.0GHz)

$ clang 44115.c -Ofast -march=haswell -mno-fma -mprefer-vector-width=128 -o nofma128 && time ./nofma128
user    0m2.546s   (best case; running at 4.2GHz)

So there are at least 3 factors in play:

1. Speed scaling (frequency throttling) based on FMA. I don't know what we can
do about that in the compiler. Create/set a CPU-model-based pref flag to avoid
FMA even though it's available? This would almost certainly cause regressions
on other benchmarks, so that's a dead end IMO.

2. Vectorization/codegen diffs due to 128-bit vs. 256-bit. This still needs
investigation. Conclusions may need to account for speed scaling difference
here as well.

3. Loop unrolling as mentioned in the original description. I have not looked
at that at all (a first isolation experiment is sketched just below this list).
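
For factor 3, the obvious first experiment (just the flag combination,
mirroring the original report; I have not run this yet) would be to keep the
haswell target but disable unrolling, and compare against the timings above:

$ clang 44115.c -Ofast -march=haswell -fno-unroll-loops -o hswnounroll && time ./hswnounroll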
Quuxplusone commented 4 years ago

Attached 44115.c (1699 bytes, text/x-csrc): reduced benchmark source

Quuxplusone commented 4 years ago

Intel has some documentation in section 2.2.3 of the optimization manual at intel.com/sdm but it’s about Skylake. I don’t recall any documentation for Haswell.

Quuxplusone commented 4 years ago
(In reply to Craig Topper from comment #10)
> Intel has some documentation in section 2.2.3 of the optimization manual at
> intel.com/sdm  but it’s about Skylake. I don’t recall any documentation for
> Haswell.

Ah, yes - that's what I was remembering.

There's a table that differentiates "heavy" vs. "light" as AVX2 FP/FMA vs.
{AVX2 int or AVX128}.

No mention of special treatment for AVX128 FMA. But apparently, that qualifies
as Level 1 (same as AVX2 Heavy)... and yes, this is on Haswell, so there really
is no doc for it.