Open Quuxplusone opened 4 years ago
There are many potential suspects here.
My first guess was that auto-vectorization with -march=haswell uses 256-bit
ops, and that leads to frequency throttling or other problems. But that doesn't
seem to make a difference. We can limit this with -mprefer-vector-width=128,
and it still shows a slowdown.
I haven't looked at the codegen yet, but an experiment suggests that FMA
codegen is to blame (testing on Haswell 4GHz, and I increased the loop count by
100x from the default code):
$ clang 44115.c -Ofast -mavx2 -mprefer-vector-width=128 && time ./a.out
user 0m7.676s
$ clang 44115.c -Ofast -mavx2 -mfma -mprefer-vector-width=128 && time ./a.out
user 0m8.389s
I was staring at the code and measuring the FMA vs. no-FMA code for a long time and couldn't see how the FMA version could be slower.
So I used Intel Power Gadget to check CPU frequency...sure enough, the no-FMA version of the program is running at a ~5% higher frequency (4.2GHz vs. 4.0GHz), so that explains most of the perf difference that I'm measuring. This is on an iMac running macOS 10.15.1.
Does anyone have a link to the current Intel docs/guidelines about vector/FP frequency throttling?
Just want to add some info:
gcc-trunk’s codegen for haswell contains FMA too, but I did not observe a regression wrt plain -Ofast.
... well, for gcc there is quite a small regression; clang's gap is bigger.
i7 4720HQ
clang -Ofast -march=haswell -mno-fma cc.c -lm
0m0,146s
clang -Ofast -march=haswell -mno-fma -mprefer-vector-width=128 cc.c -lm
0m0,124s - matches -Ofast performance
clang -Ofast -march=haswell -mprefer-vector-width=128 cc.c -lm
0m0,137s
(In reply to Sanjay Patel from comment #2)
> Does anyone have a link to the current Intel docs/guidelines about vector/FP
> frequency throttling?
I'm surprised mul/add would lead to less throttling than half that number of
FMA uops.
https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
has the best details I've seen for Xeon throttling, in @BeeOnRope's answer.
Travis Downs knows what he's doing with low-level CPU performance testing
stuff, and his description of the L0 vs. L1 and L2 frequency levels, and of
high-throughput vs. low-throughput instruction mixes (e.g. latency-bound FMA),
explains what I've observed on Xeons.
I would *guess* that non-Xeon (aka "client") CPUs are at least similar. My
desktop is an i7-6700k which never has to throttle for AVX2 instructions or
thermal limits, only based on how many cores are active.
An iMac might be dependent on thermal limits; be careful to control for that
in benchmarking. e.g. if you get it hot running one benchmark, it might be
close to running out of thermal headroom when you run the 2nd.
(In reply to Peter Cordes from comment #6)
> (In reply to Sanjay Patel from comment #2)
> > Does anyone have a link to the current Intel docs/guidelines about vector/FP
> > frequency throttling?
>
> I'm surprised mul/add would lead to less throttling than half that number of
> FMA uops.
Yes, I'm surprised too. I don't know why an FMA would trigger any frequency
limitations beyond the normal FMUL. But the results appear to be entirely
reproducible on my system. The system is relatively quiet except for the
process that I'm benchmarking, and I'm not hitting any thermal limits between
tests.
For reference, this is the Haswell model I'm testing with:
https://ark.intel.com/content/www/us/en/ark/products/80807/intel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html
> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
Thanks for the link!
I thought Intel had an official guideline for this at some point, but I'm not
finding it now...
(In reply to David Bolvansky from comment #5)
> i7 4720HQ
>
> clang -Ofast -march=haswell -mno-fma cc.c -lm
> 0m0,146s
>
> clang -Ofast -march=haswell -mno-fma -mprefer-vector-width=128 cc.c -lm
> 0m0,124s - matches -Ofast performance
>
> clang -Ofast -march=haswell -mprefer-vector-width=128 cc.c -lm
> 0m0,137s
I can approximately repro these results.
Note: I managed to reduce the benchmark so that a *single* FMA instruction is
the only difference in the entire thing, and that still shows the perf
difference and frequency throttling.
This is the timing for that reduction:
$ clang 44115.c -Ofast -march=haswell -o hsw && time ./hsw
user 0m4.133s (running at 4GHz)
$ clang 44115.c -Ofast -march=haswell -mno-fma -o hswnofma && time ./hswnofma
user 0m4.133s (no difference; still running at 4.0GHz)
$ clang 44115.c -Ofast -march=haswell -mprefer-vector-width=128 -o fma128 && time ./fma128
user 0m2.758s (big difference; still running at 4.0GHz)
$ clang 44115.c -Ofast -march=haswell -mno-fma -mprefer-vector-width=128 -o nofma128 && time ./nofma128
user 0m2.546s (best case; running at 4.2GHz)
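For reference, a loop that reduces to a single-instruction codegen difference looks something like this (a hypothetical sketch of the shape, not the attached 44115.c):

```c
/* Hypothetical sketch: a serial dependency chain whose body compiles to a
 * single vfmadd with -mfma, or to a vmul + vadd pair with -mno-fma, so the
 * two binaries differ by essentially one instruction in the hot loop. */
double fma_chain(double acc, const double *x, long n) {
    for (long i = 0; i < n; ++i)
        acc = acc * x[i] + 1.0;  /* the contractible mul+add */
    return acc;
}
```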
So there are at least 3 factors in play:
1. Speed scaling (frequency throttling) based on FMA. I don't know what we can
do about that in the compiler. Create/set a CPU-model-based pref flag to avoid
FMA even though it's available? This would almost certainly cause regressions
on other benchmarks, so that's a dead end IMO.
2. Vectorization/codegen diffs due to 128-bit vs. 256-bit. This still needs
investigation. Conclusions may need to account for speed scaling difference
here as well.
3. Loop unrolling as mentioned in the original description. I have not looked
at that at all.
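On point 1: even without a new CPU-model-based pref flag, users can already suppress FMA formation per compile with existing clang flags (shown here as a flag fragment; 44115.c is the attached reduction):

```shell
# Keep the FMA ISA enabled but stop the compiler from contracting
# separate mul+add operations into fused multiply-adds:
clang 44115.c -Ofast -march=haswell -ffp-contract=off -o nofuse
# Or disable the FMA instruction set entirely:
clang 44115.c -Ofast -march=haswell -mno-fma -o nofma
```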
Attached 44115.c
(1699 bytes, text/x-csrc): reduced benchmark source
Intel has some documentation in section 2.2.3 of the optimization manual at intel.com/sdm but it’s about Skylake. I don’t recall any documentation for Haswell.
(In reply to Craig Topper from comment #10)
> Intel has some documentation in section 2.2.3 of the optimization manual at
> intel.com/sdm but it’s about Skylake. I don’t recall any documentation for
> Haswell.
Ah, yes - that's what I was remembering.
There's a table that differentiates "heavy" vs. "light" as AVX2 FP/FMA vs.
{AVX2 int or AVX128}.
No mention of special treatment for AVX128 FMA. But apparently, that qualifies
as Level 1 (same as AVX2 Heavy)...and yes, this is on Haswell, so there really
is no doc for it.