Complex instruction throughput often underestimated in Haswell

travisdowns commented 3 years ago

If you look at instructions with 3 fused-domain uops (no memory operands) on Haswell, many show 1.0/2.0 for expected/measured throughput. Most of these are 1.0/1.0 on Skylake, as below:

Did the instruction throughput actually improve so much in Skylake for these instructions? I don't think so!

The effect comes from a combination of factors. One is that nearly all of these tests run out of the MITE (legacy) decoder (as reported by the perf counters). This is mostly because the uops are "dense" enough that they exceed the uop cache's limit of 18 uops per 32 bytes of code.
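To make the density argument concrete, here is a rough sketch that counts how many uops start in each 32-byte window of an unrolled xor/mul stream. The instruction sizes come from the disassembly in this thread; the uop counts (1 for the zeroing xor, 3 for the payload, 1 for a 15-byte nop), the helper name `uops_per_window`, and the window-counting rule itself are assumptions made for illustration, not how nanoBench or the hardware actually accounts for it.

```python
WINDOW_BYTES = 32
UOP_CACHE_LIMIT = 18  # the "18 uops in 32 bytes" rule mentioned above


def uops_per_window(instrs, repeats):
    """Return the uop count of each 32-byte window, in address order.

    instrs is a list of (size_in_bytes, uops) tuples that is repeated
    `repeats` times starting at address 0. An instruction is charged to
    the window in which it starts (a simplifying assumption).
    """
    counts = {}
    addr = 0
    for _ in range(repeats):
        for size, uops in instrs:
            window = addr // WINDOW_BYTES
            counts[window] = counts.get(window, 0) + uops
            addr += size
    return [counts[w] for w in sorted(counts)]


# xor rax,rax (3 bytes, assumed 1 uop) followed by mul r8d (3 bytes, assumed 3 uops)
pair = [(3, 1), (3, 3)]
dense = uops_per_window(pair, repeats=16)

# Hypothetical variant: one 15-byte nop (assumed 1 uop) after every pair
padded = uops_per_window(pair + [(15, 1)], repeats=16)

print("dense: ", dense, "exceeds limit:", max(dense) > UOP_CACHE_LIMIT)
print("padded:", padded, "exceeds limit:", max(padded) > UOP_CACHE_LIMIT)
```

In this model every window of the dense stream lands above 18 uops, while the padded variant stays well below it.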

Then, decoding limitations on Haswell kick in. Haswell can't decode in a 3-1 pattern (but Skylake can), so the tests that interleave dependency-breaking instructions with the payload instruction, like:

  0:    48 31 c0                xor    rax,rax
  3:    41 f7 e0                mul    r8d

end up taking 2 cycles to decode the two instructions. That's why throughputs of 2.0 appear all over the place in the Haswell results. Most of the other test variants can't crack 2.0 cycles because of dependency chains.

One approach to getting close to the true throughput would be to avoid breaking out of the uop cache, e.g., by using an occasional large nop to space out the instructions. For cases where you have unrolled 4 payload instructions with 4 dependency-breaking instructions, you could group the dependency-breaking (1 uop) and payload (complex) instructions for better decoding. For example:

xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
xadd eax, eax
xadd ebx, ebx
xadd ecx, ecx
xadd edx, edx

will decode more efficiently (5 cycles) than with full interleaving (8 cycles).
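The cycle counts above can be reproduced with a toy model of the legacy decoder. The rules below are assumptions made for this sketch (up to 4 instructions and 4 uops per cycle, a multi-uop instruction only in the first slot, and a flag for whether a 1-uop instruction may follow a 3-uop one in the same cycle); the function name `decode_cycles` is hypothetical, and this is not a documented description of either microarchitecture.

```python
def decode_cycles(uop_counts, allow_3_1):
    """Count decode cycles for a list of per-instruction uop counts.

    Assumed rules (a simplified model):
    - at most 4 instructions and 4 uops are decoded per cycle;
    - a multi-uop instruction can only be decoded in the first slot;
    - if allow_3_1 is False, a 3-uop instruction decodes alone.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        first = uop_counts[i]
        i += 1
        if first >= 3 and not allow_3_1:
            continue  # nothing else decodes in this cycle
        budget = max(0, 4 - first)  # uops left for the simple decoders
        slots = 1
        while i < n and slots < 4 and budget > 0 and uop_counts[i] == 1:
            i += 1
            slots += 1
            budget -= 1
    return cycles


grouped = [1, 1, 1, 1, 3, 3, 3, 3]      # 4x xor, then 4x xadd
interleaved = [1, 3, 1, 3, 1, 3, 1, 3]  # xor, xadd, xor, xadd, ...

for name, allow in (("no 3-1 (Haswell-like)", False), ("3-1 (Skylake-like)", True)):
    print(name,
          "grouped:", decode_cycles(grouped, allow),
          "interleaved:", decode_cycles(interleaved, allow))
```

With 3-1 decoding disabled, the model gives 5 cycles for the grouped order and 8 for the interleaved one, and a long interleaved xor/mul stream costs 2 cycles per pair. With 3-1 enabled, the interleaved order drops to about 1 cycle per pair.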

andreas-abel commented 3 years ago

This should be mostly fixed now.

travisdowns commented 3 years ago

Awesome!

andreas-abel / nanoBench

Complex instruction throughput often underestimated in Haswell #18