If you look at the instructions with 3 fused-domain uops (no memory operands) on Haswell, many show 1.0/2.0 for expected/measured throughput. Most of these are 1.0/1.0 on Skylake, as below:
Did the instruction throughput actually improve that much on Skylake for these instructions? I don't think so!
The effect comes from a combination of factors. One is that nearly all of these tests run out of the MITE (legacy) decoder, as reported by the perf counters. This is mostly because the uops are "dense" enough that they exceed the 18-uops-per-32-bytes rule, so the code can't be served from the uop cache.
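To put a rough number on "dense", here is a back-of-the-envelope sketch in Python. It assumes the 3-byte `xor rax,rax` / `mul r8d` pair quoted in this post, takes `mul r8d` as one of the 3-uop instructions in question, and ignores alignment and instructions that straddle a 32-byte boundary. The function and constant names are made up for illustration.

```python
# (length in bytes, fused-domain uops) for each instruction in the repeated pair.
# The uop count for mul is an assumption taken from the "3 fused-domain uops" claim.
PAIR = [
    (3, 1),  # xor rax,rax
    (3, 3),  # mul r8d
]
WINDOW_BYTES = 32  # size of the region the rule applies to
UOP_LIMIT = 18     # the "18 uops in 32 bytes" limit


def uops_per_window(instrs, window_bytes=WINDOW_BYTES):
    """Average number of uops that land in one window when instrs repeats back to back."""
    total_bytes = sum(length for length, _ in instrs)
    total_uops = sum(uops for _, uops in instrs)
    return total_uops * window_bytes / total_bytes


density = uops_per_window(PAIR)
print(f"{density:.1f} uops per {WINDOW_BYTES} bytes")  # 21.3 uops per 32 bytes
print(density > UOP_LIMIT)                             # True
```

So an unrolled run of this pair averages about 21 uops per 32 bytes, which is over the limit of 18 and would explain why the tests end up on the legacy decode path.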
Then the decoding limitations on Haswell kick in. Haswell can't decode in a 3-1 pattern (but Skylake can), so the tests that interleave dependency-breaking instructions with the payload instruction, like:
    0:  48 31 c0    xor    rax,rax
    3:  41 f7 e0    mul    r8d
end up taking 2 cycles to decode the two instructions. That's why throughputs of 2.0 appear all over the place in the Haswell results. Most of the other test variants can't crack 2.0 cycles because of dependency chains.
One approach to getting close to the true throughput would be to avoid breaking out of the uop cache, e.g., by using an occasional large nop to space out the instructions. For cases where you have unrolled 4 payload instructions with 4 dependency-breaking instructions, you could group the dependency-breaking (1 uop) and payload (complex) instructions for better decoding. E.g.:
will decode more efficiently (5 cycles) than with full interleaving (8 cycles).
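The 5-versus-8 cycle counts can be reproduced with a toy model of the legacy decoder. This is only a sketch under my own simplifying assumptions, not a description of the real hardware: one complex decoder that must take the first instruction of each group, up to three simple decoders for 1-uop instructions, at most 4 uops per cycle, and a flag `allow_3_1` for whether a 3-uop instruction may share a cycle with a following 1-uop instruction (off for the Haswell-like case, on for the Skylake-like case). The function name and parameters are hypothetical.

```python
def decode_cycles(uop_counts, allow_3_1=False, max_uops=4, n_decoders=4):
    """Cycles a toy legacy-decode model needs for one pass over the instructions.

    uop_counts lists the fused-domain uop count of each instruction in program order.
    Instructions with more than max_uops uops are outside this model.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        # The first instruction of each cycle goes to the complex decoder.
        first = uop_counts[i]
        used_uops = first
        used_decoders = 1
        i += 1
        # Assumed restriction: without 3-1 support a 3-uop instruction decodes alone.
        alone = (first == 3 and not allow_3_1)
        # Simple decoders pick up following 1-uop instructions while limits allow.
        while (not alone and i < n and uop_counts[i] == 1
               and used_decoders < n_decoders and used_uops + 1 <= max_uops):
            used_uops += 1
            used_decoders += 1
            i += 1
        cycles += 1
    return cycles


interleaved = [1, 3] * 4        # xor, mul, xor, mul, ...
grouped = [1] * 4 + [3] * 4     # four 1-uop instructions, then four 3-uop ones

print(decode_cycles(interleaved, allow_3_1=False))  # 8
print(decode_cycles(grouped, allow_3_1=False))      # 5
print(decode_cycles(interleaved, allow_3_1=True))   # 5
```

In this model the grouped layout needs 5 cycles without 3-1 decoding, since the four 1-uop instructions go through in a single cycle and each 3-uop instruction takes one cycle on its own. The interleaved layout needs 8 cycles without 3-1 decoding, but only 5 once 3-1 is allowed, which is consistent with the same tests showing 1.0 on Skylake.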