If I'm not mistaken, macro-fusion does not happen when the op/jcc pair is split exactly at a cache line boundary, i.e. the first instruction ends one cache line and the jcc starts the next (see E.2.2.1 Legacy Decode Pipeline in Intel's June 2021 optimization manual). This matches the JCC erratum mitigation behavior I see on my Skylake processor (DSB & LSD not disabled).
Since this scenario is not so uncommon, this may be a useful change to uiCA.
Here's sample output for alignment offsets 59, 60 and 61. Offset 60 prevents macro-fusion, and without fusion the JCC erratum mitigation no longer applies: the J and M notes disappear and the reported throughput goes from 2.00 to 1.00 cycles per iteration.
~/perso/uiCA$ ./uiCA.py test.o -arch SKL -alignmentOffset 59
Throughput (in cycles per iteration): 2.00
Bottleneck: Front End (Predecoder)
J - Block not in DSB due to JCC erratum
M - Macro-fused with previous instruction
┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┬───────┐
│ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 │ Notes │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│ 1 │ 1 │ 1 │ 1 │ J │ add rax, 0x1
│ │ │ │ │ M │ jnz 0xfffffffffffffffc
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│ 1 │ 1 │ 1 │ 1 │ │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┴───────┘
~/perso/uiCA$ ./uiCA.py test.o -arch SKL -alignmentOffset 60
Throughput (in cycles per iteration): 1.00
Bottlenecks: Front End, Port 6
┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┐
│ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│ 1 │ 1 │ 1 │ 0.33 0.33 0.33 │ add rax, 0x1
│ 1 │ 1 │ 1 │ 1 │ jnz 0xfffffffffffffffc
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┤
│ 2 │ 2 │ 2 │ 0.33 0.33 0.33 1 │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┘
~/perso/uiCA$ ./uiCA.py test.o -arch SKL -alignmentOffset 61
Throughput (in cycles per iteration): 2.00
Bottleneck: Front End (Predecoder)
J - Block not in DSB due to JCC erratum
M - Macro-fused with previous instruction
┌───────────────────────┬────────┬───────┬───────────────────────────────────────────────────────────────────────┬───────┐
│ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 Port 6 Port 7 │ Notes │
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│ 1 │ 1 │ 1 │ 1 │ J │ add rax, 0x1
│ │ │ │ │ M │ jnz 0xfffffffffffffffc
├───────────────────────┼────────┼───────┼───────────────────────────────────────────────────────────────────────┼───────┤
│ 1 │ 1 │ 1 │ 1 │ │ Total
└───────────────────────┴────────┴───────┴───────────────────────────────────────────────────────────────────────┴───────┘
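For reference, the three cases above can be reproduced with a small standalone Python sketch. This is not uiCA's code, just an illustration under stated assumptions: `add rax, 0x1` is taken to be 4 bytes and the short `jnz` 2 bytes, the alignment offset is the address of the `add` relative to a 64-byte-aligned base, and the only fusion rule modeled is the cache line split described above. The function names are made up for this example.

```python
CACHE_LINE = 64    # bytes per cache line
JCC_BOUNDARY = 32  # the JCC erratum concerns 32-byte boundaries


def split_at_cache_line(op_start, op_len):
    """True if the op ends a cache line and the jcc starts the next one."""
    return (op_start + op_len) % CACHE_LINE == 0


def touches_32b_boundary(start, length):
    """True if [start, start + length) crosses or ends on a 32-byte boundary."""
    end = start + length  # one past the last byte
    return start // JCC_BOUNDARY != end // JCC_BOUNDARY


def classify(offset, op_len=4, jcc_len=2):
    """Return (macro_fused, jcc_erratum_applies) for an op/jcc pair.

    Assumed lengths: 4 bytes for `add rax, 0x1`, 2 bytes for a short `jnz`.
    Only the cache line split is modeled; other fusion conditions are ignored.
    """
    fused = not split_at_cache_line(offset, op_len)
    if fused:
        # The fused pair is treated as one unit for the erratum check.
        affected = touches_32b_boundary(offset, op_len + jcc_len)
    else:
        # Without fusion, only the jcc itself is checked.
        affected = touches_32b_boundary(offset + op_len, jcc_len)
    return fused, affected


if __name__ == "__main__":
    for offset in (59, 60, 61):
        fused, affected = classify(offset)
        print(f"offset {offset}: fused={fused}, JCC erratum={affected}")
```

With these assumptions, offsets 59 and 61 come out as fused and affected by the erratum, while offset 60 comes out as neither, which lines up with the J and M notes in the output above.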