andreas-abel / uiCA

uops.info Code Analyzer
GNU Affero General Public License v3.0

Writeback conflicts #9

Open rygorous opened 3 years ago

rygorous commented 3 years ago

This is a case where uiCA predictions for SKL seem to be pretty far off. Pretty much all tools I know of get this one wrong, even though it only uses reg-reg operations.

Test case: https://bit.ly/3jlvOOJ

uiCA predicts a throughput of 4c/iteration; the actual observed throughput on a Skylake laptop (i7-6560U) is 6c/iteration. If you take out one instruction on the non-PSADBW critical path (say, comment out the paddd xmm2, xmm3), this does run at 4c/iteration on real HW, and uiCA agrees.

The actual computation here is nonsense; I was just trying to come up with a small repro.

The test case sets up two vector instructions with different latencies on the same port (p5 in this case) that would have to finish in the same cycle. They can't: as far as I know, the vector RF and bypass network can accept only one result per port per cycle. I do not know what the exact criteria are, nor why the penalty here is two cycles and not one. I do not know how often this occurs in practice, but I have hit cases in the past where it seemed to be a factor.
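To make the suspected mechanism concrete, here is a toy sketch of the "one result per port per cycle" constraint. It is not uiCA code and not a description of how the hardware actually arbitrates: the function names, the tuple layout, and the dispatch cycles are all made up for illustration, and the "delay dispatch by one cycle until the writeback slot is free" policy is an assumption. In particular, it does not reproduce the two-cycle penalty observed above.

```python
from collections import defaultdict


def find_writeback_conflicts(uops):
    """Return {(port, writeback_cycle): [names]} for slots claimed more than once.

    uops: iterable of (name, port, dispatch_cycle, latency) tuples (hypothetical layout).
    """
    slots = defaultdict(list)
    for name, port, dispatch, latency in uops:
        slots[(port, dispatch + latency)].append(name)
    return {slot: names for slot, names in slots.items() if len(names) > 1}


def dispatch_with_writeback_reservation(uops):
    """Greedy in-order model: delay a uop until its dispatch slot AND its
    writeback slot on the same port are both free.

    ASSUMPTION: a conflicting uop simply retries one cycle later. Real
    hardware may behave differently (the issue reports a 2-cycle penalty).
    Returns a list of (name, port, dispatch_cycle, writeback_cycle).
    """
    dispatch_busy = set()   # (port, cycle) pairs already used for dispatch
    writeback_busy = set()  # (port, cycle) pairs already used for writeback
    schedule = []
    for name, port, ready, latency in uops:
        cycle = ready
        while ((port, cycle) in dispatch_busy
               or (port, cycle + latency) in writeback_busy):
            cycle += 1
        dispatch_busy.add((port, cycle))
        writeback_busy.add((port, cycle + latency))
        schedule.append((name, port, cycle, cycle + latency))
    return schedule


# Illustrative input only: a 3-cycle uop and a 1-cycle uop on port 5 whose
# results would both be due in cycle 3. Dispatch cycles are invented.
example = [
    ("psadbw", 5, 0, 3),
    ("pshufd", 5, 2, 1),
]

if __name__ == "__main__":
    print(find_writeback_conflicts(example))
    print(dispatch_with_writeback_reservation(example))
```

With this input the first function reports both uops claiming port 5's writeback slot in cycle 3, and the second pushes the 1-cycle uop back by one cycle. A simulator that only tracks dispatch slots would see no problem here, which is consistent with the 4c prediction.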