amonakov closed this issue 2 years ago
Note that the port usage and the throughput can in some cases depend significantly on the initial state of the hardware (e.g., previous instructions in the scheduler, the fill state of different buffers, the issue slot that the first instruction gets, whether the uops are immediately served from the DSB or the LSD or whether the first couple of iterations go through the decoders, etc.). The initial state during your measurements is unlikely to be exactly the same as the one that is assumed for the simulation.
Also, with 5,000,000 iterations, probably not all of the memory accesses are L1 cache hits, which can also influence the port assignment. If you replace the memory accesses with nops, you will probably get a result that is closer to the simulation.
> with the second instruction (add $0x2,%rdx) always going to port 5
The second instruction does not always go to port 5, but only in around 68% of the cases. This is the output that I get for your link:
┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────┬───────┐
│ MITE MS DSB LSD │ Issued │ Exec. │ Port 0 Port 1 Port 2 Port 3 Port 4 Port 5 │ Notes │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────┼───────┤
│ 1 │ 1 │ 1 │ 1 │ │ movzx eax, byte ptr [rdx]
│ 1 │ 1 │ 1 │ 0.11 0.21 0.68 │ │ add rdx, 0x2
│ 1 │ 1 │ 1 │ 0.53 0.4 0.07 │ │ add eax, ecx
│ 1 │ 1 │ 1 │ 1 │ │ movzx ecx, byte ptr [rdx-0x1]
│ 1 │ 1 │ 1 │ 0.41 0.31 0.28 │ │ add edi, eax
│ 1 │ 1 │ 1 │ 0.19 0.32 0.49 │ │ add ecx, eax
│ 1 │ 1 │ 1 │ 0.4 0.43 0.17 │ │ add edi, ecx
│ 1 │ 1 │ 1 │ 1 │ │ cmp rdx, rsi
│ │ │ │ │ M │ jnz 0xffffffffffffffea
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────┼───────┤
│ 8 │ 8 │ 8 │ 1.63 1.67 1 1 2.7 │ │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────┴───────┘
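As a quick sanity check on the table above, the per-instruction port shares can be summed and compared with the Total row. This is only a sketch. The mapping of the three fractions in each row to ports 0, 1 and 5 is an assumption (the column alignment is not fully recoverable from the pasted output, but those are the ALU ports on Sandy Bridge), as is the macro-fused cmp/jnz pair always landing on port 5. The variable names are made up for illustration.

```python
# Port shares for the ALU uops, copied from the table above.
# Assumption: the three numbers in each row are ports 0, 1 and 5, in that order.
alu_shares = {
    "add rdx, 0x2": (0.11, 0.21, 0.68),
    "add eax, ecx": (0.53, 0.40, 0.07),
    "add edi, eax": (0.41, 0.31, 0.28),
    "add ecx, eax": (0.19, 0.32, 0.49),
    "add edi, ecx": (0.40, 0.43, 0.17),
}
# Assumption: the macro-fused cmp/jnz pair always executes on port 5.
fused_branch_p5 = 1.0

# Sum each column to get the expected uops per iteration on each ALU port.
p0 = sum(s[0] for s in alu_shares.values())
p1 = sum(s[1] for s in alu_shares.values())
p5 = sum(s[2] for s in alu_shares.values()) + fused_branch_p5

print(f"port 0: {p0:.2f}  port 1: {p1:.2f}  port 5: {p5:.2f}")
print(f"share of 'add rdx, 0x2' on port 5: {alu_shares['add rdx, 0x2'][2]:.0%}")
```

Under those assumptions the sums come out at about 1.64, 1.67 and 2.69, which agrees with the Total row (1.63, 1.67, 2.7) up to rounding.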
> Also, with 5,000,000 iterations, probably not all of the memory accesses are L1 cache hits
The program made 1000 walks of a 10-kilobyte array (plus some overhead), so the loop almost always hit in L1.
(I am measuring cycles and instructions via perf_event_open, so I have two read syscalls bracketing the loop with 5000 iterations; I've added an outer loop with 1000 iterations as a quick'n'dirty way to retrieve port assignment stats)
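The arithmetic behind the L1 argument can be sketched as follows. The iteration counts come from the thread. The 2 bytes per iteration follow from the `add rdx, 0x2` in the loop. The 32 KiB L1 data cache size is the usual Sandy Bridge figure and is an assumption here, since the thread does not state it.

```python
# Numbers taken from the thread.
inner_iterations = 5000     # loop iterations between the two read syscalls
outer_iterations = 1000     # repetitions added to collect port assignment stats
bytes_per_iteration = 2     # rdx advances by 2 per iteration (add rdx, 0x2)

# Assumption: 32 KiB L1 data cache, the usual Sandy Bridge size.
l1d_bytes = 32 * 1024

working_set = inner_iterations * bytes_per_iteration
total_iterations = inner_iterations * outer_iterations

print("working set (bytes):", working_set)
print("total iterations:", total_iterations)
print("fits in L1D:", working_set <= l1d_bytes)
```

So the 5,000,000 iterations re-walk the same 10,000 bytes, which only need to be brought into L1 on the first pass.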
> If you replace the memory accesses with nops, you will probably get a result that is closer to the simulation
This indeed appears so, thanks (when replacing loads with zeroing idioms: https://bit.ly/3wdMYVB).
> The second instruction does not always go to port 5, but only in around 68% of the cases
My bad, looking at the trace table gave me that impression, and I completely forgot to double-check the table.
Thank you for the swift response! I'm not sure whether to close the issue or leave it open. If you think the case is not interesting and the simulation is working as intended, please close it for me.
I was on the fence about reporting this; in the end I figured I should, because the inaccuracy is significant, and I suspect this might be an implementation bug in the Python script rather than a gap in the reverse-engineering of the port assignment algorithm.
Consider the following testcase (short link: https://bit.ly/3pu17dy)
uica models this as 3 cycles per iteration for SNB, with the second instruction (add $0x2,%rdx) always going to port 5 and getting delayed by 1 cycle because port 5 is already occupied by fused cmp-jne from the previous issue group. The graph indicates that port 5 gets assigned 1.5x more instructions compared to ports 0 and 1.
In reality I'm seeing this loop run close to 2 cycles per iteration on SNB, and port assignment is quite even:
(the above is perf stat for the whole program, hence some overhead e.g. on port 4)
Exchanging the second and third instruction in the loop slows down the real execution, making port assignment less even, but improves the simulated result, making it more even.
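For context on the 2 versus 3 cycles per iteration comparison, here is a rough resource-bound calculation. It is a simplified sketch, not how uiCA computes its result. It assumes the usual Sandy Bridge model (4 fused-domain uops issued per cycle, ALU uops on ports 0, 1 and 5, loads on ports 2 and 3) and the uop mix of the loop as listed in the simulator table: five adds, two movzx loads, and one macro-fused cmp/jnz. Dependency chains are not modeled.

```python
from fractions import Fraction

# Assumed simplified Sandy Bridge resources.
ISSUE_WIDTH = 4   # fused-domain uops issued per cycle
ALU_PORTS = 3     # ports 0, 1 and 5
LOAD_PORTS = 2    # ports 2 and 3

# Uops in one loop iteration.
alu_uops = 5 + 1  # five adds plus the macro-fused cmp/jnz
load_uops = 2     # two movzx loads
issued_uops = alu_uops + load_uops

# Each resource gives a lower bound on cycles per iteration.
bounds = {
    "issue": Fraction(issued_uops, ISSUE_WIDTH),
    "alu ports": Fraction(alu_uops, ALU_PORTS),
    "load ports": Fraction(load_uops, LOAD_PORTS),
}
best_case = max(bounds.values())
print({name: str(b) for name, b in bounds.items()})
print("best case cycles/iteration:", best_case)

# If port 5 receives 1.5x the uops of port 0 and of port 1, the ALU uops
# split as x + x + 1.5x, so port 5 gets 1.5/3.5 of them.
p5_uops = Fraction(alu_uops) * Fraction(3, 2) / Fraction(7, 2)
print("port 5 uops/iteration at 1.5x skew:", float(p5_uops))
```

With perfectly even port assignment the bound is 2 cycles per iteration, in line with the measurement. With the 1.5x skew toward port 5, that port alone needs more than 2 cycles per iteration, which is consistent with the simulation reporting a slower loop.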