andreas-abel / uiCA

uops.info Code Analyzer
GNU Affero General Public License v3.0

simulation inaccuracy: port assignment #20

Closed amonakov closed 2 years ago

amonakov commented 2 years ago

I was on the fence about reporting this; in the end I figured I should, because the inaccuracy is significant, and I suspect this might be an implementation bug in the Python script rather than a gap in the reverse-engineering of the port assignment algorithm.

Consider the following testcase (short link: https://bit.ly/3pu17dy)

1:
movzbl (%rdx),%eax
add    $0x2,%rdx
add    %ecx,%eax
movzbl -0x1(%rdx),%ecx
add    %eax,%edi
add    %eax,%ecx
add    %ecx,%edi
cmp    %rsi,%rdx
jne 1b

uiCA models this as 3 cycles per iteration for SNB, with the second instruction (add $0x2,%rdx) always going to port 5 and getting delayed by 1 cycle because port 5 is already occupied by the fused cmp-jne from the previous issue group. The graph indicates that port 5 gets assigned 1.5x more instructions compared to ports 0 and 1.

In reality I'm seeing this loop run close to 2 cycles per iteration on SNB, and port assignment is quite even:

 Performance counter stats for './main':

         9,250,721       uops_dispatched_port.port_0
         9,620,564       uops_dispatched_port.port_1
         5,053,695       uops_dispatched_port.port_2
         5,056,202       uops_dispatched_port.port_3
            22,616       uops_dispatched_port.port_4
        11,392,109       uops_dispatched_port.port_5
        11,926,865      cycles
        45,291,707      instructions              #    3.80  insn per cycle

(the above is perf stat for the whole program, hence some overhead e.g. on port 4)
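To compare these whole-program counters with a per-iteration simulation, it helps to normalize them by the iteration count. The sketch below assumes 1000 × 5000 = 5,000,000 loop iterations, the figure given elsewhere in this thread; the variable names are illustrative, and the results still include the program overhead mentioned above.

```python
# Counter values copied from the perf stat output above (whole program).
port_uops = {
    "port_0": 9_250_721,
    "port_1": 9_620_564,
    "port_2": 5_053_695,
    "port_3": 5_056_202,
    "port_4": 22_616,
    "port_5": 11_392_109,
}
cycles = 11_926_865
instructions = 45_291_707

# Assumption: 1000 outer iterations x 5000 inner iterations, as described
# elsewhere in this thread.
iterations = 1000 * 5000

# Normalize every counter to "per loop iteration".
uops_per_iter = {port: count / iterations for port, count in port_uops.items()}
cycles_per_iter = cycles / iterations
insns_per_iter = instructions / iterations

for port, value in uops_per_iter.items():
    print(f"{port}: {value:.2f} uops/iteration")
print(f"cycles/iteration:       {cycles_per_iter:.2f}")
print(f"instructions/iteration: {insns_per_iter:.2f}")
```

The instruction count per iteration should come out near 9, matching the nine instructions in the loop body, which is a quick sanity check on the assumed iteration count.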

Exchanging the second and third instructions in the loop slows down the real execution, making port assignment less even, but improves the simulated result, making it more even.

andreas-abel commented 2 years ago

Note that the port usage and the throughput can in some cases depend significantly on the initial state of the hardware (e.g., previous instructions in the scheduler, the fill state of different buffers, the issue slot that the first instruction gets, whether the uops are immediately served from the DSB or the LSD or whether the first couple of iterations go through the decoders, etc.). The initial state during your measurements is unlikely to be exactly the same as the one that is assumed for the simulation.

Also, with 5,000,000 iterations, probably not all of the memory accesses are L1 cache hits, which can also influence the port assignment. If you replace the memory accesses with nops, you will probably get a result that is closer to the simulation.

> with the second instruction (add $0x2,%rdx) always going to port 5

The second instruction does not always go to port 5, but only in around 68% of the cases. This is the output that I get for your link:

┌───────────────────────┬────────┬───────┬─────────────────────────────────────────────────────┬───────┐
│ MITE   MS   DSB   LSD │ Issued │ Exec. │ Port 0   Port 1   Port 2   Port 3   Port 4   Port 5 │ Notes │
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────┼───────┤
│                    1  │   1    │   1   │                     1                               │       │ movzx eax, byte ptr [rdx]
│                    1  │   1    │   1   │  0.11     0.21                                0.68  │       │ add rdx, 0x2
│                    1  │   1    │   1   │  0.53     0.4                                 0.07  │       │ add eax, ecx
│                    1  │   1    │   1   │                              1                      │       │ movzx ecx, byte ptr [rdx-0x1]
│                    1  │   1    │   1   │  0.41     0.31                                0.28  │       │ add edi, eax
│                    1  │   1    │   1   │  0.19     0.32                                0.49  │       │ add ecx, eax
│                    1  │   1    │   1   │  0.4      0.43                                0.17  │       │ add edi, ecx
│                    1  │   1    │   1   │                                                1    │       │ cmp rdx, rsi
│                       │        │       │                                                     │   M   │ jnz 0xffffffffffffffea
├───────────────────────┼────────┼───────┼─────────────────────────────────────────────────────┼───────┤
│                    8  │   8    │   8   │  1.63     1.67      1        1                2.7   │       │ Total
└───────────────────────┴────────┴───────┴─────────────────────────────────────────────────────┴───────┘
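As a cross-check on the table, the per-instruction shares for ports 0, 1 and 5 can be summed and compared with the Total row, and with the load each port would carry if the six ALU uops were spread evenly. This is only a sketch: the numbers are copied from the table above, the two loads are left out because they use only ports 2 and 3, and the names are illustrative.

```python
# Shares of each uop on (port 0, port 1, port 5), copied from the table above.
shares = {
    "add rdx, 0x2":       (0.11, 0.21, 0.68),
    "add eax, ecx":       (0.53, 0.40, 0.07),
    "add edi, eax":       (0.41, 0.31, 0.28),
    "add ecx, eax":       (0.19, 0.32, 0.49),
    "add edi, ecx":       (0.40, 0.43, 0.17),
    "cmp rdx, rsi / jnz": (0.00, 0.00, 1.00),  # fused pair, port 5 only
}

# Column sums give the simulated uops per iteration on each port.
port0, port1, port5 = (sum(column) for column in zip(*shares.values()))

# If the six ALU uops were split evenly across the three ports,
# each port would carry this many uops per iteration.
even_load = (port0 + port1 + port5) / 3

# How much busier port 5 is than the average of ports 0 and 1.
imbalance = port5 / ((port0 + port1) / 2)

print(f"port 0: {port0:.2f}  port 1: {port1:.2f}  port 5: {port5:.2f}")
print(f"even split: {even_load:.2f} uops/port/iteration")
print(f"port 5 vs. ports 0/1: {imbalance:.2f}x")
```

The sums agree with the Total row up to rounding. The simulated port 5 load is well above the even split of 2 uops per port, which is consistent with the simulated 3 cycles per iteration versus the roughly 2 cycles reported from hardware.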

amonakov commented 2 years ago

> Also, with 5,000,000 iterations, probably not all of the memory accesses are L1 cache hits

The program made 1000 walks of a 10-kilobyte array (plus some overhead), so the loop almost always hit in L1.

(I am measuring cycles and instructions via perf_event_open, so I have two read syscalls bracketing the loop with 5000 iterations; I've added an outer loop with 1000 iterations as a quick'n'dirty way to retrieve port assignment stats)
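The arithmetic behind these figures can be spelled out as a short sketch. It assumes that each iteration advances the pointer by 2 bytes (from add $0x2,%rdx in the loop) and that the L1 data cache is 32 KiB, the usual size on Sandy Bridge; the variable names are illustrative.

```python
inner_iterations = 5000   # loop iterations between the two read syscalls
outer_iterations = 1000   # repetitions added to collect port statistics
bytes_per_iteration = 2   # from "add $0x2,%rdx" in the loop body

# Bytes touched by one walk of the array.
working_set_bytes = inner_iterations * bytes_per_iteration

# Total loop iterations over the whole run.
total_iterations = inner_iterations * outer_iterations

# Assumption: 32 KiB L1 data cache (typical for Sandy Bridge cores).
l1d_bytes = 32 * 1024

print(f"working set: {working_set_bytes} bytes")
print(f"total iterations: {total_iterations}")
print(f"fits in L1D: {working_set_bytes <= l1d_bytes}")
```

Under these assumptions only the first walk should miss in L1; the remaining 999 walks reuse the same 10,000 bytes.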

> If you replace the memory accesses with nops, you will probably get a result that is closer to the simulation

This indeed appears so, thanks (when replacing loads with zeroing idioms: https://bit.ly/3wdMYVB).

> The second instruction does not always go to port 5, but only in around 68% of the cases

My bad: looking at the trace table gave me that impression, and I completely forgot to double-check the table.

Thank you for the swift response! I'm not sure whether to close the issue or leave it open. If you think the case is not interesting and the simulation is working as intended, please close it for me.