andreas-abel / nanoBench

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
http://www.uops.info
GNU Affero General Public License v3.0

Gather and uops/port stats #4

Closed travisdowns closed 4 years ago

travisdowns commented 5 years ago

I'm not sure if this is the right place to file this issue: I didn't find a GitHub repository for uops.info specifically, but if there's a better place, let me know.

There is something weird with port reporting for gather ops.

Consider VPGATHERDD, for example. It is reported as 1*p0+3*p23+1*p5, but this page and other pages clearly show that it sends 8 uops in total to p23.

Also, this:

With blocking instructions for ports {2, 3}:

    Code:

       0:   c4 c1 7a 6f 56 40       vmovdqu xmm2,XMMWORD PTR [r14+0x40]
       6:   c4 c1 7a 6f 5e 40       vmovdqu xmm3,XMMWORD PTR [r14+0x40]
       c:   c4 c1 7a 6f 66 40       vmovdqu xmm4,XMMWORD PTR [r14+0x40]
      12:   c4 c1 7a 6f 6e 40       vmovdqu xmm5,XMMWORD PTR [r14+0x40]
      18:   c4 c1 7a 6f 76 40       vmovdqu xmm6,XMMWORD PTR [r14+0x40]
      1e:   c4 c1 7a 6f 7e 40       vmovdqu xmm7,XMMWORD PTR [r14+0x40]
      24:   c4 41 7a 6f 46 40       vmovdqu xmm8,XMMWORD PTR [r14+0x40]
      2a:   c4 41 7a 6f 4e 40       vmovdqu xmm9,XMMWORD PTR [r14+0x40]
      30:   c4 41 7a 6f 56 40       vmovdqu xmm10,XMMWORD PTR [r14+0x40]
      36:   c4 41 7a 6f 5e 40       vmovdqu xmm11,XMMWORD PTR [r14+0x40]
      3c:   c4 c1 7a 6f 56 40       vmovdqu xmm2,XMMWORD PTR [r14+0x40]
      42:   c4 c1 7a 6f 5e 40       vmovdqu xmm3,XMMWORD PTR [r14+0x40]
      48:   c4 c1 7a 6f 66 40       vmovdqu xmm4,XMMWORD PTR [r14+0x40]
      4e:   c4 c1 7a 6f 6e 40       vmovdqu xmm5,XMMWORD PTR [r14+0x40]
      54:   c4 c1 7a 6f 76 40       vmovdqu xmm6,XMMWORD PTR [r14+0x40]
      5a:   c4 c1 7a 6f 7e 40       vmovdqu xmm7,XMMWORD PTR [r14+0x40]
      60:   c4 41 7a 6f 46 40       vmovdqu xmm8,XMMWORD PTR [r14+0x40]
      66:   c4 41 7a 6f 4e 40       vmovdqu xmm9,XMMWORD PTR [r14+0x40]
      6c:   c4 41 7a 6f 56 40       vmovdqu xmm10,XMMWORD PTR [r14+0x40]
      72:   c4 41 7a 6f 5e 40       vmovdqu xmm11,XMMWORD PTR [r14+0x40]
      78:   c4 82 75 90 04 36       vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1

    Init:

    VZEROALL;
    VPGATHERDD YMM0, [R14+YMM14], YMM1;
    VXORPS YMM14, YMM14, YMM14;
    VPGATHERDD YMM1, [R14+YMM14], YMM0

    warm_up_count: 100
    Results:
        Instructions retired: 21.00
        Core cycles: 15.00
        Reference cycles: 13.68
        UOPS_PORT2: 14.00
        UOPS_PORT3: 14.00

⇨ 3 μops that can only use ports {2, 3}

I don't understand the conclusion. I think the idea is that you have 20 instructions which send 1 uop each to p23, which would nominally execute in 10 cycles (20/2), and then you see how much adding the instruction under test increases the runtime, assuming the bottleneck is port pressure. Here, you get to 15 cycles, a difference of 5 cycles. How does that equal 3 uops?
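For concreteness, here is the back-of-the-envelope arithmetic I have in mind (the assumption that each vmovdqu contributes exactly one p2/p3 uop is mine):

    # Back-of-the-envelope check using only the counter values shown above.
    # Assumption: each of the 20 vmovdqu loads dispatches exactly one uop to p2/p3.
    blocker_p23_uops = 20 * 1                       # 20 blocking loads, 1 load uop each
    measured_p23 = 14.00 + 14.00                    # UOPS_PORT2 + UOPS_PORT3 = 28

    gather_p23_uops = measured_p23 - blocker_p23_uops
    print(f"p2/p3 uops attributable to VPGATHERDD: {gather_p23_uops:.0f}")   # 8

    # Cycle view: the 20 blocking uops alone would need about 20/2 = 10 cycles
    # on two load ports; the measured 15 cycles are 5 more than that baseline.
    baseline_cycles = blocker_p23_uops / 2          # 10
    extra_cycles = 15.00 - baseline_cycles          # 5
    print(f"extra cycles over the blocker-only baseline: {extra_cycles:.0f}")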

Thanks again for uops.info, it is great :).

travisdowns commented 5 years ago

I forgot to mention what I think is the reason for the weird behavior of gather, i.e., why the number of uops counted by uops_dispatched_port differs from the other uops counters, like retire_slots and uops_issued: I think what happens is that two uops are issued to p23 (maybe specifically one to each port, or maybe it just schedules that way), and then these uops replay 4 times each (in the case of gatherdd) to load all 8 elements (each time they execute, they load 1 element and accumulate the result into a temporary register).

So these instructions have dispatched uop counts that differ a lot from the front-end and retirement counts, much more than, say, micro-fused ops, which have at most a 2:1 ratio.
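In numbers, the hypothesis would look roughly like this (the 2 issued uops and 4 replays per uop are guesses, not measurements):

    # Rough model of the replay hypothesis (all numbers are guesses, not measurements).
    elements = 8                # VPGATHERDD ymm gathers 8 dword elements
    issued_load_uops = 2        # what the front end issues to p2/p3 (hypothesis)
    replays_per_uop = elements // issued_load_uops         # 4 executions per uop

    dispatched_p23 = issued_load_uops * replays_per_uop    # as seen by UOPS_DISPATCHED_PORT
    print("dispatched to p2/p3:", dispatched_p23)           # 8
    print("issued/retired load uops:", issued_load_uops)    # 2
    # The resulting 8:2 ratio is far beyond the 2:1 maximum that micro-fusion
    # alone can explain.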

andreas-abel commented 5 years ago

Thanks for reporting this.

I don't understand the conclusion. I think the idea is that you have 20 instructions which send 1 uop each to p23, which would nominally execute in 10 cycles (20/2), and then you see how much adding the instruction under test increases the runtime, assuming the bottleneck is port pressure. Here, you get to 15 cycles, a difference of 5 cycles. How does that equal 3 uops?

The runtime is not used for reaching the conclusion; the tool uses only the UOPS performance counters. In particular, it first benchmarks the instruction in isolation using the UOPS_EXECUTED.THREAD (B1.01) counter (on Skylake). This is what is reported as "Number of μops" on http://uops.info/html-instr/VPGATHERDD_YMM_VSIB_YMM_YMM.html, which is 5 in this example. I was assuming that this value would correspond to the sum of the UOPS_DISPATCHED_PORT* counters if the corresponding ports are blocked by other instructions. The conclusion (3) is then reached by subtracting from the "Number of μops" count (5) the previous conclusions (in the example, "One μop that can only use port 5" and "One μop that can only use port 0"), and then taking the minimum of the remaining uops and the uops on this port.
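In other words, the rule is roughly the following (a minimal sketch, not the actual analysis code; the 8 is the gather's share of the p2/p3 counters after subtracting the 20 blocking loads):

    # Minimal sketch of the inference step (not the actual uops.info code).
    def uops_for_port_group(total_uops, already_attributed, measured_on_group):
        remaining = total_uops - already_attributed
        return min(remaining, measured_on_group)

    # VPGATHERDD YMM example: UOPS_EXECUTED.THREAD = 5 in isolation, 1 uop already
    # attributed to p0 and 1 to p5, and 8 uops measured on p2/p3 (after subtracting
    # the blocking instructions).
    print(uops_for_port_group(total_uops=5,
                              already_attributed=1 + 1,
                              measured_on_group=8))   # -> 3, the reported value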

I think what happens is that two uops are issued to p23 (maybe specifically one to each port, or maybe it just schedules that way), and then these uops replay 4 times each (in the case of gatherdd) to load all 8 elements (each time they execute, they load 1 element and accumulate the result into a temporary register).

Interesting! I was not aware of these kinds of replays. Is there any reference for this?

I did notice before that when running, e.g., something like MOV [R14], RAX; MOV RAX, [R14] in isolation, I get more than 4 uops on port 4, which seems to be caused by some form of uop replay. However, this goes away if the sequence is run together with something that blocks port 4.

travisdowns commented 5 years ago

Interesting! I was not aware of these kinds of replays. Is there any reference for this?

Not that I'm aware of, specifically for gather; you could check the patents (I haven't done that yet).

Uops that depend on a load replay when the load misses in the L1: they first dispatch optimistically at the time the load is expected to come back from the L1; if it misses, they replay ~7 cycles later, when the load is expected to come back from the L2 (the L1 and L2 have fixed latencies); and if that fails too, they wait until the load finally comes back, so they can replay three times in that scenario.
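A toy model of that schedule (the concrete latencies are illustrative guesses, not measured values):

    # Toy timeline for the load-replay scheme described above; the latencies
    # (L1 hit after ~5 cycles, L2 hit ~7 cycles later) are illustrative guesses.
    def dispatch_times(hit_level, l1_latency=5, l2_extra=7, mem_ready=40):
        """Cycles at which a uop that depends on the load tries to execute."""
        times = [l1_latency]                 # optimistic dispatch for an L1 hit
        if hit_level == "L1":
            return times
        times.append(times[-1] + l2_extra)   # replay when L2 data is expected
        if hit_level == "L2":
            return times
        times.append(mem_ready)              # wait until the data actually arrives
        return times

    print(dispatch_times("L1"))    # [5]
    print(dispatch_times("L2"))    # [5, 12]
    print(dispatch_times("mem"))   # [5, 12, 40]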

They would also replay in various cases of store-forwarding speculation, as in your example: e.g., for a load hitting an earlier store, if the store data wasn't available yet, the load would replay later once the store data is available, etc. The "dispatched" counters count each time an operation tries to execute, and I guess the "executed" counter maybe only counts successful executions (i.e., doesn't count uops that will be replayed).

However, I am not aware of any discussion of gathers working this way, and it is different from the above examples: here the loads are probably succeeding, so the replay (if it exists) is not due to failure but due to there being more work to do (more elements). Maybe it is not replayed at all, but rather 8 uops are simply issued and the "executed" counter doesn't count them; I don't know.

travisdowns commented 5 years ago

Question: many of the pages have a counter simply called UOPS: but what counter is that exactly?

andreas-abel commented 5 years ago

Question: many of the pages have a counter simply called UOPS: but what counter is that exactly?

From Nehalem to Broadwell, it is UOPS_RETIRED.ALL (C2.01). From Skylake onwards, it is UOPS_EXECUTED.THREAD (B1.01).

travisdowns commented 5 years ago

Thanks @andreas-abel .

I don't know if you want to use hard-coded rules, but for this type of problem you could hard-code the rule that load uops only ever execute on ports 2 and 3, never anywhere else. Then you can simply count the number of uops on ports 2 and 3. When stores are involved, it is more complicated, because the AGU part of stores can use p23 and sometimes p7.
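A minimal sketch of what I mean (just an illustration of the rule, not nanoBench code):

    # Illustration of the rule (not nanoBench code): pure load uops only ever
    # execute on ports 2 and 3, so the instruction's load-uop count is simply
    # its share of the p2/p3 counters, with no cap from UOPS_EXECUTED.THREAD.
    def load_uops(p2_count, p3_count, blocker_p23_uops):
        return p2_count + p3_count - blocker_p23_uops

    # VPGATHERDD YMM example from the top of this thread:
    print(load_uops(p2_count=14, p3_count=14, blocker_p23_uops=20))   # -> 8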

andreas-abel commented 4 years ago

This should be fixed with the latest update.

travisdowns commented 4 years ago

Thanks @andreas-abel .

If you can share, what was the fix?

andreas-abel commented 4 years ago

I removed the restriction that the sum of the uops on the ports cannot be larger than UOPS_EXECUTED.THREAD.

However, for instructions with a lock prefix, I added a restriction that they cannot use more than one uop on the store data port. According to IACA, this is the case, but in measurements, they often use more than one, which is likely due to replays. However, they also use more uops on p0156 than they should; I'm not really sure what to do about these cases.
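Roughly, the adjusted constraints look like this (an illustrative sketch only, not the actual analysis code; port 4 is taken to be the store-data port, and the example counts are hypothetical):

    # Illustrative sketch of the adjusted constraints (not the actual analysis
    # code); port 4 is taken to be the store-data port.
    def apply_constraints(port_uops, has_lock_prefix):
        constrained = dict(port_uops)
        if has_lock_prefix:
            # Assume at most one store-data uop; higher measured counts on the
            # store-data port are treated as replays.
            constrained["p4"] = min(constrained.get("p4", 0), 1)
        # The sum of the per-port uops is no longer clamped to
        # UOPS_EXECUTED.THREAD.
        return constrained

    # Hypothetical measured counts for a lock-prefixed instruction:
    print(apply_constraints({"p0": 1, "p1": 1, "p4": 3, "p7": 1}, has_lock_prefix=True))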