mov r32,same on Alder Lake, Zen

amonakov commented 2 years ago

This is a report regarding the uops.info table, specifically latency figures for in-place zero extension.

There are separate experiments for mov r32, <other> (latency 0) and mov r32, <same> (latency 1) on Intel CPUs starting from Ivy Bridge but excluding Alder Lake. It appears that the behavior is unchanged on Alder Lake: in-place zero extension is still not move-eliminated.

https://uops.info/table.html?search=mov_%20(r32%2C%20r32)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_HSW=on&cb_ADLP=on&cb_ZEN2=on&cb_measurements=on&cb_doc=on&cb_base=on

https://uops.info/html-instr/MOV_8B_R32_R32.html https://uops.info/html-instr/MOV_89_R32_R32.html

My experiments indicate that AMD Zen 2 successfully eliminates in-place zero extension. For example, the following loop runs at one cycle per iteration:

.loop:
        mov     eax, eax
        inc     rax
        dec     ecx
        jnz     .loop

Many thanks for making and maintaining this compendium.

andreas-abel commented 2 years ago

I'm not sure I understand what the issue here is.

There are separate experiments for the mov r32, <same> case on Alder Lake and Zen 2 (see https://uops.info/html-tp/ADL-P/MOV_89_R32_R32-Measurements.html#sameReg, https://uops.info/html-tp/ZEN2/MOV_89_R32_R32-Measurements.html#sameReg).

On the summary pages (e.g., https://uops.info/html-instr/MOV_8B_R32_R32.html), the "with the same register for different operands" entries are only added if the experiments show that there is indeed a difference in this case.

amonakov commented 2 years ago

I was looking at the latency experiments under the "Latency operand 2 → 1" link; they don't show separate r32/same experiments:

https://uops.info/html-lat/ADL-P/MOV_89_R32_R32-Measurements.html https://uops.info/html-lat/ZEN2/MOV_89_R32_R32-Measurements.html

I don't quite understand how to read the throughput experiment log (or how it indicates zero latency).

Understood regarding Zen 2, but the issue regarding Alder Lake remains: the summary page seems to imply that move elimination always happens, while my understanding is that the old limitation is still there and mov r32, <same> still has latency 1. For instance, the following loop runs at 4 cycles per iteration (3 for crc32 + 1 for mov):

.L3:
        mov     eax, eax
        add     rsi, 8
        crc32   rax, QWORD PTR [rsi-8]
        cmp     rdx, rsi
        jne     .L3

while this loop runs at 3 cycles per iteration:

.L3:
        add     rsi, 8
        crc32   rax, QWORD PTR [rsi-8]
        cmp     rdx, rsi
        jne     .L3
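As a rough check of the arithmetic above, here is a minimal Python sketch. It is not a measurement and not nanoBench output: the function name `cycles_per_iteration` and the dictionary layout are illustrative assumptions. It treats each loop as bounded by its longest loop-carried dependency chain, using the latencies quoted in this comment (3 for crc32, 1 for a non-eliminated mov) plus 1 cycle for the add.

```python
# Toy latency-bound model (an assumption for illustration, not a simulator):
# a loop cannot run faster than its longest loop-carried dependency chain.

def cycles_per_iteration(chains):
    """Return the length in cycles of the longest loop-carried chain.

    `chains` maps a register name to the (instruction, latency) pairs
    that form that register's dependency chain across one iteration.
    """
    return max(sum(latency for _, latency in chain) for chain in chains.values())

# Latencies are the figures quoted in this thread, not new measurements.
with_mov = {
    "rax": [("mov eax, eax", 1), ("crc32 rax, [rsi-8]", 3)],
    "rsi": [("add rsi, 8", 1)],
}
without_mov = {
    "rax": [("crc32 rax, [rsi-8]", 3)],
    "rsi": [("add rsi, 8", 1)],
}
# Hypothetical case: what the model predicts if the mov were eliminated.
mov_eliminated = {
    "rax": [("mov eax, eax", 0), ("crc32 rax, [rsi-8]", 3)],
    "rsi": [("add rsi, 8", 1)],
}

print(cycles_per_iteration(with_mov))        # 4
print(cycles_per_iteration(without_mov))     # 3
print(cycles_per_iteration(mov_eliminated))  # 3
```

Under this model, the reported 4 cycles per iteration for the first loop is consistent with mov eax, eax contributing one cycle of latency to the rax chain. If the move were eliminated, the first loop would be expected to match the second at 3 cycles.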

Sorry, I don't have access to Alder Lake; I am going by measurements of these loops made by someone else (who was exploring and optimizing different CRC implementations).

andreas-abel commented 2 years ago

I see now what you mean.

The latency experiments for the "same reg" case are only performed if the throughput tests suggest that there is indeed a difference, which is not the case here, as none of the moves are eliminated during the throughput tests (this can be seen from the UOPS_EXECUTED.THREAD lines).

I have now run some further tests and found that whether 32-bit moves are eliminated depends on the instruction that last wrote the source register.

For example, mov eax, ebx is eliminated after or rbx, 1, but not after add rbx, 1. If the immediate of the add is larger than 1023, however, the moves are eliminated. This is probably related to the optimization described here: https://twitter.com/uops_info/status/1473807584490672130
