andreas-abel / nanoBench

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
http://www.uops.info
GNU Affero General Public License v3.0

Performance counters are not correctly measured in AMD ZEN series #14

Closed joonsung-kim closed 3 years ago

joonsung-kim commented 3 years ago

Hi.

I have tried to measure the performance counters related to the decoder (i.e., uops dispatched from the legacy x86 decoder <DeDisUopsFromDecoder.DecoderDispatched> or from the micro-op cache <DeDisUopsFromDecoder.OpCacheDispatched>). I tested a simple code snippet consisting of eight multi-byte nops (each 4 bytes long) without unrolling. I expected this snippet to result in a series of micro-op cache hits; however, the results show that all uops are dispatched from the legacy x86 decoder and none from the micro-op cache.

command

sudo ./kernel-nanoBench.sh -basic_mode -unroll_count 1 -loop_count 100000 -cpu 1 -asm "nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax" -config configs/cfg_Zen_all.txt | grep -i "dedisuops"

results (I slightly modified the source code to dump absolute measured counters)

DeDisUopsFromDecoder.DecoderDispatched: 10.00 (1000019)
DeDisUopsFromDecoder.OpCacheDispatched: 0.00 (0)
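
As a quick sanity check on how the two numbers on each line relate, here is a minimal Python sketch. It assumes that the first value is the absolute count (in parentheses) divided by loop_count * unroll_count; that normalization is an assumption that happens to match the output above, not something confirmed from the nanoBench source. The helper name per_iteration is made up for this example.

```python
# Values copied from the command line and the output above.
loop_count = 100_000
unroll_count = 1
decoder_uops_total = 1_000_019   # DeDisUopsFromDecoder.DecoderDispatched (absolute)
op_cache_uops_total = 0          # DeDisUopsFromDecoder.OpCacheDispatched (absolute)


def per_iteration(total, loop_count, unroll_count):
    """Assumed normalization: absolute count / (loop_count * unroll_count)."""
    return total / (loop_count * unroll_count)


decoder_per_iter = per_iteration(decoder_uops_total, loop_count, unroll_count)
op_cache_per_iter = per_iteration(op_cache_uops_total, loop_count, unroll_count)

print(f"DecoderDispatched per iteration: {decoder_per_iter:.2f}")
print(f"OpCacheDispatched per iteration: {op_cache_per_iter:.2f}")
```

Under that assumption, the per-iteration values reproduce the 10.00 and 0.00 reported above.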

I cannot understand why every instruction is decoded by the legacy x86 decoder.

I also checked with a simple test program (test.s) that uses the same code pattern (see below). Build command: <nasm -f elf64 test.s -o test.o; ld test.o -o test>

global _start

_start:
    mov rdi, 100000
    call test_uop_cache_hit
    mov rax, 60
    mov rdi, 0
    syscall

test_uop_cache_hit:
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax

    dec rdi
    jnz test_uop_cache_hit
    ret

Then, I checked the performance counters with the perf tool.

$ perf stat -e cycles,instructions,r01AA,r02AA,r03AA ./test

 Performance counter stats for './test':

            298349      cycles                                                      
           1037949      instructions              #    3.48  insn per cycle                                            
             86233      r01AA                                                       
            999280      r02AA                                                       
           1085721      r03AA                                                       

       0.000433346 seconds time elapsed

The results show that most uops are dispatched from the micro-op cache (r01AA: uops dispatched from the legacy x86 decoder; r02AA: uops dispatched from the micro-op cache; r03AA: all dispatched uops).
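
To make the raw event codes and the split between the two sources easier to read, here is a small Python sketch. It assumes the usual x86 layout of perf raw codes (low byte is the event select, next byte is the unit mask) and uses the counts from the perf output above. The helper name decode_raw_event is made up for this example.

```python
def decode_raw_event(raw):
    """Split a perf raw code such as 'r01AA' into (event_select, unit_mask).

    Assumes the common x86 layout: bits 7:0 hold the event select,
    bits 15:8 hold the unit mask.
    """
    config = int(raw[1:], 16)      # drop the leading 'r', parse the hex digits
    event_select = config & 0xFF
    unit_mask = (config >> 8) & 0xFF
    return event_select, unit_mask


# Counts copied from the perf stat output above.
counts = {"r01AA": 86233, "r02AA": 999280, "r03AA": 1085721}

for raw, count in counts.items():
    event_select, unit_mask = decode_raw_event(raw)
    print(f"{raw}: event=0x{event_select:02X} umask=0x{unit_mask:02X} count={count}")

total = counts["r03AA"]
decoder_share = counts["r01AA"] / total
op_cache_share = counts["r02AA"] / total
print(f"legacy decoder share: {decoder_share:.1%}")
print(f"micro-op cache share: {op_cache_share:.1%}")
```

All three codes select event 0xAA and differ only in the unit mask (0x01, 0x02, 0x03). With these counts, roughly 92% of the uops come from the micro-op cache and about 8% from the legacy decoder. The two partial counts do not add up exactly to the r03AA count; they fall short by 208.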

Why do nanoBench and perf show different results?

Sincerely, Joonsung Kim.

andreas-abel commented 3 years ago

Note that the perf tool runs the benchmark in user space. If you use the user-space version of nanoBench (i.e., use nanoBench.sh instead of kernel-nanoBench.sh), the results are very similar to perf.

I do not know why the uops don't come from the uop cache when running the benchmark in kernel space. However, I don't think that the measurements are incorrect.

joonsung-kim commented 3 years ago

@andreas-abel

Thanks. With user-mode nanoBench, it works as I expected :). However, I still can't figure out why kernel-mode nanoBench produces these unexplained results. (Personally, I prefer kernel-mode nanoBench because it minimizes extra overhead.)

Is there any plan to fix this issue in kernel-mode nanoBench?

andreas-abel commented 3 years ago

I don't think there is anything to be fixed in nanoBench, as I don't think there is anything wrong. If you don't like how the CPU behaves in kernel mode, you would need to contact AMD ;)

joonsung-kim commented 3 years ago

Yes, I agree that there seems to be nothing wrong with kernel-mode nanoBench. It would be better to contact AMD. Thanks for your reply :)