joonsung-kim closed this issue 3 years ago
Note that the perf tool runs the benchmark in user space. If you use the user-space version of nanoBench (i.e., use nanoBench.sh instead of kernel-nanoBench.sh), the results are very similar to perf.
I do not know why the uops don't come from the uop cache when running the benchmark in kernel space. However, I don't think that the measurements are incorrect.
@andreas-abel
Thanks. With user-mode nanoBench, it works as I expected :). However, I still can't figure out why kernel-mode nanoBench gives these unexplained results. (Personally, I prefer to use kernel-mode nanoBench to minimize extra overhead.)
Is there any plan to fix this issue in kernel-mode nanoBench?
I don't think there is anything to be fixed in nanoBench, as I don't think there is anything wrong. If you don't like how the CPU behaves in kernel mode, you would need to contact AMD ;)
Yes, I also think there is nothing wrong with kernel-mode nanoBench. It would be better to contact AMD. Thanks for your reply :)
Hi.
I have tried to measure the performance counters related to the decoder (i.e., uops dispatched from the legacy x86 decoder <DeDisUopsFromDecoder.DecoderDispatched> or from the micro-op cache <DeDisUopsFromDecoder.OpCacheDispatched>). I tested a simple code snippet consisting of 8 multi-byte nops (each multi-byte nop is 4 bytes) without unrolling. I expected this snippet to result in a series of micro-op cache hits; however, the results show that all uops are dispatched from the legacy x86 decoder, not the micro-op cache.
command:
results (I slightly modified the source code to dump the absolute measured counter values):
I cannot understand why every instruction is decoded by the legacy x86 decoder.
I also checked with a simple test program consisting of the same code pattern (see below).
test.s
build command: <nasm -f elf64 test.s -o test.o; ld test.o -o test>
Then, I checked the performance counters with the perf tool.
The results show that most uops are dispatched from the micro-op cache (r01AA => dispatched from the legacy x86 decoder // r02AA => dispatched from the micro-op cache // r03AA => all uops).
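For readers unfamiliar with perf's raw event syntax, here is a minimal Python sketch of how the three codes above split into an event select and a unit mask. It assumes the usual x86 layout (low byte = event select, next byte = unit mask) and ignores AMD's extended event-select bits. The helper names `decode_raw_event` and `describe` are made up for illustration, and the bit meanings are taken from the mapping given in this post.

```python
# Unit-mask bits for event 0xAA, as described in the post above.
SOURCES = {0x01: "legacy x86 decoder", 0x02: "micro-op cache"}


def decode_raw_event(code: str) -> tuple[int, int]:
    """Split a perf raw event such as 'r01AA' into (event_select, unit_mask).

    Assumption: only the low 16 bits are used (bits 0-7 event select,
    bits 8-15 unit mask); extended event-select bits are not handled.
    """
    if not code.lower().startswith("r"):
        raise ValueError(f"not a raw event code: {code!r}")
    value = int(code[1:], 16)
    event_select = value & 0xFF
    unit_mask = (value >> 8) & 0xFF
    return event_select, unit_mask


def describe(code: str) -> str:
    """Return a one-line description of which uop sources a code counts."""
    event, umask = decode_raw_event(code)
    sources = [name for bit, name in SOURCES.items() if umask & bit]
    return f"{code}: event 0x{event:02X}, umask 0x{umask:02X} -> " + " + ".join(sources)


for code in ("r01AA", "r02AA", "r03AA"):
    print(describe(code))
```

Since 0x03 is 0x01 | 0x02, r03AA counts uops from both sources, which is why it serves as the "all uops" total.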
Why do nanoBench and perf show different results?
Sincerely, Joonsung Kim.