Wrong bandwidth-saturation benchmark results due to compiler optimisations

zbigos commented 2 years ago

Hey, I've run the bandwidth-saturation benchmark on a Ryzen 9 5950X running Debian bullseye and got some concerning results:

./bandwidth-saturation 0 32 -> 224 ms
./bandwidth-saturation 1 32 -> 1979 ms

As provided, these benchmarks do ~85 GB of writes to memory. If the first number were correct, I'd be getting 280 GB/s of throughput to my RAM, which is almost three times more than should be possible.
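A rough way to redo this sanity check is to divide the amount of data written by the elapsed time. The sketch below takes the ~85 GB estimate and the two timings quoted above at face value; since the total is only approximate, the result is a ballpark rather than a measurement. The function names are hypothetical and not part of the repository.

```cpp
#include <cassert>
#include <cstdio>

// Implied throughput in GB/s: data written divided by elapsed time.
double implied_gb_per_s(double gigabytes, double milliseconds) {
    assert(milliseconds > 0.0);  // a zero or negative time makes no sense here
    return gigabytes / (milliseconds / 1000.0);
}

// Prints the implied throughput for both runs quoted in the post.
void print_implied_throughput() {
    const double total_gb = 85.0;  // approximate figure from the post
    std::printf("first arg 0: %.0f GB/s\n", implied_gb_per_s(total_gb, 224.0));
    std::printf("first arg 1: %.0f GB/s\n", implied_gb_per_s(total_gb, 1979.0));
}
```

Calling print_implied_throughput() from main shows both figures side by side. Only the slower run lands in a range that dual-channel desktop RAM could plausibly sustain.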

Disassembly of non-temporal thread-fn:

Disassembling shows that the REPETITIONS loop got optimized away. This is in contrast to the temporal version, which still has this loop:

I can also artificially manipulate the benchmark results by recompiling with a different REPETITIONS constant:

REPETITIONS=20 (original value)

REPETITIONS=1
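Why this comparison is telling: if every store is actually executed, the amount of data written grows linearly with REPETITIONS, so in a bandwidth-bound run the time should grow roughly in proportion. A minimal sketch of that expectation follows. It assumes the constant is a preprocessor macro that can be overridden at compile time (for example with the compiler's -D option), and all function and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

#ifndef REPETITIONS
#define REPETITIONS 20  // original value; assumed overridable, e.g. -DREPETITIONS=1
#endif

// Bytes a run is meant to write if no store is optimized away.
// Parameter names are illustrative, not taken from the repository.
constexpr std::uint64_t expected_bytes(std::uint64_t threads,
                                       std::uint64_t items_per_thread,
                                       std::uint64_t item_size,
                                       std::uint64_t repetitions = REPETITIONS) {
    return threads * items_per_thread * item_size * repetitions;
}

// Factor by which the runtime should change between two builds,
// assuming the run is limited purely by memory bandwidth.
double expected_time_ratio(int reps_a, int reps_b) {
    assert(reps_a > 0 && reps_b > 0);
    return static_cast<double>(reps_a) / reps_b;
}
```

If the measured ratio between the REPETITIONS=20 and REPETITIONS=1 builds is far from expected_time_ratio(20, 1), the repetition loop is probably not being executed as written.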

One way of fixing this (THAT I AM NOT SURE IS CORRECT) is to declare items as volatile Type. I'm too stupid to really argue about its correctness, or about the reason for the compiler eating away that loop only in the non-temporal version, but:

REPETITIONS=20 with line 20 changed to void thread_fn(volatile Type* items, size_t size)
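For readers without the source at hand, here is a minimal sketch of what such a function might look like with the changed signature. It is a reconstruction under assumptions, not the repository's code: the element type behind Type is a guess, and a plain store stands in where the non-temporal build would use a streaming store. The idea is that accesses through a volatile-qualified pointer count as observable behavior in C++, so the compiler has to emit every store in every repetition.

```cpp
#include <cassert>
#include <cstddef>

#ifndef REPETITIONS
#define REPETITIONS 20  // original value quoted in the post
#endif

using Type = std::size_t;  // assumption: the real element type may differ

// Writes 1 to every element, REPETITIONS times over.
// Because items points to volatile data, none of these stores may be
// merged or dropped, so the outer loop has to survive optimization.
void thread_fn(volatile Type* items, std::size_t size) {
    assert(items != nullptr || size == 0);
    for (int r = 0; r < REPETITIONS; r++) {
        for (std::size_t i = 0; i < size; i++) {
            items[i] = 1;  // plain store; a non-temporal build would stream here
        }
    }
}
```

A non-volatile buffer can be passed directly, since Type* converts implicitly to volatile Type*. Whether volatile is the right tool for this benchmark is a separate question, and it may also block optimizations such as vectorization of the inner loop.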

the assembly looks as follows (exactly like the temporal version, except that the mov instruction is changed to movnti)

EDIT: removed a wrong conclusion, since I had mistakenly assumed that more is better in this benchmark. Non-temporal stores are still superior.

jgarvin commented 2 years ago

@zbigos sorry to hijack, but what tool is making that cool disassembly view?

zbigos commented 2 years ago

@jgarvin it's Cutter

Kobzol / hardware-effects, issue #23