Hey, I've run the bandwidth-saturation benchmark on a Ryzen 9 5950X (Debian bullseye) and got some concerning results:
The benchmarks as provided do ~85 GB of writes to memory. If the first number were correct, I'd be getting 280 GB/s of throughput to my RAM, which is almost three times more than should be possible.
Disassembly of the non-temporal thread_fn:
Upon disassembling, we see that the REPETITIONS loop got optimized away. This is in contrast to the temporal version, which has this loop:
I can also artificially manipulate the benchmark results by recompiling with a different REPETITIONS constant:
REPETITIONS=20 (original value)
REPETITIONS=1
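
To spell out the arithmetic behind the 280 GB/s figure, here is a minimal sketch. The function names are made up for illustration and GB means 1e9 bytes. It assumes that an eliminated repetition loop leaves exactly one pass over the buffer, which is a guess rather than something the disassembly is known to show.

```cpp
#include <cassert>
#include <cstddef>

// Bytes the benchmark is *assumed* to write: every repetition touches the
// whole buffer once. (Hypothetical helper, not part of the benchmark.)
double assumed_bytes(std::size_t buffer_bytes, unsigned repetitions) {
    return static_cast<double>(buffer_bytes) * repetitions;
}

// Throughput implied by a timing, in GB/s (1 GB = 1e9 bytes).
double implied_gb_per_s(double bytes_written, double seconds) {
    assert(seconds > 0.0);
    return bytes_written / seconds / 1e9;
}

// If the compiler collapses the repetition loop into a single pass, only
// 1/repetitions of the assumed bytes ever reach memory, so the real
// throughput is the implied figure divided by the repetition count.
double collapsed_gb_per_s(double implied, unsigned repetitions) {
    assert(repetitions > 0);
    return implied / repetitions;
}
```

Under that assumption, the implied 280 GB/s with REPETITIONS=20 would correspond to 280 / 20 = 14 GB/s of actual traffic, which is why changing the constant can move the reported result without changing the work done.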
One way of fixing this (THAT I AM NOT SURE IS CORRECT) is to declare items as volatile Type. I'm too stupid to really argue about its correctness, or about why the compiler eats that loop only in the non-temporal version, but:
REPETITIONS=20 with line 20 changed to void thread_fn(volatile Type* items, size_t size)
the assembly looks as follows (exactly like the temporal version, except the mov instruction is changed to movnti)
EDIT: removed a wrong conclusion, since I mistakenly assumed more is better in this benchmark. Non-temporal stores are still superior.
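
On why volatile helps: in C++ every access through a volatile lvalue counts as observable behaviour, so the compiler has to keep each store in each repetition. A different way to get a similar effect, sketched below, is a compiler-only barrier after each pass. This is a swapped-in technique and not what the benchmark or the fix above does. The element type, the store_nt helper and the use of _mm_stream_si64 are assumptions made for illustration. The empty asm statement is a GCC/Clang extension. I have not checked that this produces the same assembly as the volatile version.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__x86_64__)
#include <immintrin.h>  // _mm_stream_si64, _mm_sfence
#endif

using Type = long long;          // assumption: a 64-bit integer element type
constexpr int REPETITIONS = 20;  // value quoted in the post

// Hypothetical helper: non-temporal 64-bit store on x86-64 (movnti),
// plain store on other targets so the sketch still compiles there.
inline void store_nt(Type* dst, Type value) {
#if defined(__x86_64__)
    _mm_stream_si64(dst, value);
#else
    *dst = value;
#endif
}

void thread_fn(Type* items, std::size_t size) {
    assert(items != nullptr);
    for (int rep = 0; rep < REPETITIONS; ++rep) {
        for (std::size_t i = 0; i < size; ++i) {
            store_nt(items + i, static_cast<Type>(rep));
        }
        // Compiler-only barrier: tells the optimizer that memory reachable
        // from `items` may be read here, so this pass's stores cannot be
        // discarded as dead. It emits no instructions of its own.
        asm volatile("" : : "r"(items) : "memory");
    }
#if defined(__x86_64__)
    _mm_sfence();  // order the non-temporal stores before returning
#endif
}
```

Either way, looking at the disassembly (for example with objdump -d) is still the only way to confirm that the repetition loop actually survived.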