Hi @PipInSpace,
thank you so much for this suggestion! I wasn't aware of this much faster fallback emulation for atomic floating-point addition. Ingenious to use one `atomic_xchg` to first swap in a `0.0f` and then another `atomic_xchg` to swap in the sum, repeating until the return value is `0.0f`. It's fully functional and fully conformant with spec: surprisingly, `atomic_xchg` allows its parameters to be of type `float`.
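For readers who want to see the two-exchange idea spelled out, here is a minimal host-side sketch in plain C11. It is not the code from the repository or the forum post: `atomic_exchange` on an `_Atomic float` stands in for OpenCL's `atomic_xchg`, and the name `atomic_add_f_xchg` is made up for illustration.

```c
#include <stdatomic.h>

/* Hypothetical host-side analogue of the two-exchange trick.
   C11 atomic_exchange stands in for OpenCL's atomic_xchg on a float. */
static void atomic_add_f_xchg(_Atomic float *addr, float val) {
    float pending = val; /* amount that still has to be added */
    do {
        /* Swap in 0.0f and take ownership of the current stored value. */
        float taken = atomic_exchange(addr, 0.0f);
        /* Swap in the sum. If no other thread wrote in between, the
           placeholder 0.0f comes back and the loop ends. Otherwise the
           returned value is another thread's contribution that was just
           overwritten, so it is added again in the next round. */
        pending = atomic_exchange(addr, taken + pending);
    } while (pending != 0.0f);
}
```

Both exchanges always succeed, so a thread never throws away a computed sum; under contention it only picks up extra values to re-add.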
Of course the linked forum post is from the one and only psychocoder, developer of the legendary PIConGPU at HZDR!
Alongside the faster emulation I have also added hardware-supported atomic floating-point addition for Nvidia/Intel/AMD GPUs, which is even faster. Notice the evil `\t` in the PTX assembly line, so that the stringification macro + replacement rules don't replace the space with `\n`, which would break the string literal by putting the closing `"` on a new line.
Feel free to grab the code for your experiments!
Kind regards, Moritz
The algorithm for atomic addition of floating-point numbers used in the OpenCL source code is sub-optimal. There is a faster algorithm.
The old function:
The same can be achieved with the following function:
(This code seems to originate from the following forum: https://forums.developer.nvidia.com/t/atomicadd-float-float-atomicmul-float-float/14639)
I have tested the speed of both functions in the following scenarios:
The old algorithm is slightly faster when there are no conflicts between threads, but it is up to an order of magnitude slower when conflicts do occur. Both algorithms produce the same results (up to some floating-point rounding error, of course).
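The old function is not reproduced in this text. For comparison, the sketch below shows the conventional compare-and-swap fallback for float addition, written as host-side C11. That the old code followed this pattern is an assumption, and the names `atomic_add_f_cas`, `float_to_bits` and `bits_to_float` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (no conversion). */
static inline uint32_t float_to_bits(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    return b;
}
static inline float bits_to_float(uint32_t b) {
    float f;
    memcpy(&f, &b, sizeof f);
    return f;
}

/* Assumed shape of a compare-and-swap based float add: the float lives in
   a 32-bit atomic integer, and the sum is only stored if the cell still
   holds the value the sum was computed from. */
static void atomic_add_f_cas(_Atomic uint32_t *addr, float val) {
    uint32_t expected = atomic_load(addr);
    uint32_t desired;
    do {
        desired = float_to_bits(bits_to_float(expected) + val);
        /* On failure, expected is refreshed with the current contents
           and the sum is recomputed from scratch. */
    } while (!atomic_compare_exchange_weak(addr, &expected, desired));
}
```

A plausible reading of the timings above is that every failed compare-and-swap discards its work and retries, so cost grows with the number of threads hitting the same address, while the uncontended path needs only a single atomic operation.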