ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Atomic float addition is using a slow algorithm #217

Closed · PipInSpace closed this issue 4 days ago

PipInSpace commented 2 weeks ago

The algorithm used for atomic addition of floating-point numbers in the OpenCL source code is sub-optimal; there is a faster algorithm. The old function:

void atomic_add_f_slow(volatile global float* addr, const float val) {
    union { // pun the same 32 bits as float and uint, since atomic_cmpxchg only works on integer types
        uint  u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr; // read the current value
    do {
        next.f32 = (expected.f32=current.f32)+val; // compute the new sum from the last observed value
        current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32); // try to swap it in
    } while(current.u32!=expected.u32); // retry if another thread has modified *addr in the meantime
}

The same can be achieved with the following function:

void atomic_add_f(volatile global float* addr, const float val) {
    float old = val; // the contribution that still has to be added
    // swap 0.0f into *addr and take its previous value out, add the contribution, and swap the sum back in;
    // if the second exchange returns a non-zero value, another thread deposited its sum in between the
    // two exchanges and it was just overwritten, so loop and add that value back as well
    while((old = atomic_xchg(addr, atomic_xchg(addr, 0.0f) + old)) != 0.0f);
}

(This code seems to originate from the following forum post: https://forums.developer.nvidia.com/t/atomicadd-float-float-atomicmul-float-float/14639)

I have tested the speed of both functions in the following scenarios:

  1. 1,000,000 threads access the exact same address:
    • Slow function kernel execution time: ~8200 ms
    • Fast function kernel execution time: ~2.6 ms
  2. 1,000,000 threads each access a different address (in this scenario no atomic addition would be needed):
    • Slow function kernel execution time: ~110 μs
    • Fast function kernel execution time: ~100 μs

The old algorithm is slightly faster when there are no conflicts between threads, but more than three orders of magnitude slower when conflicts occur. Both algorithms produce the same results (up to floating-point rounding, of course).
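For reference, here is a minimal sketch of what such benchmark kernels could look like (hypothetical kernel names, not the actual test code), each launched with 1,000,000 work items:

kernel void benchmark_same_address(global float* buffer) { // scenario 1: every thread adds to buffer[0]
    atomic_add_f(&buffer[0], 1.0f);
}
kernel void benchmark_different_addresses(global float* buffer) { // scenario 2: every thread adds to its own address
    const uint n = get_global_id(0);
    atomic_add_f(&buffer[n], 1.0f);
}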

ProjectPhysX commented 4 days ago

Hi @PipInSpace,

thank you so much for this suggestion! I wasn't aware of this much faster fallback emulation for atomic floating-point addition. It's ingenious to use one atomic_xchg to first swap in a 0.0f, and then a second atomic_xchg to swap in the sum, looping until the return value is 0.0f again. It's fully functional and fully conformant with the spec: surprisingly, atomic_xchg allows its parameters to be of type float. Of course the linked forum post is from the one and only psychocoder, developer of the legendary PIConGPU at HZDR!

Alongside the faster emulation, I have also added hardware-supported atomic floating-point addition for Nvidia/Intel/AMD GPUs, which is even faster. Note the evil \t in the PTX assembly line: it keeps the stringification macro + replacement rules from turning the space into \n, which would break the string literal by putting the closing " on a new line.
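For readers who don't want to dig through the repository, a rough sketch of how such a dispatch could look, assuming Nvidia inline PTX and the cl_ext_float_atomics extension (the exact extension checks, pragmas and macro details in the actual code may differ):

void atomic_add_f(volatile global float* addr, const float val) {
#if defined(cl_nv_pragma_unroll) // Nvidia GPUs: hardware atomic FP32 addition via inline PTX assembly
    float ret; asm volatile("atom.global.add.f32 %0,[%1],%2;" : "=f"(ret) : "l"(addr), "f"(val) : "memory"); // the PTX line with the \t trick mentioned above
#elif defined(__opencl_c_ext_fp32_global_atomic_add) // cl_ext_float_atomics: hardware atomic FP32 addition on supporting Intel/AMD GPUs
    atomic_fetch_add_explicit((volatile global atomic_float*)addr, val, memory_order_relaxed);
#else // fallback: the two-atomic_xchg emulation from above
    float old = val;
    while((old = atomic_xchg(addr, atomic_xchg(addr, 0.0f) + old)) != 0.0f);
#endif
}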

Feel free to grab the code for your experiments!

Kind regards, Moritz