ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Atomic float addition is using a slow algorithm #217

Closed · PipInSpace closed this issue 4 days ago

PipInSpace commented 2 weeks ago

The algorithm used for atomic addition of floating-point numbers in the OpenCL source code is sub-optimal; there is a faster algorithm. The old function:

void atomic_add_f_slow(volatile global float* addr, const float val) {
    union { // pun the same 32 bits as float and uint, since atomic_cmpxchg only works on integer types
        uint  u32;
        float f32;
    } next, expected, current;
    current.f32 = *addr; // read the current value
    do {
        next.f32 = (expected.f32=current.f32)+val; // compute the new sum from the last observed value
        current.u32 = atomic_cmpxchg((volatile global uint*)addr, expected.u32, next.u32); // try to swap it in
    } while(current.u32!=expected.u32); // retry if another thread has modified *addr in the meantime
}

The same can be achieved with the following function:

void atomic_add_f(volatile global float* addr, const float val) {
    float old = val; // the contribution that still has to be added
    // swap 0.0f into *addr and take its previous value out, add the contribution, and swap the sum back in;
    // if the second exchange returns a non-zero value, another thread deposited its sum in between the
    // two exchanges and it was just overwritten, so loop and add that value back as well
    while((old = atomic_xchg(addr, atomic_xchg(addr, 0.0f) + old)) != 0.0f);
}

(This code seems to originate from the following forum post: https://forums.developer.nvidia.com/t/atomicadd-float-float-atomicmul-float-float/14639)

I have tested the speed of both functions in the following scenarios:

  1. 1,000,000 threads access the exact same address:
    • Slow function kernel execution time: ~8200 ms
    • Fast function kernel execution time: ~2.6 ms
  2. 1,000,000 threads each access a different address (in this scenario no atomic addition would be needed):
    • Slow function kernel execution time: ~110 μs
    • Fast function kernel execution time: ~100 μs

The old algorithm is slightly faster when there are no conflicts between threads, but more than three orders of magnitude slower when conflicts occur. Both algorithms produce the same results (up to floating-point rounding, of course).
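For reference, here is a minimal sketch of what such benchmark kernels could look like (hypothetical kernel names, not the actual test code), each launched with 1,000,000 work items:

kernel void benchmark_same_address(global float* buffer) { // scenario 1: every thread adds to buffer[0]
    atomic_add_f(&buffer[0], 1.0f);
}
kernel void benchmark_different_addresses(global float* buffer) { // scenario 2: every thread adds to its own address
    const uint n = get_global_id(0);
    atomic_add_f(&buffer[n], 1.0f);
}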

ProjectPhysX commented 4 days ago

Hi @PipInSpace,

thank you so much for this suggestion! I wasn't aware of this much faster fallback emulation for atomic floating-point addition. It's ingenious to use one atomic_xchg to first swap in a 0.0f, and then a second atomic_xchg to swap in the sum, looping until the return value is 0.0f again. It's fully functional and fully conformant with the spec: surprisingly, atomic_xchg allows its parameters to be of type float. Of course the linked forum post is from the one and only psychocoder, developer of the legendary PIConGPU at HZDR!

Alongside the faster emulation, I have also added hardware-supported atomic floating-point addition for Nvidia/Intel/AMD GPUs, which is even faster. Note the evil \t in the PTX assembly line: it keeps the stringification macro + replacement rules from turning the space into \n, which would break the string literal by putting the closing " on a new line.
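For readers who don't want to dig through the repository, a rough sketch of how such a dispatch could look, assuming Nvidia inline PTX and the cl_ext_float_atomics extension (the exact extension checks, pragmas and macro details in the actual code may differ):

void atomic_add_f(volatile global float* addr, const float val) {
#if defined(cl_nv_pragma_unroll) // Nvidia GPUs: hardware atomic FP32 addition via inline PTX assembly
    float ret; asm volatile("atom.global.add.f32 %0,[%1],%2;" : "=f"(ret) : "l"(addr), "f"(val) : "memory"); // the PTX line with the \t trick mentioned above
#elif defined(__opencl_c_ext_fp32_global_atomic_add) // cl_ext_float_atomics: hardware atomic FP32 addition on supporting Intel/AMD GPUs
    atomic_fetch_add_explicit((volatile global atomic_float*)addr, val, memory_order_relaxed);
#else // fallback: the two-atomic_xchg emulation from above
    float old = val;
    while((old = atomic_xchg(addr, atomic_xchg(addr, 0.0f) + old)) != 0.0f);
#endif
}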

Feel free to grab the code for your experiments!

Kind regards, Moritz