alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Status of atomic operations on different platforms #1959

Open fwyzard opened 1 year ago

fwyzard commented 1 year ago

While reviewing the implementation of the atomic operations in SYCL, I started comparing what operations are available on CUDA, HIP and SYCL:

[image: table comparing the atomic operations available in CUDA, HIP, and SYCL]

What operations should be supported by Alpaka ?

j-stephan commented 1 year ago

My gut feeling is that we should mirror / mimic the atomic operations defined by the C++ standard. I guess that would require an implementation of std::atomic_ref at some point, though.

bernhardmgruber commented 1 year ago

> My gut feeling is that we should mirror / mimic the atomic operations defined by the C++ standard.

[image: the full set of atomic operations defined by the C++ standard]

Uhhmm, ..., let's do a smaller subset :)

Given @fwyzard's table, I think almost all of the listed columns and rows should be supported. The only weird one IMO is the inc/dec with range. But since we can emulate any atomic operation with a CAS, it's basically just a matter of development effort.

fwyzard commented 1 year ago

> Uhhmm, ..., let's do a smaller subset :)

The [u]int8_t and [u]int16_t would be annoying to implement, because none of the GPU runtimes (CUDA, HIP, SYCL) has 8-bit or 16-bit atomics, so even the CAS loop would need some extra bit masking.

But in turn that means that there shouldn't be any GPU code that relies on them, so it should be safe enough to leave them out.

> The only weird one IMO is the inc/dec with range.

Those are native to CUDA and HIP, so I think we should keep them, at least for uint32_t, which is the only type they support.

We have some use for them, though most of the time we set the range to 0xffffffff. But they can be implemented more efficiently than atomicAdd(ptr, 1), so it's good to keep them.

psychocoderHPC commented 1 year ago

IMO, as @fwyzard said, we should make sure the 32-bit and 64-bit methods are available, because the others are not supported on HIP/CUDA. Increment and decrement with range are nice to have, but on our side we have never had a use case for these functions.