Rust-GPU / Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.

Atomics design doc and discussion #8

Open RDambrosio016 opened 2 years ago

RDambrosio016 commented 2 years ago

This issue serves as a design document and a discussion on how atomics will/should be implemented.

CUDA Background

CUDA has had atomics practically forever in the form of a few functions like atomicAdd, atomicCAS, etc. (see the CUDA documentation). It also has _system and _block variants of them.

This has always been the overwhelmingly popular way of doing atomics in CUDA, and for a while it was the only way. That changed with compute capability 7.x: sm_70 introduced the .sem qualifier on the atom PTX instruction, which lets users specify a memory ordering for atomic operations.

CUDA decided to expose this by replicating std::atomic as its own thing called cuda::std::atomic, a generic container that provides atomic operations with user-specified orderings on types such as int.

Usage of cuda::std::atomic

Despite NVIDIA pushing users toward cuda::std::atomic, it has not seen wide adoption, presumably because of the following reasons:

Importance of great atomics

Atomics are at the core of many algorithms, so it is imperative for a project of this scale to implement them once and implement them well. A poor implementation could leave users stuck with it forever, as happened in CUDA's case. Therefore, I believe we should take our time with atomics and get them right the first time.

Low level implementation

The low level implementation of such atomics is not very difficult; most of it can be taken from how cuda::std::atomic does it. It implements them in the following way:

If the CUDA arch is >= 7.0, it uses specialized PTX instructions via inline asm:

template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acq_rel_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acq_rel.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acquire_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acquire.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_relaxed_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.relaxed.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_release_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.release.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_volatile_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }

With SeqCst additionally issuing a sequentially consistent fence and then falling through to the acquire variant:

        switch (__memorder) {
          case __ATOMIC_SEQ_CST: __cuda_fence_sc_device();
          case __ATOMIC_CONSUME:
          case __ATOMIC_ACQUIRE: __cuda_fetch_add_acquire_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_ACQ_REL: __cuda_fetch_add_acq_rel_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_RELEASE: __cuda_fetch_add_release_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_RELAXED: __cuda_fetch_add_relaxed_32_device(__ptr, __tmp, __tmp); break;
          default: assert(0);
        }

This can very easily be replicated by us since we have full support for inline asm.
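As a rough illustration (not the final API; function names are placeholders and the exact asm details would need verification), here is what the sm_70+ path could look like in Rust for a device-scoped u32 fetch_add, mirroring the libcu++ code above. It assumes compilation for the nvptx64 target:

    // Minimal sketch, assuming nvptx64 and sm_70+; only the relaxed and
    // acquire variants are shown, the others follow the same pattern.
    use core::arch::asm;
    use core::sync::atomic::Ordering;

    #[inline(always)]
    unsafe fn fence_sc_device() {
        // Sequentially consistent, device-scoped fence (sm_70+).
        asm!("fence.sc.gpu;");
    }

    #[inline(always)]
    unsafe fn fetch_add_relaxed_u32_device(ptr: *mut u32, val: u32) -> u32 {
        let old: u32;
        asm!(
            "atom.add.relaxed.gpu.u32 {old}, [{ptr}], {val};",
            old = out(reg32) old,
            ptr = in(reg64) ptr,
            val = in(reg32) val,
        );
        old
    }

    #[inline(always)]
    unsafe fn fetch_add_acquire_u32_device(ptr: *mut u32, val: u32) -> u32 {
        let old: u32;
        asm!(
            "atom.add.acquire.gpu.u32 {old}, [{ptr}], {val};",
            old = out(reg32) old,
            ptr = in(reg64) ptr,
            val = in(reg32) val,
        );
        old
    }

    /// Dispatch on the ordering like libcu++ does: SeqCst issues a fence and
    /// then performs the acquire variant of the operation.
    pub unsafe fn fetch_add_u32_device(ptr: *mut u32, val: u32, order: Ordering) -> u32 {
        match order {
            Ordering::SeqCst => {
                fence_sc_device();
                fetch_add_acquire_u32_device(ptr, val)
            }
            Ordering::Acquire => fetch_add_acquire_u32_device(ptr, val),
            Ordering::Relaxed => fetch_add_relaxed_u32_device(ptr, val),
            // AcqRel and Release are omitted here; they map to the .acq_rel
            // and .release instruction variants in exactly the same way.
            _ => unimplemented!(),
        }
    }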

Otherwise, if the arch is below 7.0, it "emulates" the orderings with memory barriers around an ordering-less ("volatile") atomic:

        switch (__memorder) {
          case __ATOMIC_SEQ_CST:
          case __ATOMIC_ACQ_REL: __cuda_membar_device();
          case __ATOMIC_CONSUME:
          case __ATOMIC_ACQUIRE: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); __cuda_membar_device(); break;
          case __ATOMIC_RELEASE: __cuda_membar_device(); __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_RELAXED: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
          default: assert(0);
        }
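For comparison, a rough Rust sketch of the same emulation scheme, assuming the same nvptx64 inline-asm setup as the sketch above (names are placeholders):

    use core::arch::asm;
    use core::sync::atomic::Ordering;

    #[inline(always)]
    unsafe fn membar_device() {
        // Device-wide memory barrier, available on all architectures.
        asm!("membar.gl;");
    }

    #[inline(always)]
    unsafe fn fetch_add_volatile_u32_device(ptr: *mut u32, val: u32) -> u32 {
        // Ordering-less atom.add, analogous to the "volatile" libcu++ variant.
        let old: u32;
        asm!(
            "atom.add.u32 {old}, [{ptr}], {val};",
            old = out(reg32) old,
            ptr = in(reg64) ptr,
            val = in(reg32) val,
        );
        old
    }

    pub unsafe fn fetch_add_u32_device_emulated(ptr: *mut u32, val: u32, order: Ordering) -> u32 {
        match order {
            // SeqCst and AcqRel: barrier on both sides of the operation.
            Ordering::SeqCst | Ordering::AcqRel => {
                membar_device();
                let old = fetch_add_volatile_u32_device(ptr, val);
                membar_device();
                old
            }
            // Acquire: barrier after the operation.
            Ordering::Acquire => {
                let old = fetch_add_volatile_u32_device(ptr, val);
                membar_device();
                old
            }
            // Release: barrier before the operation.
            Ordering::Release => {
                membar_device();
                fetch_add_volatile_u32_device(ptr, val)
            }
            // Relaxed: just the bare operation.
            _ => fetch_add_volatile_u32_device(ptr, val),
        }
    }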

You can find libcu++'s code for this in CUDA_ROOT\include\cuda\std\detail\libcxx\include\support\atomic\atomic_cuda_generated.h for CUDA 11.5, and in CUDA_ROOT\include\cuda\std\detail\__atomic_generated for older versions.

That file provides these functions as intrinsics that the rest of libcu++ builds on top of:

template<class _Type, typename cuda::std::enable_if<sizeof(_Type)==4, int>::type = 0>
__device__ _Type __atomic_fetch_add_cuda(volatile _Type *__ptr, _Type __val, int __memorder, __thread_scope_device_tag) {
  /* ... */
}

Rust Intrinsic implementation

I propose we follow a similar approach of raw unsafe intrinsics, organized as follows:

sm_70+ intrinsics are implemented in cuda_std::atomic::intrinsics::sm_70, and emulated intrinsics are in cuda_std::atomic::intrinsics::emulated.

Wrappers of the sm-specific intrinsics are in cuda_std::atomic::intrinsics. For example:

pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32;
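A purely illustrative sketch of that layout (module names from the proposal above; how the sm_70 vs. emulated choice is made at compile time is deliberately left open here):

    pub mod intrinsics {
        /// Raw intrinsics using `.sem`-qualified `atom`/`fence` instructions (sm_70+).
        pub mod sm_70 { /* asm-based intrinsics like the sketch above */ }

        /// membar-based emulation for architectures below sm_70.
        pub mod emulated { /* volatile atomics wrapped in fences */ }

        // Per-type, per-scope wrappers such as `atomic_fetch_add_f32_device`
        // live at this level and pick one of the two implementations.
    }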

High level types

And finally, we expose high level types in cuda_std::atomic such as AtomicF32, AtomicF64, etc.
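As a very rough sketch (final names and API still to be bikeshed), such a type could be a thin wrapper over the proposed intrinsics; the crate::atomic::intrinsics path and the wrapper signature are taken from the proposal above:

    use core::cell::UnsafeCell;
    use core::sync::atomic::Ordering;

    use crate::atomic::intrinsics; // the wrapper module proposed above

    #[repr(transparent)]
    pub struct AtomicF32 {
        value: UnsafeCell<f32>,
    }

    // Safety: all access to the inner value goes through atomic instructions.
    unsafe impl Sync for AtomicF32 {}

    impl AtomicF32 {
        pub const fn new(value: f32) -> Self {
            Self { value: UnsafeCell::new(value) }
        }

        /// Device-scoped fetch_add with an explicit ordering.
        pub fn fetch_add(&self, val: f32, ordering: Ordering) -> f32 {
            unsafe { intrinsics::atomic_fetch_add_f32_device(self.value.get(), ordering, val) }
        }
    }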

Block atomics (e.g. BlockAtomicF32) will need to be unsafe. For device (and system) atomics, it is already up to the caller of the kernel to ensure buffers and kernels do not contain data races, so safe wrappers can rely on that contract. Block atomics offer no such guarantee: they only synchronize within a single thread block, so it would be very easy to accidentally cause data races if accesses are not strictly intra-threadblock.

Atomic types will only expose the operations that are actually allowed for their type and scope, for example, per the PTX ISA spec:

Compatibility with core atomics

Core's atomics bottom out in compiler intrinsics that the codegen backend must implement through builder methods such as:

    fn atomic_load(
        &mut self,
        ty: &'ll Type,
        ptr: &'ll Value,
        _order: AtomicOrdering,
        _size: Size,
    ) -> &'ll Value {
        /* ... */
    }

Alongside atomic_load there are atomic_store, atomic_rmw, atomic_cmpxchg, and a few others. We currently trap in all of these functions, partly because libnvvm doesn't support atomic instructions for many types, and partly because we want to bikeshed how to implement them nicely.
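For context, this is the kind of ordinary core atomic usage in kernel code that currently hits those trapping paths; fetch_add here is lowered through the backend's atomic_rmw:

    use core::sync::atomic::{AtomicU32, Ordering};

    // Plain core atomics: this works on the CPU today, but on the GPU it
    // currently reaches the trapping codegen methods described above.
    pub fn bump(counter: &AtomicU32) -> u32 {
        counter.fetch_add(1, Ordering::Relaxed) // lowered via `atomic_rmw`
    }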

However, as expected, things are not quite the same on the CPU and the GPU; there are some very important differences:

Because of these limitations, we have a few options for implementing atomics:

Implementation Roadmap

Atomics will likely be implemented incrementally. Most of the work is porting over the raw intrinsics; after that, the hard part is done and we can focus on the stable public API.

Device float atomics will come first, since they are by far the most used type of intrinsic. After that, the order will probably be:

Integer Device Atomics -> Float System Atomics -> Integer System Atomics -> Float Block Atomics -> Integer Block Atomics -> Anything that's missing

Feedback

I'd love to hear any feedback you have! We must make sure this is implemented once and implemented correctly.

bytesnake commented 2 years ago

After reading the proposal (bear in mind that I have never used atomics on the GPU side) and taking my personal use cases into consideration, I think that rust-gpu should not gloss over the architectural differences with an abstraction layer:

By the way, there is also crates.io/atomic_float, which adds AtomicF32 and AtomicF64 for x86 and other architectures.

RDambrosio016 commented 2 years ago

My plan is not to gloss over the differences; it's to expose GPU-specific atomics in cuda_std. However, I don't really want to do it only in cuda_std, because there is a lot of code that relies on core atomics on the CPU and would not work on the GPU, for example code that uses an atomic counter.

So I'd like to find a balance between interop with core atomics and GPU-specific atomics in cuda_std, such as perhaps defaulting to device atomics for core atomics, then exposing AtomicF32 and AtomicF64 in cuda_std that fall back to atomic_float on the CPU.
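A minimal sketch of that fallback idea, assuming the AtomicF32 type proposed above and the atomic_float crate mentioned by bytesnake:

    // Hypothetical re-export shim: use the GPU implementation when compiling
    // for NVPTX, and atomic_float's types everywhere else.
    #[cfg(target_arch = "nvptx64")]
    pub use crate::atomic::AtomicF32;

    #[cfg(not(target_arch = "nvptx64"))]
    pub use atomic_float::AtomicF32;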