Rust-GPU / Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.

Atomics design doc and discussion #8

Open RDambrosio016 opened 2 years ago

RDambrosio016 commented 2 years ago

This issue serves as a design document and a discussion on how atomics will/should be implemented.

CUDA Background

CUDA has had atomics practically forever in the form of a few functions like atomicAdd, atomicCAS, etc. (see the CUDA documentation). It also has _system and _block variants of them.

This has always been the overwhelmingly popular way of doing atomics in CUDA, and for a while it was the only way. That changed with compute capability 7.x: sm_70 introduced the .sem qualifier on the atom PTX instruction, which lets users specify a memory ordering for atomic operations.

CUDA decided to expose this by replicating std::atomic as its own thing called cuda::std::atomic, a generic container that provides atomic operations with user-specified orderings on types such as int.

Usage of cuda::std::atomic

Despite NVIDIA pushing users toward cuda::std::atomic, it has not seen wide adoption, presumably because of the following reasons:

Importance of great atomics

Atomics are at the core of many algorithms, so it is imperative for a project of this scale to implement them once and implement them well. A poor implementation could leave users stuck with it forever, as happened in CUDA's case. Therefore, I believe we should take our time with atomics and get them right the first time.

Low level implementation

The low level implementation of such atomics is not very difficult; most of it can be taken from how cuda::std::atomic does it. It implements them in the following way:

If the CUDA arch is >= 7.0, it uses specialized PTX instructions via inline asm:

template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acq_rel_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acq_rel.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acquire_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acquire.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_relaxed_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.relaxed.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_release_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.release.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_volatile_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }

With SeqCst additionally issuing a sequentially consistent fence and then falling through to the acquire variant:

        switch (__memorder) {
          case __ATOMIC_SEQ_CST: __cuda_fence_sc_device();
          case __ATOMIC_CONSUME:
          case __ATOMIC_ACQUIRE: __cuda_fetch_add_acquire_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_ACQ_REL: __cuda_fetch_add_acq_rel_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_RELEASE: __cuda_fetch_add_release_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_RELAXED: __cuda_fetch_add_relaxed_32_device(__ptr, __tmp, __tmp); break;
          default: assert(0);
        }

This can very easily be replicated by us since we have full support for inline asm.
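As a rough illustration (not the final API; function names are placeholders and the exact asm details would need verification), here is what the sm_70+ path could look like in Rust for a device-scoped u32 fetch_add, mirroring the libcu++ code above. It assumes compilation for the nvptx64 target:

    // Minimal sketch, assuming nvptx64 and sm_70+; only the relaxed and
    // acquire variants are shown, the others follow the same pattern.
    use core::arch::asm;
    use core::sync::atomic::Ordering;

    #[inline(always)]
    unsafe fn fence_sc_device() {
        // Sequentially consistent, device-scoped fence (sm_70+).
        asm!("fence.sc.gpu;");
    }

    #[inline(always)]
    unsafe fn fetch_add_relaxed_u32_device(ptr: *mut u32, val: u32) -> u32 {
        let old: u32;
        asm!(
            "atom.add.relaxed.gpu.u32 {old}, [{ptr}], {val};",
            old = out(reg32) old,
            ptr = in(reg64) ptr,
            val = in(reg32) val,
        );
        old
    }

    #[inline(always)]
    unsafe fn fetch_add_acquire_u32_device(ptr: *mut u32, val: u32) -> u32 {
        let old: u32;
        asm!(
            "atom.add.acquire.gpu.u32 {old}, [{ptr}], {val};",
            old = out(reg32) old,
            ptr = in(reg64) ptr,
            val = in(reg32) val,
        );
        old
    }

    /// Dispatch on the ordering like libcu++ does: SeqCst issues a fence and
    /// then performs the acquire variant of the operation.
    pub unsafe fn fetch_add_u32_device(ptr: *mut u32, val: u32, order: Ordering) -> u32 {
        match order {
            Ordering::SeqCst => {
                fence_sc_device();
                fetch_add_acquire_u32_device(ptr, val)
            }
            Ordering::Acquire => fetch_add_acquire_u32_device(ptr, val),
            Ordering::Relaxed => fetch_add_relaxed_u32_device(ptr, val),
            // AcqRel and Release are omitted here; they map to the .acq_rel
            // and .release instruction variants in exactly the same way.
            _ => unimplemented!(),
        }
    }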

Otherwise, if the arch is below 7.0, it "emulates" the orderings with memory barriers around an ordering-less ("volatile") atomic:

        switch (__memorder) {
          case __ATOMIC_SEQ_CST:
          case __ATOMIC_ACQ_REL: __cuda_membar_device();
          case __ATOMIC_CONSUME:
          case __ATOMIC_ACQUIRE: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); __cuda_membar_device(); break;
          case __ATOMIC_RELEASE: __cuda_membar_device(); __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
          case __ATOMIC_RELAXED: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
          default: assert(0);
        }
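For comparison, a rough Rust sketch of the same emulation scheme, assuming the same nvptx64 inline-asm setup as the sketch above (names are placeholders):

    use core::arch::asm;
    use core::sync::atomic::Ordering;

    #[inline(always)]
    unsafe fn membar_device() {
        // Device-wide memory barrier, available on all architectures.
        asm!("membar.gl;");
    }

    #[inline(always)]
    unsafe fn fetch_add_volatile_u32_device(ptr: *mut u32, val: u32) -> u32 {
        // Ordering-less atom.add, analogous to the "volatile" libcu++ variant.
        let old: u32;
        asm!(
            "atom.add.u32 {old}, [{ptr}], {val};",
            old = out(reg32) old,
            ptr = in(reg64) ptr,
            val = in(reg32) val,
        );
        old
    }

    pub unsafe fn fetch_add_u32_device_emulated(ptr: *mut u32, val: u32, order: Ordering) -> u32 {
        match order {
            // SeqCst and AcqRel: barrier on both sides of the operation.
            Ordering::SeqCst | Ordering::AcqRel => {
                membar_device();
                let old = fetch_add_volatile_u32_device(ptr, val);
                membar_device();
                old
            }
            // Acquire: barrier after the operation.
            Ordering::Acquire => {
                let old = fetch_add_volatile_u32_device(ptr, val);
                membar_device();
                old
            }
            // Release: barrier before the operation.
            Ordering::Release => {
                membar_device();
                fetch_add_volatile_u32_device(ptr, val)
            }
            // Relaxed: just the bare operation.
            _ => fetch_add_volatile_u32_device(ptr, val),
        }
    }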

You can find libcu++'s code for this in CUDA_ROOT\include\cuda\std\detail\libcxx\include\support\atomic\atomic_cuda_generated.h for CUDA 11.5, and in CUDA_ROOT\include\cuda\std\detail\__atomic_generated for older versions.

That file provides these functions as intrinsics that the rest of libcu++ builds on top of:

template<class _Type, typename cuda::std::enable_if<sizeof(_Type)==4, int>::type = 0>
__device__ _Type __atomic_fetch_add_cuda(volatile _Type *__ptr, _Type __val, int __memorder, __thread_scope_device_tag) {
  /* ... */
}

Rust Intrinsic implementation

I propose we follow a similar approach of raw unsafe intrinsics, organized as follows:

sm_70+ intrinsics are implemented in cuda_std::atomic::intrinsics::sm_70, and emulated intrinsics are in cuda_std::atomic::intrinsics::emulated.

Wrappers of the sm-specific intrinsics are in cuda_std::atomic::intrinsics. For example:

pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32;
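A purely illustrative sketch of that layout (module names from the proposal above; how the sm_70 vs. emulated choice is made at compile time is deliberately left open here):

    pub mod intrinsics {
        /// Raw intrinsics using `.sem`-qualified `atom`/`fence` instructions (sm_70+).
        pub mod sm_70 { /* asm-based intrinsics like the sketch above */ }

        /// membar-based emulation for architectures below sm_70.
        pub mod emulated { /* volatile atomics wrapped in fences */ }

        // Per-type, per-scope wrappers such as `atomic_fetch_add_f32_device`
        // live at this level and pick one of the two implementations.
    }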

High level types

And finally, we expose high level types in cuda_std::atomic such as AtomicF32, AtomicF64, etc.
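As a very rough sketch (final names and API still to be bikeshed), such a type could be a thin wrapper over the proposed intrinsics; the crate::atomic::intrinsics path and the wrapper signature are taken from the proposal above:

    use core::cell::UnsafeCell;
    use core::sync::atomic::Ordering;

    use crate::atomic::intrinsics; // the wrapper module proposed above

    #[repr(transparent)]
    pub struct AtomicF32 {
        value: UnsafeCell<f32>,
    }

    // Safety: all access to the inner value goes through atomic instructions.
    unsafe impl Sync for AtomicF32 {}

    impl AtomicF32 {
        pub const fn new(value: f32) -> Self {
            Self { value: UnsafeCell::new(value) }
        }

        /// Device-scoped fetch_add with an explicit ordering.
        pub fn fetch_add(&self, val: f32, ordering: Ordering) -> f32 {
            unsafe { intrinsics::atomic_fetch_add_f32_device(self.value.get(), ordering, val) }
        }
    }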

Block atomics (e.g. BlockAtomicF32) will need to be unsafe. For device (and system) atomics, it is already up to the caller of the kernel to ensure buffers and kernels do not contain data races, so safe wrappers can rely on that contract. Block atomics offer no such guarantee: they only synchronize within a single thread block, so it would be very easy to accidentally cause data races if accesses are not strictly intra-threadblock.

Atomic types will only expose the operations that are actually allowed for their type and scope, for example, per the PTX ISA spec:

Compatibility with core atomics

Core's atomics bottom out in compiler intrinsics that the codegen backend must implement through builder methods such as:

    fn atomic_load(
        &mut self,
        ty: &'ll Type,
        ptr: &'ll Value,
        _order: AtomicOrdering,
        _size: Size,
    ) -> &'ll Value {
        /* ... */
    }

Alongside atomic_load there are atomic_store, atomic_rmw, atomic_cmpxchg, and a few others. We currently trap in all of these functions, partly because libnvvm doesn't support atomic instructions for many types, and partly because we want to bikeshed how to implement them nicely.
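For context, this is the kind of ordinary core atomic usage in kernel code that currently hits those trapping paths; fetch_add here is lowered through the backend's atomic_rmw:

    use core::sync::atomic::{AtomicU32, Ordering};

    // Plain core atomics: this works on the CPU today, but on the GPU it
    // currently reaches the trapping codegen methods described above.
    pub fn bump(counter: &AtomicU32) -> u32 {
        counter.fetch_add(1, Ordering::Relaxed) // lowered via `atomic_rmw`
    }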

However, as expected, things are not quite the same on the CPU and the GPU; there are some very important differences:

Because of these limitations, we have a few options for implementing atomics:

Implementation Roadmap

Atomics will likely be implemented incrementally. Most of the work is porting over the raw intrinsics; after that, the hard part is done and we can focus on the stable public API.

Device float atomics will come first, since they are by far the most used type of intrinsic. After that, the order will probably be:

Integer Device Atomics -> Float System Atomics -> Integer System Atomics -> Float Block Atomics -> Integer Block Atomics -> Anything that's missing

Feedback

I'd love to hear any feedback you have! We must make sure this is implemented once and implemented correctly.

bytesnake commented 2 years ago

After reading the proposal (bear in mind that I have never used atomics on the GPU side) and taking my personal use cases into consideration, I think that rust-gpu should not gloss over the architectural differences with an abstraction layer:

By the way, there is also crates.io/atomic_float, which adds AtomicF32 and AtomicF64 for x86 and other architectures.

RDambrosio016 commented 2 years ago

My plan is not to gloss over the differences; it's to expose GPU-specific atomics in cuda_std. However, I don't really want to do it only in cuda_std, because there is a lot of code that relies on core atomics on the CPU and would not work on the GPU, for example code that uses an atomic counter.

So I'd like to find a balance between interop with core atomics and GPU-specific atomics in cuda_std, such as perhaps defaulting to device atomics for core atomics, then exposing AtomicF32 and AtomicF64 in cuda_std that fall back to atomic_float on the CPU.
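A minimal sketch of that fallback idea, assuming the AtomicF32 type proposed above and the atomic_float crate mentioned by bytesnake:

    // Hypothetical re-export shim: use the GPU implementation when compiling
    // for NVPTX, and atomic_float's types everywhere else.
    #[cfg(target_arch = "nvptx64")]
    pub use crate::atomic::AtomicF32;

    #[cfg(not(target_arch = "nvptx64"))]
    pub use atomic_float::AtomicF32;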