RDambrosio016 opened this issue 2 years ago
after reading the proposal (bear in mind that I have never used atomics on the GPU side) and taking my personal use-cases into consideration, I think that rust-gpu should not gloss over the architectural differences with an abstraction layer: `f128` support and `u8` for pruned models? `ndarray` or `rayon`?

btw there is also crates.io/atomic_float adding `AtomicF32` and `AtomicF64` for x86 and other architectures
My plan is not to gloss over the differences, it's to expose GPU-specific atomics in cuda_std. However, I don't really want to do it fully in cuda_std, because there is a lot of code that relies on core intrinsics on the CPU that would not work on the GPU, for example if it uses an atomic counter.

So I'd like to find a balance between interop with core atomics and GPU-specific atomics in cuda_std, such as perhaps defaulting to device atomics for core atomics, then exposing `AtomicF32` and `AtomicF64` in cuda_std that fall back to atomic_float on the CPU.
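As a rough illustration of that balance (a minimal sketch, not a settled design: `target_os = "cuda"` is assumed as the GPU-side cfg, and `cuda_std::atomic::AtomicF32` is the type proposed later in this issue, not something that exists today):

```rust
use core::sync::atomic::Ordering;

// GPU build: the AtomicF32 proposed later in this issue (does not exist yet),
// built on top of device atomics.
#[cfg(target_os = "cuda")]
use cuda_std::atomic::AtomicF32;

// CPU build: fall back to the atomic_float crate mentioned above.
#[cfg(not(target_os = "cuda"))]
use atomic_float::AtomicF32;

// The call site is identical on both targets; only the backing implementation differs.
pub fn bump(counter: &AtomicF32, weight: f32) {
    counter.fetch_add(weight, Ordering::Relaxed);
}
```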
This issue serves as a design document and a discussion on how atomics will/should be implemented.
CUDA Background
CUDA has had atomics for basically forever in the form of a few functions like `atomicAdd`, `atomicCAS`, etc. See the docs on it here. It also has `_system` and `_block` variants of them.

This has always been the overwhelmingly popular way of doing atomic things in CUDA, and for a while it was the only way, until compute 7.x: sm_70 introduced the `.sem` qualifier on the `atom` PTX instruction, which allows users to specify a specific memory ordering for atomic operations.

CUDA decided to implement this by replicating `std::atomic` as its own thing called `cuda::std::atomic`. Atomic provides a generic container for atomic operations on types such as int, and it offers atomic operations with user-specified orderings.

Usage of cuda::std::atomic
Despite NVIDIA pushing for users to use atomic, it has not seen wide adoption, presumably because of the following reasons:

- `cuda::std::atomic` is a mess of templates and inheritance, because CUDA wanted to make it compatible with the GPU, the CPU (with every compiler's weird atomic semantics), and user-defined functions. This yields weird errors and confusing dependency graphs.
- Existing code and learning resources overwhelmingly use `atomicAdd` and similar. Unless you are deeply knowledgeable about CUDA you would not switch to atomic, if you even knew it existed.

Importance of great atomics
Atomics are at the core of many algorithms, so it is imperative for a project of this scale to implement them once and implement them well; otherwise users might be stuck with a poor implementation forever, as in CUDA's case. Therefore, I believe we should take our time with atomics and get them right the first time.
Low level implementation
The low level implementation of such atomics is not very difficult; it can mostly be taken from how `cuda::std::atomic` does it at the low level. It implements them in the following way:

If the CUDA arch is >= 7.0, it uses specialized PTX instructions through inline asm, with seq_cst additionally emitting a fence before the operation. This can very easily be replicated by us, since we have full support for inline asm; a rough sketch of what that could look like on our side is below.
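A minimal sketch of the sm_70+ path, assuming inline PTX asm on the CUDA target; the function name and signature are illustrative rather than an existing cuda_std API, and the instruction follows the documented `atom{.sem}{.scope}.op.type` form:

```rust
/// Relaxed, device-scoped fetch_add on a u32. sm_70+ only, since it relies on
/// the `.sem` (here `relaxed`) and `.scope` (here `gpu`) qualifiers.
#[cfg(target_os = "cuda")]
pub unsafe fn fetch_add_u32_relaxed_device(ptr: *mut u32, val: u32) -> u32 {
    let old: u32;
    core::arch::asm!(
        "atom.relaxed.gpu.add.u32 {old}, [{ptr}], {val};",
        old = out(reg32) old,
        ptr = in(reg64) ptr,
        val = in(reg32) val,
    );
    old
}
```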
Otherwise, if the arch is less than 7.0, it "emulates" the orderings with barriers around plain atomic instructions.

You can find the code for this in `CUDA_ROOT\include\cuda\std\detail\libcxx\include\support\atomic\atomic_cuda_generated.h` for CUDA 11.5, and `CUDA_ROOT\include\cuda\std\detail\__atomic_generated` for older versions. That file provides functions as intrinsics that the rest of libcu++ builds off of.
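And a sketch of the pre-sm_70 "emulated" flavour, again with illustrative names; the fence placement here (operation first, then a device-wide `membar`) is the usual acquire pattern and is an assumption, not a copy of the libcu++ code:

```rust
/// Acquire, device-scoped fetch_add on a u32 for arches below sm_70, where the
/// `.sem` qualifier is unavailable: a plain atom followed by a barrier.
#[cfg(target_os = "cuda")]
pub unsafe fn fetch_add_u32_acquire_device_emulated(ptr: *mut u32, val: u32) -> u32 {
    let old: u32;
    core::arch::asm!(
        "atom.add.u32 {old}, [{ptr}], {val};",
        // device-scope barrier supplies the acquire side of the ordering
        "membar.gl;",
        old = out(reg32) old,
        ptr = in(reg64) ptr,
        val = in(reg32) val,
    );
    old
}
```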
Rust Intrinsic implementation
I propose we follow a similar approach of raw unsafe intrinsics:

- sm_70+ intrinsics are implemented in `cuda_std::atomic::intrinsics::sm_70`, emulated intrinsics are in `cuda_std::atomic::intrinsics::emulated`.
- Wrappers of the sm-specific intrinsics are in `cuda_std::atomic::intrinsics`, for example something along the lines of the sketch below.
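A rough layout sketch; the module and function names follow the proposal above, the bodies are stubbed, and the `compute_70` cfg flag is a hypothetical stand-in for however the target compute capability ends up being selected:

```rust
pub mod intrinsics {
    /// sm_70+ intrinsics: inline asm using the `.sem` qualifier.
    pub mod sm_70 {
        pub unsafe fn fetch_add_f32_relaxed_device(_ptr: *mut f32, _val: f32) -> f32 {
            todo!("atom.relaxed.gpu.add.f32 via inline asm")
        }
    }

    /// Pre-sm_70 intrinsics: plain atomics bracketed by membar fences.
    pub mod emulated {
        pub unsafe fn fetch_add_f32_relaxed_device(_ptr: *mut f32, _val: f32) -> f32 {
            todo!("atom.add.f32 plus membar via inline asm")
        }
    }

    /// Arch-agnostic wrapper that the high level types call into.
    pub unsafe fn fetch_add_f32_relaxed_device(ptr: *mut f32, val: f32) -> f32 {
        #[cfg(compute_70)]
        return unsafe { sm_70::fetch_add_f32_relaxed_device(ptr, val) };
        #[cfg(not(compute_70))]
        return unsafe { emulated::fetch_add_f32_relaxed_device(ptr, val) };
    }
}
```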
High level types

And finally, we expose high level types in `cuda_std::atomic` such as `AtomicF32`, `AtomicF64`, etc.

Block atomics (`BlockAtomicF32`) will need to be unsafe (see the sketch after the list below). For device atomics, it is up to the caller of the kernel to ensure buffers and kernels do not contain data races, and the same contract covers system atomics. Block atomics do not have that guarantee, however: it would be very easy to accidentally cause data races if the accesses are not intra-threadblock.

Atomic types will expose only the operations that they specifically allow, for example, per the ISA spec:
- `fetch_and`, `fetch_or`, `fetch_xor`, `compare_and_swap`, and `exchange` (bit types).
- `fetch_add`, `fetch_inc`, `fetch_dec`, `fetch_min`, and `fetch_max` (integer types).
- `fetch_inc` and `fetch_add` that clamp to `[0..b]` (unsure if this means `0..MAX` or something else).
- `fetch_add` (floats).
Compatibility with core atomics
Core exposes atomics with a couple of things: the `Atomic*` types in `core::sync::atomic`, which the codegen backend lowers through intrinsics such as `atomic_load`, in addition to `atomic_store`, `atomic_rmw`, `atomic_cmpxchg`, and a couple more. We currently trap in all of these functions, partly because libnvvm doesn't support atomic instructions for many types, and partly because we want to bikeshed how to implement them nicely.
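For illustration, this is the kind of ordinary core-atomics code that currently hits those traps when compiled for the GPU (names are mine, the pattern is not):

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// A shared counter bumped by every thread. On the CPU this is fine; compiled
// through rustc_codegen_nvvm today, the fetch_add lowers to atomic_rmw, which
// currently traps.
pub fn count_hits(counter: &AtomicUsize, hit: bool) {
    if hit {
        counter.fetch_add(1, Ordering::Relaxed);
    }
}
```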
However, as expected, things are not quite the same on the CPU and the GPU; there are some very important differences:

- The GPU has no native `fetch_nand`. We could implement it as a CAS loop, but that's a bit of an opaque behavior, so I'm not too happy to do that.

Because of these limitations, we have a few options for implementing atomics:
- Make `AtomicF32` and `AtomicF64` different types in cuda_std. Add block and system atomics as their own types in `cuda_std::atomic`. This maintains compat with core but splits up atomic types, which is not ideal.
- Put everything in `cuda_std::atomic` and add only the methods that CUDA natively supports without CAS loops. Don't try to make the atomics work on the CPU. This is easiest and has the nicest API, but doesn't work on the CPU.

Implementation Roadmap
Atomics will likely be implemented incrementally; most of the work is transferring over the raw intrinsics. After that, the hard part is done and we can just focus on the stable public API.
Device float atomics will be first, since they are by far the most used kind of intrinsic. After that, the order will probably follow:
Integer Device Atomics -> Float System Atomics -> Integer System Atomics -> Float Block Atomics -> Integer Block Atomics -> Anything that's missing
Feedback
I'd love to hear any feedback you have! We must make sure this is implemented once and implemented correctly.