Open tkf opened 2 years ago
I remembered that the CppCon 2019 talk by Olivier Giroux, “The One-Decade Task: Putting std::atomic in CUDA”, was a good one on how Nvidia tackled the memory model. Just re-skimming the talk, it does look like proper ordering semantics did not exist before libcu++.
I hadn't added these because I'm currently unfamiliar with the intricacies of memory ordering. I guess I'll have to investigate at some point...
> I noticed atomics.jl is using [...]

It's used here: https://github.com/JuliaGPU/CUDA.jl/blob/55ed09930082bede753d49e455e4256af37277e3/src/device/intrinsics/atomics.jl#L43-L45
It looks like the PTX ISA supports various orderings via instruction qualifiers (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#operation-types), but I'm not sure NVPTX exposes these.
If you just want to implement what `Base.@atomic` does, you can just forward the ordering to LLVM (except `unordered`), just like `Base` does. Since the orderings directly correspond to C++'s, maybe you can compare the output with libcu++'s.
The PTX ISA documentation is interesting! ...and it's confusing that they consider `.relaxed` a "strong" operation. Relaxed ordering in C/C++ is supposed to do no synchronization on its own (which also makes its scope-awareness confusing), so I don't understand why they need `.weak`. I wonder if it is still C++20-memory-model compliant to compile Julia-level `:monotonic` (and similarly C++'s `relaxed`) down to `.weak` in PTX. Or maybe using PTX `.weak` as C++ `relaxed` requires stronger fences or something...? Atomics on CUDA do look extra tricky to me.
Is your feature request related to a problem? Please describe.
CUDA.jl doesn't provide APIs to specify the memory ordering of atomic operations. In particular, it would be nice to be able to use the monotonic (aka relaxed) ordering for applications like histograms and gradient descent.
Describe the solution you'd like
Provide the full set of Julia 1.7 atomic orderings.
Describe alternatives you've considered
Using monotonic always in `CUDA.atomic_*` and `CUDA.@atomic` may actually be a decent solution, given the dominant applications in Julia. Other orderings don't matter unless you want to do very intricate concurrent programming. However, it would be nice to align the syntax with `Base` to facilitate writing code that is usable on both GPU and CPU.

Additional context
I noticed atomics.jl is using
https://github.com/JuliaGPU/CUDA.jl/blob/55ed09930082bede753d49e455e4256af37277e3/src/device/intrinsics/atomics.jl#L13-L17
by default, but I couldn't find an API to set the ordering.
I'm also a bit confused about the atomics situation on CUDA in general. Looking at the CUDA C++ Programming Guide, B.14. Atomic Functions doesn't mention orderings: indeed, functions like `atomicCAS` and `atomicAdd` don't seem to take an ordering as an argument. So I wonder if they are all monotonic?

On the other hand, https://github.com/NVIDIA/libcudacxx does seem to provide C++'s std atomics interface. I can see that different assemblies are generated with different orderings:
- `cuda::std::atomic::compare_exchange_weak`: https://godbolt.org/z/nYqKx6WE8
- `cuda::std::atomic::fetch_add`, `atomicAdd`, `atomicAdd_system`: https://godbolt.org/z/o6areY84z (I'm a bit puzzled that `std::atomic` and `atomicAdd` produce different instructions)

So I guess this means the hardware actually supports different orderings? (But they hadn't been exposed to the programmer before libcu++?)
Anyway, I think it'd be nice if we had access to these instructions. I think it's important to at least provide the monotonic ordering, which is what you need for a "reduction" resulting in a large object like a histogram.