JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Memory ordering APIs for atomic operations #1353

Open tkf opened 2 years ago

tkf commented 2 years ago

Is your feature request related to a problem? Please describe.

CUDA.jl doesn't provide APIs to specify the memory ordering of atomic operations. In particular, it would be nice to be able to use monotonic (aka relaxed) ordering for applications like histograms and gradient descent.
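For concreteness, here is a minimal sketch of the histogram use case (the kernel and names are hypothetical, not an existing API): every thread only increments counters and no other memory is published through the atomic, so a relaxed/monotonic RMW would suffice, yet `CUDA.@atomic` currently offers no way to request it.

```julia
using CUDA

# Hypothetical histogram kernel: only the final counts matter, so a
# :monotonic increment would be enough.
function hist_kernel!(bins, data)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(data)
        CUDA.@atomic bins[data[i]] += Int32(1)  # no way to request :monotonic
    end
    return
end

bins = CUDA.zeros(Int32, 16)
data = CuArray(rand(Int32(1):Int32(16), 1024))
@cuda threads=256 blocks=4 hist_kernel!(bins, data)
```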

Describe the solution you'd like

Provide the full set of Julia 1.7 atomic orderings.
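For reference, this is the Base API (real Julia 1.7 syntax) that the device-side atomics would mirror; orderings are passed as symbols:

```julia
# Julia 1.7 field atomics take an explicit ordering symbol; the full set is
# :not_atomic, :unordered, :monotonic, :acquire, :release, :acquire_release,
# and :sequentially_consistent.
mutable struct Hits
    @atomic n::Int
end

h = Hits(0)
@atomic :monotonic h.n += 1                 # relaxed read-modify-write
@atomic :sequentially_consistent h.n += 1   # also the default when omitted
x = @atomic :acquire h.n                    # atomic load with acquire ordering
```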

Describe alternatives you've considered

Always using monotonic in CUDA.atomic_* and CUDA.@atomic may actually be a decent solution, given the dominant applications in Julia; other orderings don't matter unless you want to do very intricate concurrent programming. However, it would be nice to align the syntax with Base, to facilitate writing code that is usable on both GPU and CPU.

Additional context

I noticed atomics.jl is using

https://github.com/JuliaGPU/CUDA.jl/blob/55ed09930082bede753d49e455e4256af37277e3/src/device/intrinsics/atomics.jl#L13-L17

by default but I couldn't find the API to set the ordering.

I'm also a bit confused about the atomics situation on CUDA in general. Looking at the CUDA C++ Programming Guide, B.14. Atomic Functions says:

Atomic functions do not act as memory fences and do not imply synchronization or ordering constraints for memory operations (see Memory Fence Functions for more details on memory fences). Atomic functions can only be used in device functions.

Indeed, functions like atomicCAS and atomicAdd don't seem to take ordering as arguments. So, I wonder if they are all monotonic?
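The same holds at the CUDA.jl level today: the low-level intrinsics mirror those C functions and take no ordering argument either. A minimal sketch:

```julia
using CUDA

function add_one!(a)
    # atomic_add! mirrors C's atomicAdd: there is no ordering parameter.
    CUDA.atomic_add!(pointer(a, 1), 1f0)
    return
end

a = CUDA.zeros(Float32, 1)
@cuda threads=32 add_one!(a)
```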

On the other hand, https://github.com/NVIDIA/libcudacxx does seem to try to provide C++'s std::atomic interface, and I can see that different assembly is generated for different orderings.

So I guess that means the hardware actually supports different orderings? (They just weren't exposed to the programmer before libcu++?)

Anyway, I think it'd be nice if we had access to these instructions. At the very least it's important to provide monotonic ordering, which is what you need for a "reduction" into a large object like a histogram.

tkf commented 2 years ago

I remembered that CppCon 2019: Olivier Giroux “The One-Decade Task: Putting std::atomic in CUDA.” - YouTube was a good talk on how Nvidia tackled the memory model. Just re-skimming the talk, it does look like proper ordering semantics did not exist before libcu++.

maleadt commented 2 years ago

I hadn't added these because I'm currently unfamiliar with the intricacies of memory ordering. I guess I'll have to investigate at some point...

I noticed atomics.jl is using

https://github.com/JuliaGPU/CUDA.jl/blob/55ed09930082bede753d49e455e4256af37277e3/src/device/intrinsics/atomics.jl#L13-L17

It's used here: https://github.com/JuliaGPU/CUDA.jl/blob/55ed09930082bede753d49e455e4256af37277e3/src/device/intrinsics/atomics.jl#L43-L45

It looks like the PTX ISA supports various orderings via instruction qualifiers, https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#operation-types, but I'm not sure NVPTX exposes these.
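One way to check is to look at the PTX that NVPTX actually emits for a Julia atomic; `CUDA.@device_code_ptx` is CUDA.jl's reflection macro for that (the kernel below is just an illustrative example):

```julia
using CUDA

function k!(a)
    CUDA.@atomic a[1] += 1f0
    return
end

# Compile without launching and print the PTX, to see which (if any)
# ordering qualifiers NVPTX attaches to the atomic instruction.
CUDA.@device_code_ptx @cuda launch=false k!(CUDA.zeros(Float32, 1))
```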

tkf commented 2 years ago

If you just want to implement what Base.@atomic does, you can forward the ordering to LLVM (except unordered), just like Base does. Since the Julia orderings directly correspond to C++'s, maybe you can compare the generated code with libcu++'s output.
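As a rough illustration of what forwarding the ordering to LLVM means (a sketch only: the function name is made up, and it uses pre-opaque-pointer LLVM IR syntax, where a `Ptr` argument arrives as an `i64`):

```julia
# Hand-written llvmcall emitting an atomicrmw with an explicit ordering,
# similar in spirit to how Base lowers :monotonic.
atomic_add_monotonic!(ptr::Ptr{Int32}, val::Int32) =
    Base.llvmcall("""
        %ptr = inttoptr i64 %0 to i32*
        %rv = atomicrmw add i32* %ptr, i32 %1 monotonic
        ret i32 %rv
        """, Int32, Tuple{Ptr{Int32},Int32}, ptr, val)
```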

The PTX ISA documentation is interesting! ...and it's confusing that they consider .relaxed a "strong" operation. Relaxed ordering in C/C++ is supposed to do no synchronization on its own (so making it scope-aware is also confusing), and I don't understand why they need .weak. I wonder if it is still compliant with the C++20 memory model to compile Julia-level :monotonic (and likewise C++'s relaxed) down to .weak in PTX. Or maybe using PTX .weak as C++ relaxed requires stronger fences or something...? Atomics on CUDA look extra tricky to me.