No, these aren't supported yet, as even LLVM doesn't have the necessary intrinsics: https://reviews.llvm.org/D100394.
CUDA C/C++ implements these with inline assembly.
Thank you both. Is there a pointer to the CUDA reference, Valentin?
Look for `cp.async` in `$CUDA/targets/x86_64-linux/include/cuda/std/barrier`.
Thank you!!
With Hopper, this is becoming more important to target.
NVVM seems to have intrinsics for asynchronous copy; would this help?
> NVVM seems to have intrinsics for asynchronous copy; would this help?
That's not NVVM, it's LLVM proper (implementing NVVM's intrinsics), so yes that would help. It also seems to be supported from LLVM 13 onwards, so we could support this on Julia 1.8+.
@maleadt Any pointers on where to start?
Calling these intrinsics is easy: you can generally derive the names from the tablegen entries (see e.g. the tests, https://github.com/llvm/llvm-project/blob/ff0aabf14dfa0a369c38f6e267c56a83ee48d647/llvm/test/CodeGen/NVPTX/async-copy.ll#L37-L40), and simply `ccall` them like we do for many intrinsics already, e.g. https://github.com/JuliaGPU/CUDA.jl/blob/5c51766d0a9e7819ea79f314e37ed6a8a5d24369/src/device/intrinsics/misc.jl#L8.
The tricky part is figuring out a good abstraction in Julia that exposes these intrinsics. Take a look at what CUDA C does, or whether the CUDA C++ interface has something more high-level.
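For reference, the pattern is just this (a minimal sketch; the wrapper name here is mine, while the intrinsic name comes from the tablegen tests linked above):

```julia
# Sketch: ccall an LLVM intrinsic by name, as CUDA.jl already does for
# other device intrinsics.
@inline cp_async_commit() =
    ccall("llvm.nvvm.cp.async.commit.group", llvmcall, Cvoid, ())
```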
I was reading the NVIDIA docs about async copy, and it seems LLVM only supports a small portion of the instructions? Here is some code translated from async-copy.ll:
```julia
# Translated from LLVM's async-copy.ll test; LLVMPtr is Core.LLVMPtr and
# AS is CUDA.jl's address-space module.
@inline llvm_nvvm_cp_async_wait_group(i::Int32) =
    ccall("llvm.nvvm.cp.async.wait.group", llvmcall, Cvoid, (Int32,), i)
@inline llvm_nvvm_cp_async_wait_all() =
    ccall("llvm.nvvm.cp.async.wait.all", llvmcall, Cvoid, ())
@inline llvm_nvvm_cp_async_commit_group() =
    ccall("llvm.nvvm.cp.async.commit.group", llvmcall, Cvoid, ())
@inline llvm_nvvm_cp_async_mbarrier_arrive(ptr::LLVMPtr{T,A}) where {T,A} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive", llvmcall, Cvoid, (LLVMPtr{T,A},), ptr)
@inline llvm_nvvm_cp_async_mbarrier_arrive_shared(ptr::LLVMPtr{T,AS.Shared}) where {T} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive.shared", llvmcall, Cvoid, (LLVMPtr{T,AS.Shared},), ptr)
@inline llvm_nvvm_cp_async_mbarrier_arrive_noinc(ptr::LLVMPtr{T,A}) where {T,A} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive.noinc", llvmcall, Cvoid, (LLVMPtr{T,A},), ptr)
@inline llvm_nvvm_cp_async_mbarrier_arrive_noinc_shared(ptr::LLVMPtr{T,AS.Shared}) where {T} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive.noinc.shared", llvmcall, Cvoid, (LLVMPtr{T,AS.Shared},), ptr)
# 4-, 8- and 16-byte copies from global to shared memory (ca = cache at all
# levels, cg = cache at global level; cg only supports 16-byte transfers).
@inline llvm_nvvm_cp_async_ca_shared_global_4(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.ca.shared.global.4", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
@inline llvm_nvvm_cp_async_ca_shared_global_8(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.ca.shared.global.8", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
@inline llvm_nvvm_cp_async_ca_shared_global_16(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.ca.shared.global.16", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
@inline llvm_nvvm_cp_async_cg_shared_global_16(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.cg.shared.global.16", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
```
I have no idea how to use those `mbarrier` instructions. But for the other part, the `llvm_nvvm_cp_async_ca_shared_global_<n>` functions could be merged into an `async_copyto!` function, and the commit/wait functions are just single function calls. If these are the only functions we can use right now, I don't think there is much we can do for a more high-level abstraction in Julia.
What pieces are missing? You can always use inline assembly for those; grep for `asmcall` in CUDA.jl.
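For what it's worth, a rough sketch of the inline-assembly route, modeled on CUDA.jl's existing uses of `@asmcall`; the constraint string, the `cp_async_16_zfill` name, and passing `LLVMPtr`s straight through to PTX operands are untested assumptions:

```julia
using LLVM.Interop: @asmcall
using CUDA: AS
using Core: LLVMPtr

# Sketch: the src-size variant of cp.async (not covered by the LLVM
# intrinsics). Copies 16 bytes into shared memory, reading only `srcsize`
# bytes from global memory and zero-filling the remainder.
@inline cp_async_16_zfill(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global},
                          srcsize::Int32) where {T} =
    @asmcall("cp.async.ca.shared.global [\$0], [\$1], 16, \$2;",
             "l,l,r,~{memory}", true, Cvoid,
             Tuple{LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}, Int32},
             dst, src, srcsize)
```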
And FWIW, wrappers like the ones you added above aren't very useful; the idea is indeed to make something high-level like `async_copyto!` and have that `ccall` directly, with no need for functions that directly map onto intrinsics.
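Something like the following, say (the name and the size-based dispatch are illustrative, not a settled API; the branches fold away at compile time since `sizeof(T)` is a constant):

```julia
using CUDA: AS
using Core: LLVMPtr

# Hypothetical high-level wrapper: pick the cp.async variant from the
# element size and ccall the intrinsic directly, with no 1:1 wrappers.
@inline function async_copyto!(dst::LLVMPtr{T,AS.Shared},
                               src::LLVMPtr{T,AS.Global}) where {T}
    if sizeof(T) == 4
        ccall("llvm.nvvm.cp.async.ca.shared.global.4", llvmcall, Cvoid,
              (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
    elseif sizeof(T) == 8
        ccall("llvm.nvvm.cp.async.ca.shared.global.8", llvmcall, Cvoid,
              (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
    elseif sizeof(T) == 16
        ccall("llvm.nvvm.cp.async.ca.shared.global.16", llvmcall, Cvoid,
              (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
    else
        error("cp.async only supports 4-, 8- and 16-byte transfers")
    end
    return dst
end
```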
> What pieces are missing?
According to the docs, there are some extra arguments to `cp.async.ca.shared.global`, like `ignore-src`.
> the idea is indeed to make something high-level like `async_copyto!` and have that `ccall` directly; no need for functions that directly map onto intrinsics.
I was imitating what `device/intrinsics/wmma.jl` does. And I think it would be better to somehow expose the `llvm_nvvm_cp_async_ca_shared_global_<n>` functions, because they can be used to do a kind of vectorized copy depending on the bit size. I'm not sure if we have any commonly used abstraction in Julia for vectorized copies.
`memcpy_async` is implemented now.
I'm writing a 3D electromagnetics FDTD code in CUDA, and it's been a lot of fun leaning on Julia's code generation utilities combined with CUDA.jl. Thank you for all the amazing work; this definitely feels like a step forward from PyCUDA (which I previously used for a similar project).
I'm pretty sure you get this a lot (i.e. is X CUDA API exposed in CUDA.jl?), so I'm sorry if the answer is obvious. That said...
Is the `memcpy_async()` API exposed in CUDA.jl (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memcpy-async-api)? It allows one to overlap compute and memory transfers explicitly, and is looking to be an important optimization in my code (which is heavily memory bound). Critically, use of the API can alleviate register pressure and move data directly from global to shared memory (without going through registers first). Thank you!
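For context, the overlap pattern that API enables looks roughly like the double-buffered loop below, sketched with the hypothetical `llvm_nvvm_cp_async_*` wrappers shown earlier in this thread; this is not CUDA.jl's actual API. One `Float32` per thread per tile, launched with `shmem = 2 * threads * sizeof(Float32)`:

```julia
using CUDA

# Prefetch tile i+1 into one shared-memory buffer while computing on tile i.
function pipelined_sum(out, gmem, ntiles)
    tile = blockDim().x
    smem = CuDynamicSharedArray(Float32, (tile, 2))
    t = threadIdx().x

    # preload the first tile into buffer 1
    llvm_nvvm_cp_async_ca_shared_global_4(pointer(smem, t), pointer(gmem, t))
    llvm_nvvm_cp_async_commit_group()

    acc = 0f0
    for i in 1:ntiles
        cur = mod1(i, 2)
        if i < ntiles
            nxt = mod1(i + 1, 2)
            # start fetching the next tile into the other buffer
            llvm_nvvm_cp_async_ca_shared_global_4(
                pointer(smem, (nxt - 1) * tile + t),
                pointer(gmem, i * tile + t))
            llvm_nvvm_cp_async_commit_group()
            llvm_nvvm_cp_async_wait_group(Int32(1))  # tile i's copy has landed
        else
            llvm_nvvm_cp_async_wait_all()            # last tile: drain all copies
        end
        sync_threads()
        acc += smem[t, cur]   # "compute" overlaps the in-flight prefetch
        sync_threads()
    end
    out[t] = acc
    return
end
```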