No, these aren't supported yet, as even LLVM doesn't have the necessary intrinsics: https://reviews.llvm.org/D100394.
CUDA C/C++ implements these with inline assembly.
Thank you both. Is there a pointer to the CUDA reference, Valentin?
Look for `cp.async` in `$CUDA/targets/x86_64-linux/include/cuda/std/barrier`.
Thank you!!
With Hopper, this is becoming more important to target.
NVVM seems to have intrinsics for asynchronous copy; would this help?
> NVVM seems to have intrinsics for asynchronous copy; would this help?
That's not NVVM, it's LLVM proper (implementing NVVM's intrinsics), so yes that would help. It also seems to be supported from LLVM 13 onwards, so we could support this on Julia 1.8+.
@maleadt Any pointers on where to start?
Calling these intrinsics is easy: you can generally derive the names from the tablegen entries (see e.g. the tests, https://github.com/llvm/llvm-project/blob/ff0aabf14dfa0a369c38f6e267c56a83ee48d647/llvm/test/CodeGen/NVPTX/async-copy.ll#L37-L40), and simply `ccall` them like we do for many intrinsics already, e.g. https://github.com/JuliaGPU/CUDA.jl/blob/5c51766d0a9e7819ea79f314e37ed6a8a5d24369/src/device/intrinsics/misc.jl#L8.
The tricky part is figuring out a good abstraction in Julia that exposes these intrinsics. Take a look at what CUDA C does, or whether the CUDA C++ interface has something more high-level.
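For reference, the pattern is just this (a minimal sketch; the wrapper name here is mine, while the intrinsic name comes from the tablegen tests linked above):

```julia
# Sketch: ccall an LLVM intrinsic by name, as CUDA.jl already does for
# other device intrinsics.
@inline cp_async_commit() =
    ccall("llvm.nvvm.cp.async.commit.group", llvmcall, Cvoid, ())
```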
I was reading the NVIDIA docs about async copy, and it seems LLVM only supports a small portion of the instructions? Here is some code translated from async-copy.ll:
```julia
# Translated from LLVM's async-copy.ll test; LLVMPtr is Core.LLVMPtr and
# AS is CUDA.jl's address-space module.
@inline llvm_nvvm_cp_async_wait_group(i::Int32) =
    ccall("llvm.nvvm.cp.async.wait.group", llvmcall, Cvoid, (Int32,), i)
@inline llvm_nvvm_cp_async_wait_all() =
    ccall("llvm.nvvm.cp.async.wait.all", llvmcall, Cvoid, ())
@inline llvm_nvvm_cp_async_commit_group() =
    ccall("llvm.nvvm.cp.async.commit.group", llvmcall, Cvoid, ())
@inline llvm_nvvm_cp_async_mbarrier_arrive(ptr::LLVMPtr{T,A}) where {T,A} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive", llvmcall, Cvoid, (LLVMPtr{T,A},), ptr)
@inline llvm_nvvm_cp_async_mbarrier_arrive_shared(ptr::LLVMPtr{T,AS.Shared}) where {T} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive.shared", llvmcall, Cvoid, (LLVMPtr{T,AS.Shared},), ptr)
@inline llvm_nvvm_cp_async_mbarrier_arrive_noinc(ptr::LLVMPtr{T,A}) where {T,A} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive.noinc", llvmcall, Cvoid, (LLVMPtr{T,A},), ptr)
@inline llvm_nvvm_cp_async_mbarrier_arrive_noinc_shared(ptr::LLVMPtr{T,AS.Shared}) where {T} =
    ccall("llvm.nvvm.cp.async.mbarrier.arrive.noinc.shared", llvmcall, Cvoid, (LLVMPtr{T,AS.Shared},), ptr)
# 4-, 8- and 16-byte copies from global to shared memory (ca = cache at all
# levels, cg = cache at global level; cg only supports 16-byte transfers).
@inline llvm_nvvm_cp_async_ca_shared_global_4(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.ca.shared.global.4", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
@inline llvm_nvvm_cp_async_ca_shared_global_8(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.ca.shared.global.8", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
@inline llvm_nvvm_cp_async_ca_shared_global_16(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.ca.shared.global.16", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
@inline llvm_nvvm_cp_async_cg_shared_global_16(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global}) where {T} =
    ccall("llvm.nvvm.cp.async.cg.shared.global.16", llvmcall, Cvoid,
          (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
```
I have no idea how to use those `mbarrier` instructions. But for the other part, the `llvm_nvvm_cp_async_ca_shared_global_<n>` functions could be merged into an `async_copyto!` function, and the commit/wait functions are just single function calls. If these are the only functions we can use right now, I don't think there is much we can do for a more high-level abstraction in Julia.
What pieces are missing? You can always use inline assembly for those; grep for `asmcall` in CUDA.jl.
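For what it's worth, a rough sketch of the inline-assembly route, modeled on CUDA.jl's existing uses of `@asmcall`; the constraint string, the `cp_async_16_zfill` name, and passing `LLVMPtr`s straight through to PTX operands are untested assumptions:

```julia
using LLVM.Interop: @asmcall
using CUDA: AS
using Core: LLVMPtr

# Sketch: the src-size variant of cp.async (not covered by the LLVM
# intrinsics). Copies 16 bytes into shared memory, reading only `srcsize`
# bytes from global memory and zero-filling the remainder.
@inline cp_async_16_zfill(dst::LLVMPtr{T,AS.Shared}, src::LLVMPtr{T,AS.Global},
                          srcsize::Int32) where {T} =
    @asmcall("cp.async.ca.shared.global [\$0], [\$1], 16, \$2;",
             "l,l,r,~{memory}", true, Cvoid,
             Tuple{LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}, Int32},
             dst, src, srcsize)
```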
And FWIW, wrappers like the ones you added above aren't very useful; the idea is indeed to make something high-level like `async_copyto!` and have that `ccall` directly, with no need for functions that directly map onto intrinsics.
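Something like the following, say (the name and the size-based dispatch are illustrative, not a settled API; the branches fold away at compile time since `sizeof(T)` is a constant):

```julia
using CUDA: AS
using Core: LLVMPtr

# Hypothetical high-level wrapper: pick the cp.async variant from the
# element size and ccall the intrinsic directly, with no 1:1 wrappers.
@inline function async_copyto!(dst::LLVMPtr{T,AS.Shared},
                               src::LLVMPtr{T,AS.Global}) where {T}
    if sizeof(T) == 4
        ccall("llvm.nvvm.cp.async.ca.shared.global.4", llvmcall, Cvoid,
              (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
    elseif sizeof(T) == 8
        ccall("llvm.nvvm.cp.async.ca.shared.global.8", llvmcall, Cvoid,
              (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
    elseif sizeof(T) == 16
        ccall("llvm.nvvm.cp.async.ca.shared.global.16", llvmcall, Cvoid,
              (LLVMPtr{T,AS.Shared}, LLVMPtr{T,AS.Global}), dst, src)
    else
        error("cp.async only supports 4-, 8- and 16-byte transfers")
    end
    return dst
end
```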
> What pieces are missing?
According to the docs, there are some extra arguments to `cp.async.ca.shared.global`, like `ignore-src`.
> the idea is indeed to make something high-level like `async_copyto!` and have that `ccall` directly; no need for functions that directly map onto intrinsics.
I was imitating what `device/intrinsics/wmma.jl` does. And I think it would be better to somehow expose the `llvm_nvvm_cp_async_ca_shared_global_<n>` functions, because they can be used to do a kind of vectorized copy depending on the bit size. I'm not sure if we have any commonly used abstraction in Julia for vectorized copies.
`memcpy_async` is implemented now.
I'm writing a 3D electromagnetics FDTD code in CUDA, and it's been a lot of fun leaning on Julia's code generation utilities combined with CUDA.jl. Thank you for all the amazing work; this definitely feels like a step forward from PyCUDA (which I previously used for a similar project).
I'm pretty sure you get this a lot (i.e. is X CUDA API exposed in CUDA.jl?), so I'm sorry if the answer is obvious. That said...
Is the `memcpy_async()` API exposed in CUDA.jl (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memcpy-async-api)? It allows one to overlap compute and memory transfers explicitly, and is looking to be an important optimization in my code (which is heavily memory bound). Critically, use of the API can alleviate register pressure and move data directly from global to shared memory (without going through registers first). Thank you!
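For context, the overlap pattern that API enables looks roughly like the double-buffered loop below, sketched with the hypothetical `llvm_nvvm_cp_async_*` wrappers shown earlier in this thread; this is not CUDA.jl's actual API. One `Float32` per thread per tile, launched with `shmem = 2 * threads * sizeof(Float32)`:

```julia
using CUDA

# Prefetch tile i+1 into one shared-memory buffer while computing on tile i.
function pipelined_sum(out, gmem, ntiles)
    tile = blockDim().x
    smem = CuDynamicSharedArray(Float32, (tile, 2))
    t = threadIdx().x

    # preload the first tile into buffer 1
    llvm_nvvm_cp_async_ca_shared_global_4(pointer(smem, t), pointer(gmem, t))
    llvm_nvvm_cp_async_commit_group()

    acc = 0f0
    for i in 1:ntiles
        cur = mod1(i, 2)
        if i < ntiles
            nxt = mod1(i + 1, 2)
            # start fetching the next tile into the other buffer
            llvm_nvvm_cp_async_ca_shared_global_4(
                pointer(smem, (nxt - 1) * tile + t),
                pointer(gmem, i * tile + t))
            llvm_nvvm_cp_async_commit_group()
            llvm_nvvm_cp_async_wait_group(Int32(1))  # tile i's copy has landed
        else
            llvm_nvvm_cp_async_wait_all()            # last tile: drain all copies
        end
        sync_threads()
        acc += smem[t, cur]   # "compute" overlaps the in-flight prefetch
        sync_threads()
    end
    out[t] = acc
    return
end
```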