Open leofang opened 3 weeks ago
(Tentatively assigned to Federico as per our offline discussion 🙂)
This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).
cc: @kmaehashi for visibility
> This routine should be JIT-compilable (by NVRTC)
Can you elaborate on how you envision this would work? This is necessarily a host API, and NVRTC can't compile host code.
> This is necessarily a host API, and NVRTC can't compile host code.
We have a C library now, don't we? 🙂
@jrhemstad @gevtushenko Correct me if I am wrong, since I am not fluent enough in mdspan: given that shape, strides, and dtype are all run-time properties in Python, if this were a host API we would have had to pre-instantiate a huge number of copy-kernel instances, and even then we would not cover all possibilities. Therefore, I feel NVRTC compatibility (which is a requirement of the C library anyway) is necessary.
Another reason for NVRTC compatibility: to unblock nvmath-python, I think we should focus on D2D copies (between two potentially different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use cudaMemcpyAsync with a staging buffer) and is already what CuPy does for us today. And I presume a D2D copy can be achieved by a single kernel compiled by NVRTC.
> We have a C library now, don't we?
So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations".
By "NVRTC compatible" I understood you wanted it so that someone could take cuda::copy(mdspan, mdspan) and compile it directly with NVRTC on their own. That wouldn't be feasible, any more than it is for someone to compile cub::DeviceReduce with NVRTC on their own.
I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm.
What I originally had in mind was really just a kernel: I wanted to do the pre-/post-processing as well as the kernel compilation/launch myself, but I had forgotten that this does not fit the compute paradigm anywhere in CCCL. Thanks for the clarifying questions.
Is this a duplicate?
Area
libcu++
Is your feature request related to a problem? Please describe.
CUDA Python & nvmath-python need a copy routine added to CCCL for copying from one mdspan to another. The requirements for this copy routine include: supporting mdspans covering either host or device tensors, so that H2D/D2H copies can be abstracted away behind the same API.

This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).
We believe that if src and dst do not overlap, and both reside on the device, there might be existing implementations in cuTENSOR (e.g., cutensorPermute) on which we could base a prototype. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.
Describe the solution you'd like
Not sure what's the best solution, so just a thought: perhaps offer an overload of cuda::std::copy that is specialized for mdspan?

Describe alternatives you've considered
No response
Additional context
Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.