NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: Add a copy routine to support data copy between two `mdspan`s #2306

Open leofang opened 3 weeks ago

leofang commented 3 weeks ago

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

CUDA Python & nvmath-python need to have a copy routine added to CCCL for copying from one mdspan to another. The requirements for this copy routine include:

  1. This routine would copy the contents of ndarray (or N-D tensor) A, with a certain data type, shape, and strides, to ndarray B with the same dtype and shape but not necessarily the same strides.
    • Since the underlying ndarrays are strided, they are not necessarily contiguous in memory and need not share the same memory layout, so a dedicated copy kernel is needed (see the sketch after this list)
  2. This routine can handle mdspans covering either host or device tensors, so that H2D/D2H copies can be abstracted out by the same API.
    • In the case of D2H copies, synchronous copies are fine
  3. This routine should be JIT-compilable (by NVRTC) to serve Python users better
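
A minimal sketch of the kind of dedicated copy kernel requirement 1 implies, using raw element strides rather than any particular mdspan accessor; the kernel name, the rank-2 restriction, and the float dtype are illustrative assumptions, not a proposed CCCL API:

```cpp
// Sketch only: a dedicated strided-to-strided D2D copy kernel.
// Rank is fixed to 2 and dtype to float purely for illustration; a real
// implementation would be generic over rank, dtype, and layout (e.g. via mdspan).
#include <cstddef>

__global__ void strided_copy_2d(const float* src, float* dst,
                                size_t rows, size_t cols,
                                size_t src_stride0, size_t src_stride1,
                                size_t dst_stride0, size_t dst_stride1)
{
    // One element per thread over the flattened 2-D index space.
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx >= rows * cols) return;
    size_t i = idx / cols;
    size_t j = idx % cols;
    // Strides are in elements; neither buffer needs to be contiguous.
    dst[i * dst_stride0 + j * dst_stride1] = src[i * src_stride0 + j * src_stride1];
}
```

Because both sides are addressed through strides, the same kernel covers transposed, sliced, and otherwise non-contiguous views.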

This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).

We believe that if src and dst do not overlap, and if both reside on the device, there may be existing implementations in cuTENSOR (e.g. cutensorPermute) on which a prototype could be based. We can focus on functionality first (right now the copy kernel used in nvmath-python comes from CuPy) and improve performance in future iterations.

Describe the solution you'd like

I'm not sure what the best solution is, so just a thought: perhaps offer an overload of cuda::std::copy that is specialized for mdspan? A possible shape of such an API is sketched below.
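
A hypothetical shape of such an overload, purely as a sketch: the parameter list, the use of cuda::stream_ref, and the placement in namespace cuda are assumptions, not an existing or agreed-upon CCCL signature.

```cpp
// Hypothetical signature sketch only; not an existing CCCL API.
// Copies element-wise from src to dst; the two mdspans must have equal
// extents but may differ in layout (strides). Work is ordered on `stream`.
#include <cuda/std/mdspan>
#include <cuda/stream_ref>

namespace cuda {

template <class SrcElem, class DstElem, class Extents,
          class SrcLayout, class DstLayout,
          class SrcAccessor, class DstAccessor>
void copy(stream_ref stream,
          cuda::std::mdspan<SrcElem, Extents, SrcLayout, SrcAccessor> src,
          cuda::std::mdspan<DstElem, Extents, DstLayout, DstAccessor> dst);

} // namespace cuda
```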

Describe alternatives you've considered

No response

Additional context

Once this routine is offered, a Python abstraction can be built in CUDA Python or elsewhere.

leofang commented 3 weeks ago

(Tentatively assigned to Federico as per our offline discussion πŸ™‚)

leofang commented 3 weeks ago

This is a blocker for nvmath-python to get rid of its mandatory dependency on CuPy (so that CuPy can in turn depend on nvmath-python, without hitting circular dependency issues).

cc: @kmaehashi for vis

jrhemstad commented 2 weeks ago

This routine should be JIT-compilable (by NVRTC)

Can you elaborate on how you envision this would work? This is necessarily a host API, and NVRTC can't compile host code.

leofang commented 1 week ago

This is necessarily a host API, and NVRTC can't compile host code.

We have a C library now, don't we? πŸ™‚

@jrhemstad @gevtushenko Correct me if I am wrong, since I am not fluent enough in mdspan: given that shape, strides, and dtype are all run-time properties in Python, if this were a purely host-compiled API we would have to pre-instantiate a very large number of copy kernel instances, and even then we would not cover all possibilities. Therefore, I feel NVRTC compatibility (which is a requirement of the C library anyway) is necessary.
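
To illustrate the point about run-time dtype/shape/strides, here is a hedged sketch of the JIT path a host-side wrapper could take: the kernel text is specialized for the dtype known only at run time and compiled with NVRTC. Error handling is omitted, and the function name and structure are illustrative, not a CCCL or nvmath-python API.

```cpp
// Illustrative only: JIT-compile a dtype-specialized 2-D strided copy kernel.
#include <nvrtc.h>
#include <string>

std::string jit_strided_copy_ptx(const std::string& dtype)  // e.g. "float", "double"
{
    // Kernel source assembled at run time for the concrete dtype; rank fixed to 2.
    const std::string src =
        "extern \"C\" __global__ void copy2d(const " + dtype + "* src, " + dtype + "* dst,\n"
        "    unsigned long long rows, unsigned long long cols,\n"
        "    unsigned long long ss0, unsigned long long ss1,\n"
        "    unsigned long long ds0, unsigned long long ds1) {\n"
        "  unsigned long long idx = blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;\n"
        "  if (idx >= rows * cols) return;\n"
        "  unsigned long long i = idx / cols, j = idx % cols;\n"
        "  dst[i * ds0 + j * ds1] = src[i * ss0 + j * ss1];\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "copy2d.cu", 0, nullptr, nullptr);

    const char* opts[] = {"--std=c++17"};
    nvrtcCompileProgram(prog, 1, opts);  // error handling omitted for brevity

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;  // load with the CUDA driver API (e.g. cuModuleLoadData) and launch
}
```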

leofang commented 1 week ago

Another reason for NVRTC compatibility: to unblock nvmath-python, I think we should focus on D2D copies (between two potentially different memory layouts) for now, and let nvmath-python handle the remaining H2D/D2H parts, which should be easy (just use cudaMemcpyAsync with a staging buffer) and is already what CuPy does for us today. I presume a D2D copy can be achieved by a single kernel compiled with NVRTC.
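
For completeness, a hedged sketch of the D2H path described above, assuming a strided-to-contiguous D2D kernel like the strided_copy_2d sketch earlier in the thread; the function name and the float/rank-2 restriction are illustrative only.

```cpp
// Illustrative only: D2H copy of a strided device array via a contiguous
// device staging buffer, followed by a plain cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <cstddef>

// Defined in the earlier strided_copy_2d sketch.
__global__ void strided_copy_2d(const float*, float*, size_t, size_t,
                                size_t, size_t, size_t, size_t);

cudaError_t d2h_via_staging(const float* d_src,   // strided device source
                            float* h_dst,         // contiguous host destination
                            size_t rows, size_t cols,
                            size_t src_stride0, size_t src_stride1,
                            cudaStream_t stream)
{
    float* d_staging = nullptr;
    const size_t bytes = rows * cols * sizeof(float);

    cudaError_t err = cudaMallocAsync(reinterpret_cast<void**>(&d_staging), bytes, stream);
    if (err != cudaSuccess) return err;

    // Strided -> contiguous (C-order) D2D copy on the device.
    const size_t n = rows * cols;
    strided_copy_2d<<<static_cast<unsigned int>((n + 255) / 256), 256, 0, stream>>>(
        d_src, d_staging, rows, cols, src_stride0, src_stride1, /*ds0=*/cols, /*ds1=*/1);

    cudaMemcpyAsync(h_dst, d_staging, bytes, cudaMemcpyDeviceToHost, stream);
    cudaFreeAsync(d_staging, stream);
    return cudaStreamSynchronize(stream);  // synchronous D2H is acceptable here
}
```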

jrhemstad commented 1 week ago

We have a C library now, don't we?

So what you really mean is "Provide a solution that doesn't require pre-instantiating a lot of kernels and may internally use NVRTC to JIT compile specific kernel instantiations".

By "NVRTC compatible" I understood you wanted it so someone could take cuda::copy(mdspan, mdspan) and compile it directly with NVRTC on their own. This wouldn't be feasible anymore than it is for someone to try and compile cub::DeviceReduce with NVRTC on their own.

leofang commented 1 week ago

I believe you are right. We should think of this new copy routine as if it were a CUB device-wide algorithm.

What I originally had in mind was really just a kernel: I wanted to do the pre-/post-processing as well as the kernel compilation/launch myself, but I had forgotten that this does not fit any compute paradigm in CCCL. Thanks for the clarifying questions.
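
For reference, the calling convention that analogy implies is CUB's two-phase device-wide pattern, where the first call only queries temporary storage and the second does the work; a copy routine framed this way would be a host API that launches its own kernels (possibly JIT-compiled internally). The snippet below uses the existing cub::DeviceReduce::Sum purely to show the pattern.

```cpp
// The existing CUB device-wide calling convention, shown with DeviceReduce::Sum.
// A cuda::copy(mdspan, mdspan)-style routine would likewise be a host-side entry
// point that launches kernels on the user's behalf.
#include <cub/device/device_reduce.cuh>
#include <cuda_runtime.h>

void sum_example(const int* d_in, int* d_out, int num_items, cudaStream_t stream)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: query temporary storage requirements (no work is done).
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

    // Phase 2: run the device-wide algorithm.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaFreeAsync(d_temp_storage, stream);
}
```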