apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
149 stars 34 forks source link

feat(extensions/nanoarrow_device): Implement asynchronous buffer copying #509

Closed paleolimbot closed 6 days ago

paleolimbot commented 3 weeks ago

This PR implements asychronous buffer copying when copying CUDA buffers. Before this, we had basically been issuing cuMemCopyDtoH/HtoD() a lot of times in a row with a synchronize up front and a synchronize at the end. This was probably not great for performance. Additionally, for copying String/Binary/Large String/Large Binary arrays from CUDA to the CPU, we were issuing very tiny copies on the offsets buffer and synchronizing with the CPU to get the number of bytes to copy for the data buffer.

After this PR, when copying from CPU to CUDA, we will be able to return before the copy is necessarily completed by setting the output sync_event.

When copying from CUDA to CPU, the copy is done in one pass if there are no string/binary arrays, or two passes if there are. When copying string/binary arrays, the implementation walks the entire tree of arrays and issues asynchronous copies for the last offset value. Then, the stream is synchronized with the CPU, and a second set of asynchronous copies are issued for the buffers whose size we now know.

I don't have much experience with CUDA async programming to know if this approach could be simplified (e.g., I do this in two streams, but it might be that one stream is sufficient since perhaps all of the device -> host copies are getting queued against eachother regardless of what stream they're on).

This will be easier to test when (e.g., bigger, non-trivial data) it is wired up to Python.

TODO:

Closes #245.

kkraus14 commented 2 weeks ago

@paleolimbot A couple of higher level thoughts I have here:

paleolimbot commented 2 weeks ago

Thank you for the review!

There are still many TODOs, but it seems like forcing the user to provide a stream to use the helpers makes it easier for nanoarrow to get things right (and simplifies the resource cleanup considerably!).

zeroshade commented 1 week ago

from @kkraus14's comment:

There could be numerous options as far as buffer copying strategies that could be employed here. I.E. copy different buffers on different streams, using a copy kernel, some mix, etc. I don't think it's feasible to capture / implement all of these different options here.

  • Should we allow folks to provide their own implementations for copies? Again, I'm not clear if this is already accomplished here.

I think we should definitely allow folks to provide their own implementations for copies. @paleolimbot how difficult is it for a user to inject their own implementations of the ArrowDevice object or callbacks to use for the copying (instead of using the defaults we are providing in this PR)?

paleolimbot commented 1 week ago

how difficult is it for a user to inject their own implementations of the ArrowDevice object or callbacks to use for the copying

It would probably take another PR or two to get to that point. I think this PR is on the right track, though, in that it implements the asynchronous array copy purely in terms of an asynchronous buffer allocation and an asynchronous buffer copy. The two PRs might be:

For something like cudf that has well-abstracted tools for doing these kinds of things, I might suggest building the buffers separately (e.g., how it currently exports a column to an ArrowArray using the ArrowBufferDeallocator) and using ArrowDeviceArrayInit() to do the final export (as opposed to using this generic array copier). That said, I think the generic array copier is still important (if we want people to build device things, they have to test them/print them somehow).