feat(extensions/nanoarrow_device): Implement asynchronous buffer copying

paleolimbot commented 3 weeks ago

This PR implements asychronous buffer copying when copying CUDA buffers. Before this, we had basically been issuing cuMemCopyDtoH/HtoD() a lot of times in a row with a synchronize up front and a synchronize at the end. This was probably not great for performance. Additionally, for copying String/Binary/Large String/Large Binary arrays from CUDA to the CPU, we were issuing very tiny copies on the offsets buffer and synchronizing with the CPU to get the number of bytes to copy for the data buffer.

After this PR, when copying from CPU to CUDA, we will be able to return before the copy is necessarily completed by setting the output sync_event.

When copying from CUDA to CPU, the copy is done in one pass if there are no string/binary arrays, or two passes if there are. When copying string/binary arrays, the implementation walks the entire tree of arrays and issues asynchronous copies for the last offset value. Then, the stream is synchronized with the CPU, and a second set of asynchronous copies are issued for the buffers whose size we now know.

I don't have much experience with CUDA async programming to know if this approach could be simplified (e.g., I do this in two streams, but it might be that one stream is sufficient since perhaps all of the device -> host copies are getting queued against eachother regardless of what stream they're on).

This will be easier to test when (e.g., bigger, non-trivial data) it is wired up to Python.

TODO:

Implement sync_event integration (both for source and destination)
Test more than just a few string arrays

Closes #245.

kkraus14 commented 2 weeks ago

@paleolimbot A couple of higher level thoughts I have here:

Creating and destroying CUDA streams / events is unfortunately a bit expensive. See here for a benchmark on CUDA event creation and destruction. I've seen CUDA stream / event pools implemented in libraries / frameworks for this reason.
- We may want to make the constructors / destructors pluggable in some way for someone to provide their own pool implementation? I'm not clear if this is already accomplished here.
There could be numerous options as far as buffer copying strategies that could be employed here. I.E. copy different buffers on different streams, using a copy kernel, some mix, etc. I don't think it's feasible to capture / implement all of these different options here.
- Should we allow folks to provide their own implementations for copies? Again, I'm not clear if this is already accomplished here.

paleolimbot commented 2 weeks ago

Thank you for the review!

There are still many TODOs, but it seems like forcing the user to provide a stream to use the helpers makes it easier for nanoarrow to get things right (and simplifies the resource cleanup considerably!).

zeroshade commented 1 week ago

from @kkraus14's comment:

There could be numerous options as far as buffer copying strategies that could be employed here. I.E. copy different buffers on different streams, using a copy kernel, some mix, etc. I don't think it's feasible to capture / implement all of these different options here.

Should we allow folks to provide their own implementations for copies? Again, I'm not clear if this is already accomplished here.

I think we should definitely allow folks to provide their own implementations for copies. @paleolimbot how difficult is it for a user to inject their own implementations of the ArrowDevice object or callbacks to use for the copying (instead of using the defaults we are providing in this PR)?

paleolimbot commented 1 week ago

how difficult is it for a user to inject their own implementations of the ArrowDevice object or callbacks to use for the copying

It would probably take another PR or two to get to that point. I think this PR is on the right track, though, in that it implements the asynchronous array copy purely in terms of an asynchronous buffer allocation and an asynchronous buffer copy. The two PRs might be:

Better separate the allocation and copying steps in ArrowBufferInit()
Let one or more methods accept a custom buffer copier/allocator
Better support splitting up children of a device array (if somebody had more than one stream they had available to use, one might suggest copying children in parallel but right now you might have to do some fudging to get all the pieces in the right place).

For something like cudf that has well-abstracted tools for doing these kinds of things, I might suggest building the buffers separately (e.g., how it currently exports a column to an ArrowArray using the ArrowBufferDeallocator) and using ArrowDeviceArrayInit() to do the final export (as opposed to using this generic array copier). That said, I think the generic array copier is still important (if we want people to build device things, they have to test them/print them somehow).

apache / arrow-nanoarrow

feat(extensions/nanoarrow_device): Implement asynchronous buffer copying #509