[FEA] Multi-buffer copy algorithm

jrhemstad commented 3 years ago

I have N input buffers that I want to copy to N output buffers. I could sequentially call cudaMemcpyAsync N times, but in most cases it would be faster to launch a single kernel that performs all N copies.

I think such a primitive would be a good fit as a CUB algorithm.

I imagine the API would be something like:

template <typename InputBufferIt, typename OutputBufferIt, typename BeginSizeIteratorT, typename EndSizeIteratorT>
BatchMemcpy(void *d_temp_storage, size_t &temp_storage_bytes, InputBufferIt first_input_buffer, InputBufferIt last_input_buffer, BeginSizeIteratorT first_buffer_size, OutputBufferIt first_output_buffer){
   static_assert( std::is_pointer_v< std::iterator_traits<InputBufferIt>::value_type > );
   static_assert( std::is_pointer_v< std::iterator_traits<OutputBufferIt>::value_type > );
...
}

There's some issues with this API I haven't figure out yet:

I don't think the input/output can/should be iterators. Like DeviceSegmentedRadixSort, I think the in/out need to be raw pointers. Otherwise, how do you accept multiple iterators of potentially different types? Make the algorithm variadic? Maybe.
The sizes of each buffer is an iterator to allow using something like aligned_size_t, but how do you specify different alignments for each buffer?

alliepiper commented 3 years ago

My initial thoughts:

I'd go with Memcpy instead of Copy to reinforce that this is a bitwise copy between raw memory segments (as opposed to invoking C++ copy operators, etc).
After thinking about this some more, this isn't a Segmented operation, just a batch operation. The CUB Segmented algorithms all assume a single input buffer broken up into segments, indexed by offset. This uses multiple disjoint buffers with associated sizes. So maybe BatchMemcpy?
Use different template types for the begin/in/output size iterators (this has bitten users in the segmented algorithms, see #229 / #291).
What does output_start_sizes represent? Shouldn't everything about output_buffers be deducible from the other args?

I don't think the input/output can/should be iterators. Like DeviceSegmentedRadixSort, I think the in/out need to be raw pointers. Otherwise, how do you accept multiple iterators of potentially different types? Make the algorithm variadic? Maybe.

I agree that the input/output ranges must be memory buffers and not iterators, but ideally the outer dimension could be an iterator and the inner dimension could just be "pointer-like". For example,

std::vector<thrust::device_pointer<int>> input = ...;
std::vector<thrust::device_pointer<int>> output = ...;
BatchMemcpy(..., input.begin(), ..., output.begin(), ...);

should work ideally. If we do support this, we'll need to make sure that we have a good diagnostic when a buffer isn't convertible to a raw pointer.

The sizes of each buffer is an iterator to allow using something like aligned_size_t, but how do you specify different alignments for each buffer?

I may be missing something, but since this is a bitwise memcpy, I don't think alignment matters. The memcpy implementation should determine the best alignment/word size to use for copying, and break up the copies into appropriate chunks.

jrhemstad commented 3 years ago

I like BatchMemcpy.

Use different template types for the begin/in/output size iterators

Done.

What does output_start_sizes represent?

That was a mistake.

ideally the outer dimension could be an iterator and the inner dimension could just be "pointer-like"

Agreed, I think this is easy enough to static_assert with appropriate traits (is_pointer may not be sufficient for Thrust fancy pointers).

I may be missing something, but since this is a bitwise memcpy, I don't think alignment matters.

It matters for getting good performance. In the worse case, the memcpy has to assume 1B alignment and use 1B load/stores, or introspect the pointers to determine the alignment and decide what size load/stores can be used. Introspecting the pointer can generate a lot of extra code that harms perf, so if you can statically specify the alignment, it is much better for perf.

I've updated the issue description based on your feedback.

alliepiper commented 3 years ago

if you can statically specify the alignment, it is much better for perf.

Makes sense.

how do you specify different alignments for each buffer?

I'm not sure there's a good way to do this. If this is for a static optimization, all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm for each unique set of alignments.

A more feasible compromise might be to add an extra argument that's essentially a std::integral_constant<std::size_t, ALIGN>. ALIGN would specify the alignment of all input/output buffers, and would default to 0 meaning "inspect the pointers". This will require consistent alignments across buffers to achieve the optimization, but would avoid many of the template instantiation issues.

Would that be suitable for your usecase?

alliepiper commented 3 years ago

Alternatively, it might make sense to introduce a tagged pointer type that carries alignment info. It'd still be a headache from a template standpoint, but it would be a nicer interface.

jrhemstad commented 3 years ago

all of the alignments would need to be specified as template parameters. This would be quite a burden, and would require a unique template instantiation of the entire algorithm for each unique set of alignments.

Agreed, that's why I don't think it's really a solvable problem without making the algorithm variadic.

specify the alignment of all input/output buffers

I think this is the only reasonable, non-variadic solution. Though I don't think it requires an extra integral_constant parameter. We can just use cuda::aligned_size_t as the value_type of the Size iterator. Same as what's done for cuda::memcpy_async.

alliepiper commented 3 years ago

We can just use cuda::aligned_size_t as the value_type of the Size iterator.

Good point -- that would be ideal. Since we're adding a libcu++ dependency soon this should be totally doable.

gevtushenko commented 3 years ago

We might consider a generalized version of this API. The original issue looks like this.

It's helpful to have a mapping for ranges within sources and destinations. In this case, we can introduce BatchMemcpyGather and BatchMemcpyScatter facilities.

I suppose a fixed mapping group size per source/destination pair is sufficient. It's equal to 64 bytes for the int32 arrays above.

brycelelbach commented 3 years ago

I'd like to see a few things happen here:

CUB device-level batched data movement kernel (RAPIDS ask).
CUB block-, warp-, and thread-level (a)synchronous batched data movement APIs.
CUB thread-level asynchronous data movement APIs.
- Deploy these within CUB's block- and warp-level synchronous load/store/exchange APIs.
- See #209 for some initial explorations here.
Vectorization for all of the above.

elstehle commented 3 years ago

How do we generally feel about taking an extra parameter (max_total_bytes) that represents an upper bound on the total number of bytes that we expect to be copied (summed over all the buffers' sizes)? This would allow us to request some temp_storage that we could use for load balancing amongst thread blocks.

template <typename InputBufferIt, typename OutputBufferIt, typename BeginSizeIteratorT, typename EndSizeIteratorT, typename OffsetT>
BatchMemcpy(void *d_temp_storage, size_t &temp_storage_bytes, InputBufferIt first_input_buffer, InputBufferIt last_input_buffer, BeginSizeIteratorT first_buffer_size, OutputBufferIt first_output_buffer, OffsetT max_total_bytes){
   static_assert( std::is_pointer_v< std::iterator_traits<InputBufferIt>::value_type > );
   static_assert( std::is_pointer_v< std::iterator_traits<OutputBufferIt>::value_type > );
...
}

Other CUB algorithms currently have num_items as host value. Here we have iterators that can be dereferenced on the device only. In this case, we could compute the temp_storage_bytes based on max_total_bytes.

I expect temp_storage_bytes will be a fraction of the total number of items (e.g., <1% of N). Similarly, we'll be incurring ~1% more memory transfers. I hope that we can get robust runtimes at (close to) peak memory BW for the whole range of batch sizes in exchange.

alliepiper commented 3 years ago

Can you elaborate on what the temp storage is used for in this case?

Could max_total_bytes be optional in case it's not known, or if the user has to handle highly variable loads?

It should be fine to include that as an optimization, but I'd still like to write generic usages where the upper bound is unknown.

jrhemstad commented 3 years ago

Here we have iterators that can be dereferenced on the device only.

Actually, when I first envisioned this API, I was thinking the size iterator would be host accessible. But it's not obvious to me if that's the right decision or not.

elstehle commented 3 years ago

Actually, when I first envisioned this API, I was thinking the size iterator would be host accessible. But it's not obvious to me if that's the right decision or not.

Thanks for clarifying, @jrhemstad. I'm inclined to not make it a requirement that the iterators are accessible from the host as well. Iirc, all iterators in CUB are currently only accessed from the device. I also think that there's use cases where this will be an algorithm that will be called in succession of another algorithm that has previously run on the GPU. If it'd be a requirement to have the size iterator be host-accessible too, then this would imply a cudaDeviceSynchronize between the first algorithm, which was running on the GPU and has generated the buffer sizes as part of its device-side output, and the BatchMemcpy which would now require those sizes to be available on the host. I'd prefer to avoid that.

On another note, I think I have found a viable, load-balanced solution that makes the temp_storage_bytes be linear in the number of buffers rather than linear in the total number of bytes being copied. I'll follow up with the proposal shortly.

elstehle commented 1 year ago

This feature request has been addressed by PR https://github.com/NVIDIA/cub/pull/359 that is now merged.

jakirkham commented 1 year ago

Excited to see this has landed! 🥳

Is the idea still to include this in 2.1.0? If so, when is that release scheduled? Just trying to get an idea for planning purposes. Thanks! 🙏

NVIDIA / cub

[FEA] Multi-buffer copy algorithm #297