Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.
I'll drop some comments here; we can discuss in more detail via VC, or I can expand later:
make_reducible must not allocate, but also re-initializing the padding area (if it is destroyed during the reduction) is not for free. It should be compared with a non-destructive algorithm.

@havogt Answering in order:

Re-initializing the padding is a memcopy call (probably from device, if that is cheaper); for the sum reduction it is even a memset. My intuition is that it should be an order of magnitude cheaper than the loop body plus the reduction itself. Frankly, I don't want to develop my own GPU parallel reduction algorithm here. If you supply me with links to non-destructive algorithms, I can do comparisons.
Actually the reductions from the cuda samples are non-destructive (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/reduction). Which examples did you look at?
Indeed!! I was fooled by the signatures without const modifiers. It changes everything.
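For reference, the non-destructive pattern looks roughly like this (a minimal sketch in the spirit of the cuda samples, not their actual code): the input is taken through a const pointer and the per-block partial sums go to a separate output buffer, so the source data survives.

// Minimal sketch of a non-destructive block sum: the input is only read,
// per-block partial results are written to a separate buffer.
// blockDim.x is assumed to be a power of two.
__global__ void block_sum(double const *__restrict__ in, double *__restrict__ out, int n) {
    extern __shared__ double smem[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    smem[tid] = i < n ? in[i] : 0.; // pad the last block with the identity
    __syncthreads();

    // tree reduction in shared memory; the global input is never modified
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];
}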
Let me see if I understand. Having the reduction on a sid would mean that, if I want to implement a dot product of two sets of values, I need to zip the two sets into a single sid, then apply a map operation to transform the pairs of the zip into scalars, and then reduce. Something like
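(a sketch with hypothetical zip/transform helpers, only to make the composition concrete):

auto zipped = zip(v1, v2);                             // hypothetical: a sid of pairs
auto products = transform(zipped, std::multiplies{});  // hypothetical: map each pair to a scalar
double dot = reduce(products, 0., std::plus{});        // reduction over a single sid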
Otherwise I do not see how to avoid the stencil... and the reduction on a sid does not compose with the stencils anyway, right?
I agree.
For me the most intuitive API would be the following (modulo syntactic sugar):
template<class T, class BinaryOp, class Zipper, class... Sids>
T reduce(T init, BinaryOp binary_op, Zipper zipper, Sids&&... sids);
E.g. for the dot product
reduce(0, std::plus{}, std::multiplies{}, v1, v2);
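A sequential sketch of how such a signature could behave (plain indexable ranges instead of real Sids, purely illustrative):

#include <algorithm>
#include <cstddef>

// Illustrative only: treats the "sids" as equally indexable ranges so that
// the roles of zipper and binary_op are concrete.
template <class T, class BinaryOp, class Zipper, class... Ranges>
T reduce(T init, BinaryOp binary_op, Zipper zipper, Ranges const &...ranges) {
    std::size_t size = std::min({ranges.size()...});
    for (std::size_t i = 0; i != size; ++i)
        init = binary_op(init, zipper(ranges[i]...));
    return init;
}

// e.g. double dot = reduce(0., std::plus{}, std::multiplies{}, v1, v2);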
but then reduce is not a member function of sid, right?
We had a Zoom discussion with @havogt. Here is the outcome:
We agreed that the Sid concept is not related to the reduction task. Sid serves as the input/output of the stencil calculus and represents N-dimensional data with a contiguous memory allocation. Reduction deals with one-dimensional data; no strides are needed to iterate over it. To do an effective parallel reduction we probably need the input to be contiguous (without paddings in the middle) and with its size aligned to a power of two (to some extent).

The API proposed by @havogt will not work as is, because Sids don't hold information about the computation area, or even about the dimensionality of the computation area.

We need a separate concept for the input of the reduction. Let us name it Reducible. We also need to somehow marry stencil and reduction computations in GT. A sufficient requirement for that is to have at least one (template) type that models both Sid and Reducible. I.e. we create that thing, fill it with data using our stencil interface (because it models Sid), and reduce it to a single value (because it models Reducible).
It looks like this solution will cover all reduction use cases that we currently have in mind.
However, ideally we want all our current Sid models (DataStores, python protocol buffers and C-arrays) to model Reducible as well. But this cannot be done for free: the parallel reduction implementation becomes more complex and slower if we have to deal with paddings in between and with a data size that is not aligned to a power of two.
We agreed on the following plan:
1. Implement a type that models both Reducible and Sid, and the cuda reduce implementation using it.
2. Relax the Reducible requirements such that it would be possible for DataStore and friends to model it. Those relaxed Reducibles will use a more complicated and slower cuda reduce algorithm. At this point we can compare the performance of both solutions.

Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.

@fthaler I have had a look at them. For the record:
- The Thrust implementation of the cuda reduction delegates to CUB: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/reduce.h#L945
- The CUB source code is rather verbose; it is full of pre-C++11 template metaprogramming and it is hard to follow what is going on. From my first look (here: https://github.com/NVlabs/cub/blob/main/cub/device/dispatch/dispatch_reduce.cuh#L71 and here: https://github.com/NVlabs/cub/blob/main/cub/agent/agent_reduce.cuh#L100) it doesn't use warp level intrinsics like the most advanced reduce kernel from the cuda samples does.
CUB actually does use intrinsics and even inline-PTX, for example here: https://github.com/NVlabs/cub/blob/618a46c27764f0e0b86fb3643a572ed039180ad8/cub/warp/specializations/warp_reduce_shfl.cuh#L146. But the code is indeed full of pre-C++11 boilerplate and quite annoying to follow… The advantage of CUB would be that we do not have to update the reduction code for every new hardware, but NVIDIA would do it for us. And there is even hipCUB, so the same for AMD: https://github.com/ROCmSoftwarePlatform/hipCUB (based on https://github.com/ROCmSoftwarePlatform/rocPRIM).
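For reference, the warp-level building block in question looks roughly like this (a generic sketch using __shfl_down_sync, not the actual CUB code):

// Each step halves the number of active lanes; after log2(32) steps lane 0
// holds the sum of the whole warp, with no shared-memory round trips.
__device__ double warp_reduce_sum(double val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}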
As mentioned today in the standup, the ability to compute multiple dot products at once can save a lot of memory bandwidth, so here are two pieces from a toy elliptic solver that could serve as a test case, where L(r) is a stencil computation:
β = - np.dot(r, r)/np.dot(r, L(r)) # steepest descent
β = - np.dot(r, L(r))/np.dot(L(r), L(r)) # minimum residual
If this is the syntax, we have some work to do to reorder and fuse the operations so that only one reduction is performed.
x, y, z = zip_dot((r, r, L(r)), (r, L(r), L(r)))
a = -x/y
b = -y/z
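To make the bandwidth argument concrete: a single pass over r and L(r) can produce all three dot products at once, so each array is streamed from memory only once (plain C++ sketch, names are illustrative):

#include <array>
#include <cstddef>

// One fused pass yields dot(r,r), dot(r,Lr) and dot(Lr,Lr) together instead
// of reading r and Lr from memory twice each.
std::array<double, 3> fused_dots(double const *r, double const *Lr, std::size_t n) {
    double rr = 0, rLr = 0, LrLr = 0;
    for (std::size_t i = 0; i != n; ++i) {
        rr += r[i] * r[i];
        rLr += r[i] * Lr[i];
        LrLr += Lr[i] * Lr[i];
    }
    return {rr, rLr, LrLr};
}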
Alternatively, we provide just the interfaces for zipping the reductions, and then it is the user's responsibility to make it efficient... What do we choose?
Implemented in #1590 and #1594.
Motivation
We need reductions over 3D data. The primary use case is a scalar product of floating point numbers. The implementation should be parallel and available for both GPU and CPU. An efficient GPU implementation is the most important.
Analysis
Let us take the NVIDIA-provided parallel reduction algorithm as a basis. It is fast for CUDA GPUs and well optimized. There are two significant features of that algorithm:
- It is destructive: the input data is overwritten while the reduction runs.
- It requires the input to be contiguous, with its size padded to a power of two and the padding filled with the initial value.
Proposal
Let us add the concept of Reducible. Reducible should model Sid (whose reference_type is a non-const l-value reference) and additionally should have a T sid_reduce(Reducible&&) function available via ADL. This is a host function. There will be at least two models of Reducible -- one for GPU and one for CPU. Mind the && in the sid_reduce signature -- it accepts only r-values. This reflects the destructive nature of the algorithm: reduction is the last thing you can do with the data; afterwards it should be thrown away.

Usage Example
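A minimal host-only sketch of the intended call pattern, with toy stand-ins for make_reducible and sid_reduce (the names follow the proposal, the bodies are placeholders, not the real GT implementation):

#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Toy Reducible: owns the (padded) storage; sid_reduce consumes it by
// r-value, mirroring the destructive nature of the GPU algorithm.
struct toy_reducible {
    std::vector<double> data;
};

toy_reducible make_reducible(std::size_t size, double init) {
    return {std::vector<double>(size, init)};
}

double sid_reduce(toy_reducible &&r) {
    return std::accumulate(r.data.begin(), r.data.end(), 0.);
}

// Usage: fill through the Sid side (here just a loop standing in for the
// stencil computation), then reduce; the reducible is moved into sid_reduce.
double dot(std::vector<double> const &a, std::vector<double> const &b) {
    auto red = make_reducible(a.size(), 0.);
    for (std::size_t i = 0; i != a.size(); ++i)
        red.data[i] = a[i] * b[i];
    return sid_reduce(std::move(red));
}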
Details
When a reducible is created, its memory allocation should be padded to a power of two. All memory (including the padding) should be filled with the initial value. For the case of sum reduction the filling can be done efficiently with cudaMemset. For other reductions we need to launch a separate kernel at this point. The memory allocation could be maintained by the sid::cached_allocator mechanism.
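A sketch of that allocation step for a sum reduction over doubles (cached_allocator integration and error checking omitted; an assumption, not the final implementation):

#include <cstddef>
#include <cuda_runtime.h>

// Round the element count up to the next power of two so the tree reduction
// needs no bounds checks.
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n)
        p *= 2;
    return p;
}

// Allocate the padded buffer and fill everything, padding included, with the
// identity of the sum. cudaMemset is enough here because the all-zero byte
// pattern is exactly 0.0 for IEEE doubles; other reductions would need a
// small fill kernel instead.
double *alloc_sum_reducible(std::size_t n, std::size_t &padded_size) {
    padded_size = next_pow2(n);
    double *ptr = nullptr;
    cudaMalloc(&ptr, padded_size * sizeof(double));
    cudaMemset(ptr, 0, padded_size * sizeof(double));
    return ptr;
}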