Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.
I'll drop some comments here; we can discuss in more detail via VC, or I can expand later:
make_reducible must not allocate, but also re-initializing the padding area (if it is destroyed during the reduction) is not for free. It should be compared with a non-destructive algorithm.

@havogt Answering in order:

Re-initializing the padding is a memcopy call (probably from device, if that is cheaper); for the sum reduction it is even a memset. My intuition is that it should be an order of magnitude cheaper than the loop body plus the reduction itself. Frankly, I don't want to develop my own GPU parallel reduction algorithm here. If you supply me with links to non-destructive algorithms, I can do comparisons.
Actually the reductions from the cuda samples are non-destructive (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/reduction). Which examples did you look at?
Indeed!! I was fooled by the signatures without const modifiers. It changes everything.
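For reference, the non-destructive pattern looks roughly like this (a minimal sketch in the spirit of the cuda samples, not their actual code): the input is taken through a const pointer and the per-block partial sums go to a separate output buffer, so the source data survives.

// Minimal sketch of a non-destructive block sum: the input is only read,
// per-block partial results are written to a separate buffer.
// blockDim.x is assumed to be a power of two.
__global__ void block_sum(double const *__restrict__ in, double *__restrict__ out, int n) {
    extern __shared__ double smem[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    smem[tid] = i < n ? in[i] : 0.; // pad the last block with the identity
    __syncthreads();

    // tree reduction in shared memory; the global input is never modified
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];
}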
Let me see if I understand. Having the reduction on a sid would mean that, if I want to implement a dot product of two sets of values, I need to zip the two sets into a single sid, then apply a map operation to transform the pairs of the zip into scalars, and then reduce. Something like
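(a sketch with hypothetical zip/transform helpers, only to make the composition concrete):

auto zipped = zip(v1, v2);                             // hypothetical: a sid of pairs
auto products = transform(zipped, std::multiplies{});  // hypothetical: map each pair to a scalar
double dot = reduce(products, 0., std::plus{});        // reduction over a single sid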
Otherwise I do not see how to avoid the stencil... and the reduction on a sid does not compose with the stencils anyway, right?
I agree.
For me the most intuitive API would be the following (modulo syntactic sugar):
template<class T, class BinaryOp, class Zipper, class... Sids>
T reduce(T init, BinaryOp binary_op, Zipper zipper, Sids&&... sids);
E.g. for the dot product
reduce(0, std::plus{}, std::multiplies{}, v1, v2);
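A sequential sketch of how such a signature could behave (plain indexable ranges instead of real Sids, purely illustrative):

#include <algorithm>
#include <cstddef>

// Illustrative only: treats the "sids" as equally indexable ranges so that
// the roles of zipper and binary_op are concrete.
template <class T, class BinaryOp, class Zipper, class... Ranges>
T reduce(T init, BinaryOp binary_op, Zipper zipper, Ranges const &...ranges) {
    std::size_t size = std::min({ranges.size()...});
    for (std::size_t i = 0; i != size; ++i)
        init = binary_op(init, zipper(ranges[i]...));
    return init;
}

// e.g. double dot = reduce(0., std::plus{}, std::multiplies{}, v1, v2);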
but then reduce is not a member function of sid, right?
We had a Zoom discussion with @havogt. Here is the outcome:
We agreed that the Sid concept is not related to the reduction task. Sid serves as the input/output of the stencil calculus and represents N-dimensional data with a contiguous memory allocation. Reduction deals with one-dimensional data; no strides are needed to iterate over it. To do an effective parallel reduction we probably need the input to be contiguous (without paddings in the middle) and with its size aligned to a power of two (to some extent).

The API proposed by @havogt will not work as is, because Sids don't hold information about the computation area, or even about the dimensionality of the computation area.

We need a separate concept for the input of the reduction. Let us name it Reducible. We also need to somehow marry stencil and reduction computations in GT. A sufficient requirement for that is to have at least one (template) type that models both Sid and Reducible. I.e. we create that thing, fill it with data using our stencil interface (because it models Sid), and reduce it to a single value (because it models Reducible).
It looks like this solution will cover all reduction use cases that we currently have in mind.
However, ideally we want all our current Sid models (DataStores, python protocol buffers and C-arrays) to model Reducible as well. But this cannot be done for free: the parallel reduction implementation becomes more complex and slower if we have to deal with paddings in between and with a data size that is not aligned to a power of two.
We agreed on the following plan:
1. Implement a type that models both Reducible and Sid, and the cuda reduce implementation using it.
2. Relax the Reducible requirements such that it would be possible for DataStore and friends to model it. Those relaxed Reducibles will use a more complicated and slower cuda reduce algorithm. At this point we can compare the performance of both solutions.

Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.

@fthaler I have had a look at them. For the record:
- The Thrust implementation of the cuda reduction delegates to CUB: https://github.com/NVIDIA/thrust/blob/main/thrust/system/cuda/detail/reduce.h#L945
- The CUB source code is rather verbose; it is full of pre-C++11 template metaprogramming and it is hard to follow what is going on. From my first look (here: https://github.com/NVlabs/cub/blob/main/cub/device/dispatch/dispatch_reduce.cuh#L71 and here: https://github.com/NVlabs/cub/blob/main/cub/agent/agent_reduce.cuh#L100) it doesn't use warp level intrinsics like the most advanced reduce kernel from the cuda samples does.
CUB actually does use intrinsics and even inline-PTX, for example here: https://github.com/NVlabs/cub/blob/618a46c27764f0e0b86fb3643a572ed039180ad8/cub/warp/specializations/warp_reduce_shfl.cuh#L146. But the code is indeed full of pre-C++11 boilerplate and quite annoying to follow… The advantage of CUB would be that we do not have to update the reduction code for every new hardware, but NVIDIA would do it for us. And there is even hipCUB, so the same for AMD: https://github.com/ROCmSoftwarePlatform/hipCUB (based on https://github.com/ROCmSoftwarePlatform/rocPRIM).
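For reference, the warp-level building block in question looks roughly like this (a generic sketch using __shfl_down_sync, not the actual CUB code):

// Each step halves the number of active lanes; after log2(32) steps lane 0
// holds the sum of the whole warp, with no shared-memory round trips.
__device__ double warp_reduce_sum(double val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}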
As mentioned today in the standup, the ability to compute multiple dot products at once can save a lot of memory bandwidth, so here are two pieces from a toy elliptic solver that could serve as a test case, where L(r) is a stencil computation:
β = - np.dot(r, r)/np.dot(r, L(r)) # steepest descent
β = - np.dot(r, L(r))/np.dot(L(r), L(r)) # minimum residual
If this is the syntax, we have some work to do to reorder and fuse the operations so that only one reduction is performed.
x, y, z = zip_dot((r, r, L(r)), (r, L(r), L(r)))
a = -x/y
b = -y/z
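To make the bandwidth argument concrete: a single pass over r and L(r) can produce all three dot products at once, so each array is streamed from memory only once (plain C++ sketch, names are illustrative):

#include <array>
#include <cstddef>

// One fused pass yields dot(r,r), dot(r,Lr) and dot(Lr,Lr) together instead
// of reading r and Lr from memory twice each.
std::array<double, 3> fused_dots(double const *r, double const *Lr, std::size_t n) {
    double rr = 0, rLr = 0, LrLr = 0;
    for (std::size_t i = 0; i != n; ++i) {
        rr += r[i] * r[i];
        rLr += r[i] * Lr[i];
        LrLr += Lr[i] * Lr[i];
    }
    return {rr, rLr, LrLr};
}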
Alternatively, we provide just the interfaces for zipping the reductions, and then it is the user's responsibility to make it efficient... What do we choose?
Implemented in #1590 and #1594.
Motivation
We need reductions over 3D data. The primary use case is a scalar product of floating point numbers. The implementation should be parallel and available for both GPU and CPU. An efficient GPU implementation is the most important.
Analysis
Let us take the NVIDIA-provided parallel reduction algorithm as a basis. It is fast for CUDA GPUs and well optimized. There are two significant features of that algorithm:
- It is destructive: the input data is overwritten while the reduction runs.
- It requires the input to be contiguous, with its size padded to a power of two and the padding filled with the initial value.
Proposal
Let us add the concept of Reducible. Reducible should model Sid (whose reference_type is a non-const l-value reference) and additionally should have a T sid_reduce(Reducible&&) function available via ADL. This is a host function. There will be at least two models of Reducible -- one for GPU and one for CPU. Mind the && in the sid_reduce signature -- it accepts only r-values. This reflects the destructive nature of the algorithm: reduction is the last thing you can do with the data; afterwards it should be thrown away.

Usage Example
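A minimal host-only sketch of the intended call pattern, with toy stand-ins for make_reducible and sid_reduce (the names follow the proposal, the bodies are placeholders, not the real GT implementation):

#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Toy Reducible: owns the (padded) storage; sid_reduce consumes it by
// r-value, mirroring the destructive nature of the GPU algorithm.
struct toy_reducible {
    std::vector<double> data;
};

toy_reducible make_reducible(std::size_t size, double init) {
    return {std::vector<double>(size, init)};
}

double sid_reduce(toy_reducible &&r) {
    return std::accumulate(r.data.begin(), r.data.end(), 0.);
}

// Usage: fill through the Sid side (here just a loop standing in for the
// stencil computation), then reduce; the reducible is moved into sid_reduce.
double dot(std::vector<double> const &a, std::vector<double> const &b) {
    auto red = make_reducible(a.size(), 0.);
    for (std::size_t i = 0; i != a.size(); ++i)
        red.data[i] = a[i] * b[i];
    return sid_reduce(std::move(red));
}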
Details
When a reducible is created, its memory allocation should be padded to a power of two. All memory (including the padding) should be filled with the initial value. For the case of sum reduction the filling can be done efficiently with cudaMemset. For other reductions we need to launch a separate kernel at this point. The memory allocation could be maintained by the sid::cached_allocator mechanism.
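A sketch of that allocation step for a sum reduction over doubles (cached_allocator integration and error checking omitted; an assumption, not the final implementation):

#include <cstddef>
#include <cuda_runtime.h>

// Round the element count up to the next power of two so the tree reduction
// needs no bounds checks.
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n)
        p *= 2;
    return p;
}

// Allocate the padded buffer and fill everything, padding included, with the
// identity of the sum. cudaMemset is enough here because the all-zero byte
// pattern is exactly 0.0 for IEEE doubles; other reductions would need a
// small fill kernel instead.
double *alloc_sum_reducible(std::size_t n, std::size_t &padded_size) {
    padded_size = next_pow2(n);
    double *ptr = nullptr;
    cudaMalloc(&ptr, padded_size * sizeof(double));
    cudaMemset(ptr, 0, padded_size * sizeof(double));
    return ptr;
}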