GridTools / gridtools

Libraries and utilities to develop performance portable applications for weather and climate.
https://gridtools.github.io/gridtools

Gridtools Reduction Design #1587

Closed anstaf closed 3 years ago

anstaf commented 3 years ago

Motivation

We need reductions over 3D data. The primary use case is a scalar product of floating-point numbers. The implementation should be parallel and available for both GPU and CPU; an efficient GPU implementation is the most important.

Analysis

Let us take the NVIDIA-provided parallel reduction algorithm as a basis. It is fast on CUDA GPUs and well optimized. There are two significant features of that algorithm:

- it is destructive: partial results are written back into the input buffer, so the input is consumed by the reduction;
- it operates on a flat, contiguous range whose size is padded to a power of two, so the active range can be halved evenly at every step.
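
For context, a minimal sketch of the tree-based shared-memory kernel from the NVIDIA samples (simplified; the tuned variants add warp unrolling, shuffle intrinsics, and multiple elements per thread). Note how the halving loop relies on a power-of-two block size:

__global__ void reduce_sum(const double *in, double *out, unsigned n) {
    extern __shared__ double sdata[]; // one slot per thread
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    // out-of-range threads load the identity of the sum (0.0)
    sdata[tid] = i < n ? in[i] : 0.0;
    __syncthreads();
    // tree reduction: halve the active range at every step
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0]; // one partial result per block
}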

Proposal

Let us add the concept of Reducible. A Reducible should model Sid (whose reference_type is a non-const l-value reference) and additionally should provide a T sid_reduce(Reducible&&) function, found via ADL. This is a host function. There will be at least two models of Reducible: one for GPU and one for CPU. Mind the && in the sid_reduce signature: it accepts only an r-value. This reflects the destructive nature of the algorithm. Reduction is the last thing you can do with the data; afterwards it is thrown away.
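
To make the contract concrete, here is a minimal illustrative CPU model of Reducible; the names and layout are ours, not the actual GridTools API. The reduction is found via ADL and consumes the object:

#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

template <class T, class BinOp = std::plus<T>>
class cpu_reducible {
    T m_identity;
    std::vector<T> m_data;
    BinOp m_op;

  public:
    // the buffer is pre-filled with the identity of the reduction
    explicit cpu_reducible(std::size_t size, T identity = {}, BinOp op = {})
        : m_identity(identity), m_data(size, identity), m_op(op) {}

    // non-const l-value access, so a stencil can write into it
    T &operator[](std::size_t i) { return m_data[i]; }

    // found via ADL; accepts only an r-value: the object is consumed
    friend T sid_reduce(cpu_reducible &&obj) {
        return std::accumulate(
            obj.m_data.begin(), obj.m_data.end(), obj.m_identity, obj.m_op);
    }
};

// usage: cpu_reducible<double> acc(n); /* fill */ auto r = sid_reduce(std::move(acc));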

Usage Example

// we use some stencil computation to put the data into the reducible
struct copy_functor {
    using in = in_accessor<0>;
    using out = inout_accessor<1>;
    using param_list = make_param_list<in, out>;

    template <class Eval>
    GT_FUNCTION static void apply(Eval &&eval) { eval(out()) = eval(in()); }
};
...
// some fields are regular ones
auto in = storage::builder<storage::gpu>.type<double>().dimensions(100, 100, 100).value(1)();

// here we create a reducible (API is not decided yet)
// here we must specify the reduction function
auto out = make_reducible<sum, gpu>.type<double>().dimensions(100,100,100)();

// run our stencil as usual 
run_single_stage(copy_functor(), stencil::gpu(), grid, in, out);

// finally run the reduction; mind the std::move here
auto res = sid::reduce(std::move(out));

Details

When a reducible is created, the memory allocation should be padded to a power of two, and all memory (including the padding) should be filled with the initial value. For a sum reduction the filling can be done efficiently by cudaMemset; for other reductions we need to launch a separate kernel at this point. The memory allocation could be maintained by the sid::cached_allocator mechanism.
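
As a rough illustration of the padding rule (the helper name is ours, not part of the library):

#include <cstddef>

// round an allocation size up to the next power of two so the tree-based
// reduction can halve the range evenly at every step
constexpr std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// a 100 x 100 x 100 reducible would allocate next_pow2(1'000'000) ==
// 1'048'576 elements; the extra 48'576 slots are filled with the identity
// of the reduction (0 for sum), so they do not affect the result
static_assert(next_pow2(1'000'000) == 1'048'576, "padding example");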

fthaler commented 3 years ago

Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.

havogt commented 3 years ago

I'll drop some comments here; we can discuss in more detail via VC, or I can expand later:

anstaf commented 3 years ago

@havogt Answering in order:

havogt commented 3 years ago

> Frankly I don't want to develop my own GPU parallel reduction algorithm here. If you supply me with the links to non-destructive algorithms, I can do comparisons.

Actually the reductions from the cuda samples are non-destructive (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/reduction). Which examples did you look at?

anstaf commented 3 years ago

> > Frankly I don't want to develop my own GPU parallel reduction algorithm here. If you supply me with the links to non-destructive algorithms, I can do comparisons.
>
> Actually the reductions from the cuda samples are non-destructive (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/reduction). Which examples did you look at?

Indeed! I was fooled by the signatures without const modifiers. That changes everything.

mbianco commented 3 years ago

Let me see if I understand. Having the reduction on a sid would mean that, if I want to implement a dot product of two sets of values, I need to zip the two sets into a single sid, then apply a map operation to transform the pairs of the zip into scalars, and then reduce. Something like a dot-product pipeline.
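
For what it's worth, Thrust (mentioned above) already expresses exactly this zip-map-reduce pattern: thrust::inner_product fuses the elementwise map with the reduction in a single pass. A sketch:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/inner_product.h>

double dot(thrust::device_vector<double> const &a,
           thrust::device_vector<double> const &b) {
    // zip a and b, map each pair with multiplies, reduce with plus
    return thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0,
        thrust::plus<double>(), thrust::multiplies<double>());
}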

mbianco commented 3 years ago

Otherwise I do not see how to avoid the stencil... and the reduction on a sid does not compose with the stencils anyway, right?

havogt commented 3 years ago

I agree.

For me the most intuitive API would be the following (modulo syntactic sugar):

template<class T, class BinaryOp, class Zipper, class... Sids>
T reduce(T init, BinaryOp binary_op, Zipper zipper, Sids&&... sids);

E.g., for the dot product:

reduce(0, std::plus{}, std::multiplies{}, v1, v2);
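
A naive sequential sketch of that signature, just to pin down the semantics (the real implementation would be a parallel kernel; here each sid is simplified to a random-access container):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

template <class T, class BinaryOp, class Zipper, class... Sids>
T reduce(T init, BinaryOp binary_op, Zipper zipper, Sids &&... sids) {
    T acc = init;
    std::size_t size = std::min({sids.size()...});
    for (std::size_t i = 0; i != size; ++i)
        // zipper maps one element from each input to a single value,
        // which binary_op then folds into the accumulator
        acc = binary_op(acc, zipper(sids[i]...));
    return acc;
}

// the dot product above: reduce(0.0, std::plus<>{}, std::multiplies<>{}, v1, v2)
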
mbianco commented 3 years ago

but then reduce is not a member function of sid, right?

anstaf commented 3 years ago

We had a Zoom discussion with @havogt. Here is the outcome:

We agreed that the Sid concept is not related to the reduction task. Sid serves as the input/output of the stencil calculus and represents N-dimensional data with a contiguous memory allocation. Reduction deals with one-dimensional data; no strides are needed to iterate over it. To do an efficient parallel reduction we probably need the input to be contiguous (without padding in the middle) and with its size aligned to a power of two (to some extent).

The API proposed by @havogt will not work as is, because Sids don't hold information about the computation area, or even about the dimensionality of the computation area.

We need a separate concept for the input of a reduction. Let us name it Reducible. We also need to somehow marry stencil and reduction computations in GT. A sufficient requirement for that is to have at least one (template) type that models both Sid and Reducible: we create that thing, fill it with data using our stencil interface (because it models Sid), and reduce it to a single value (because it models Reducible).

It looks like this solution will cover all reduction use cases that we currently have in mind.

Ideally, however, we want all our current Sid models (DataStores, Python buffer-protocol objects, and C arrays) to model Reducible as well. This cannot be done for free: the parallel reduction implementation becomes more complex and slower if it has to deal with padding in the middle and with data sizes that are not aligned to a power of two.

We agreed on the following plan:

anstaf commented 3 years ago

> Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.

@fthaler I have had a look at them. For the record:

fthaler commented 3 years ago

CUB actually does use intrinsics and even inline PTX, for example here: https://github.com/NVlabs/cub/blob/618a46c27764f0e0b86fb3643a572ed039180ad8/cub/warp/specializations/warp_reduce_shfl.cuh#L146. But the code is indeed full of pre-C++11 boilerplate and quite annoying to follow... The advantage of CUB would be that we do not have to update the reduction code for every new hardware generation; NVIDIA would do it for us. And there is even hipCUB, so the same holds for AMD: https://github.com/ROCmSoftwarePlatform/hipCUB (based on https://github.com/ROCmSoftwarePlatform/rocPRIM).
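
For completeness, this is roughly what delegating to CUB would look like: a sketch of the standard two-phase cub::DeviceReduce::Sum pattern, where the first call queries the scratch size and the second runs the tuned reduction:

#include <cub/cub.cuh>

double device_sum(const double *d_in, int num_items) {
    double *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(double));
    // phase 1: pass a null workspace to query the temporary-storage size
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    // phase 2: allocate the workspace and run the reduction
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    double h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_out);
    return h_out;
}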

tehrengruber commented 3 years ago

As mentioned in today's standup, the ability to compute multiple dot products at once can save a lot of memory bandwidth. Here are two pieces from a toy elliptic solver that could serve as a test case, where L(r) is a stencil computation:

β = -np.dot(r, r) / np.dot(r, L(r))        # steepest descent
β = -np.dot(r, L(r)) / np.dot(L(r), L(r))  # minimum residual
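
To make the bandwidth argument concrete, here is a plain C++ sketch of the fused variant: one pass accumulates all three distinct dot products, so r and L(r) are each read from memory only once:

#include <array>
#include <cstddef>
#include <vector>

// x = dot(r, r), y = dot(r, L(r)), z = dot(L(r), L(r)) in a single sweep
std::array<double, 3> fused_dots(std::vector<double> const &r,
                                 std::vector<double> const &lr) {
    double x = 0, y = 0, z = 0;
    for (std::size_t i = 0; i != r.size(); ++i) {
        x += r[i] * r[i];
        y += r[i] * lr[i];
        z += lr[i] * lr[i];
    }
    return {x, y, z};
}

// steepest descent: beta = -x / y; minimum residual: beta = -y / z
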
mbianco commented 3 years ago

If this is the syntax, we have some work to do to reorder it and fuse operations so that only a single reduction pass is performed.

  x, y, z = zip_dot((r, r, L(r)), (r, L(r), L(r)))
  a = x / y
  b = y / z

Alternatively, we provide just the interfaces for zipping the reductions, and it is the user's responsibility to make it efficient... What do we choose?

havogt commented 3 years ago

Implemented in #1590 and #1594.