LLNL / RAJA

RAJA Performance Portability Layer (C++)

New kernel reduction interface #1522

Open mdavis36 opened 1 year ago

mdavis36 commented 1 year ago

The new reduction interface should integrate with RAJA::kernel through the current kernel_param interface. Reduce arguments will be passed in, and the appropriate lambda arguments can be generated in a similar way to how they are generated in the forall interface:
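
For reference, a rough sketch of the corresponding forall usage (the function, array, and variable names here are illustrative, and the exact lambda argument type may vary by RAJA version):

#include "RAJA/RAJA.hpp"

void forall_reduce_sketch(const double* a, int N)
{
  double worksum = 0.0;

  // The Reduce argument is passed alongside the segment; forall generates a
  // matching lambda argument that acts as a thread-local accumulator.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
    RAJA::expt::Reduce<RAJA::operators::plus>(&worksum),
    [=](RAJA::Index_type i, double& m_worksum) {
      m_worksum += a[i];
    });

  // worksum holds the combined result once forall returns.
}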

Kernel's statement::Lambda allows arguments to be populated implicitly or explicitly, depending on how you define the statement::Lambda type. In the implicit case we need to populate lambda objects with all arguments required by the elements of the kernel_param tuple, regardless of their use in the lambda body:

data_t worksum;

using EXEC_POL_I =
  RAJA::KernelPolicy<
    RAJA::statement::ForICount<1, RAJA::statement::Param<0>, RAJA::loop_exec,
      RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::loop_exec,
        RAJA::statement::Lambda<0>
      >
    >,
    RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::loop_exec,
      RAJA::statement::ForICount<1, RAJA::statement::Param<0>, RAJA::loop_exec,
        RAJA::statement::Lambda<1>
      >
    >
  >;

RAJA::kernel_param<EXEC_POL_I>(
  RAJA::make_tuple((int)0,
                   (int)0,
                   Tile_Array,
                   RAJA::expt::Reduce<RAJA::operators::plus>(&worksum)),

    [=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array, Index_type&, data_t& m_worksum)
    { ... }, // This lambda does reduction work

    [=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array, Index_type&, data_t& m_worksum)
    { ... } // This lambda does NOT do reduction work.
  );

RAJA::kernel also allows for explicit argument definitions within a statement::Lambda type:

data_t worksum = 0;

using EXEC_POL =
  RAJA::KernelPolicy<
    RAJA::statement::For<1, RAJA::loop_exec,
      RAJA::statement::For<0, RAJA::loop_exec,
        RAJA::statement::Lambda<0, RAJA::Segs<0>, RAJA::Segs<1>, RAJA::Offsets<0>, RAJA::Offsets<1>, RAJA::Params<0>, RAJA::Params<1> >
      >
    >,
    RAJA::statement::For<0, RAJA::loop_exec,
      RAJA::statement::For<1, RAJA::loop_exec,
        RAJA::statement::Lambda<1, RAJA::Segs<0, 1>, RAJA::Offsets<0, 1>, RAJA::Params<0> >
      >
    >
  >;

RAJA::kernel_param<EXEC_POL>( 
  RAJA::make_tuple(Tile_Array,
                   RAJA::expt::Reduce<RAJA::operators::plus>(&worksum)),

  [=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array, data_t& m_worksum) {
    ...
  },

  [=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
    ...
  }
);

rchen20 commented 1 year ago

Hey @mdavis36, in the implicit lambda case, are there typos where data_t m_red ought to be data_t & worksum? If so, is this implying that we need to pass the reduced data to each lambda, regardless of whether that lambda actually performs a reduction?

mdavis36 commented 1 year ago

@rchen20 I updated the example above: the lambda argument itself is m_worksum, while the target for the final reduction result is worksum. These should be different; m_worksum is the thread-local value used before the actual reduction work is done later.
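
To sketch that distinction in code (again using the forall-style interface, assuming a RAJA build with OpenMP enabled; the names are illustrative):

#include "RAJA/RAJA.hpp"
#include <vector>

void thread_local_vs_target()
{
  const int N = 1000;
  std::vector<double> a(N, 1.0);
  const double* a_ptr = a.data();

  double worksum = 0.0;  // final reduction target; read it only after the loop returns

  // Each OpenMP thread accumulates into its own m_worksum; those per-thread
  // values are combined into worksum when the forall completes.
  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N),
    RAJA::expt::Reduce<RAJA::operators::plus>(&worksum),
    [=](RAJA::Index_type i, double& m_worksum) {
      m_worksum += a_ptr[i];
    });
}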

rcarson3 commented 12 months ago

@mdavis36 if I'm reading the above correctly, would this essentially collapse all the various different reduction types (e.g. `RAJA::ReduceSum<RAJA::seq_reduce, int>`, `RAJA::ReduceSum<RAJA::omp_reduce_ordered, int>`, `RAJA::ReduceSum<RAJA::cuda_reduce, int>`, etc.) down to one single type? So, you would only need one data type for all your different execution policies?
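
For example, something along these lines is what I'd hope to be able to write (just a sketch on my end; the templated helper and its name are mine, not an existing RAJA API):

#include "RAJA/RAJA.hpp"

// Today the reduction object's type encodes the backend, e.g.
// RAJA::ReduceSum<RAJA::seq_reduce, double> vs. RAJA::ReduceSum<RAJA::cuda_reduce, double>.
// With the new interface, a plain double plus RAJA::expt::Reduce would work for any policy:
template <typename EXEC_POL>
double sum_with_any_policy(const double* a, int N)
{
  double sum = 0.0;
  RAJA::forall<EXEC_POL>(RAJA::RangeSegment(0, N),
    RAJA::expt::Reduce<RAJA::operators::plus>(&sum),
    [=] RAJA_HOST_DEVICE (RAJA::Index_type i, double& local_sum) {
      local_sum += a[i];
    });
  return sum;
}

// e.g. sum_with_any_policy<RAJA::seq_exec>(a, N) or sum_with_any_policy<RAJA::cuda_exec<256>>(a, N)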

If so, I'd just like to say that I'd very much be for such a feature, as forall loops of mine with those operations are the only ones I can't abstract away to a single forall abstraction using something like the RAJA::expt::dynamic_forall feature for all the execution policies I support in my libraries/apps (CPU, OpenMP, CUDA, HIP, etc.).

Unfortunately, to my current knowledge, things like std::variant or std::visit are still not supported on the device, which would have allowed a simple-ish solution to the above.