CExA-project / dynk

Library to run parallel regions on the device or on the host dynamically
MIT License

DualView approach for runtime dispatch decision #1

Open crtrott opened 1 week ago

crtrott commented 1 week ago

Since Peter brought that up in the slack, here is a sketch of what I came up with, using DualView as the fundamental data management approach:

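// Generic helpers (sketch; assumes the Kokkos namespace is in scope):
#include <string>
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
using namespace Kokkos;

// Launch the same lambda on the device or on the host, chosen at runtime.
template<class Lambda>
void dynamic_parallel_for(std::string label, bool run_on_device, size_t N, Lambda lambda) {
  if(run_on_device) {
     parallel_for(label, RangePolicy<DefaultExecutionSpace>(0,N), lambda);
  } else {
     parallel_for(label, RangePolicy<DefaultHostExecutionSpace>(0,N), lambda);
  }
}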

// Sync the chosen side and return a read-only view of it.
template<class DV>
auto choose_side_read(bool device_side, DV a) {
  View<typename DV::const_data_type, typename DV::array_layout, AnonymousSpace, typename DV::memory_traits> tmp;
  if(device_side) {
    a.sync_device();
    tmp = a.d_view;
  } else {
    a.sync_host();
    tmp = a.h_view;
  }
  return tmp;
}

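// Sync the chosen side, mark it as modified, and return a writable view of it.
template<class DV>
auto choose_side_modify(bool device_side, DV a) {
  View<typename DV::data_type, typename DV::array_layout, AnonymousSpace, typename DV::memory_traits> tmp;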
  if(device_side) {
    a.sync_device();
    a.modify_device();
    tmp = a.d_view;
  } else {
    a.sync_host();
    a.modify_host();
    tmp = a.h_view;
  }
  return tmp;
}

// User code
void foo(DualView<const double*> a_in, DualView<double*> b_in) {
  // Run where the input is already up to date, to avoid a transfer of a_in.
  bool run_on_device = !a_in.need_sync_device();
  auto a = choose_side_read(run_on_device, a_in);
  auto b = choose_side_modify(run_on_device, b_in);
  dynamic_parallel_for("KernelName", run_on_device, a.extent(0), KOKKOS_LAMBDA(int i) {
    b(i) += a(i);
  });
}
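
To make the sync/modify bookkeeping of the sketch above easier to follow, here is a minimal Kokkos-free mock. It is only an illustration under stated assumptions: `MockDualView`, its `copies` counter, and `add_in_place` are hypothetical names, the "device" buffer is just a second host vector, and the two boolean flags are a simplification of what `Kokkos::DualView` actually tracks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a DualView: two copies of the data plus "modified" flags.
struct MockDualView {
  std::vector<double> host, device;  // "device" is simulated by a second host buffer
  bool host_modified = false;
  bool device_modified = false;
  int copies = 0;                    // number of simulated host<->device transfers

  explicit MockDualView(std::size_t n, double v = 0.0) : host(n, v), device(n, v) {}

  bool need_sync_device() const { return host_modified; }
  bool need_sync_host() const { return device_modified; }
  void modify_host() { host_modified = true; }
  void modify_device() { device_modified = true; }
  // Copy only when the other side holds newer data.
  void sync_device() {
    if (host_modified) { device = host; host_modified = false; ++copies; }
  }
  void sync_host() {
    if (device_modified) { host = device; device_modified = false; ++copies; }
  }
};

// Sync the chosen side and hand back read-only access to it.
const std::vector<double>& choose_side_read(bool device_side, MockDualView& a) {
  if (device_side) { a.sync_device(); return a.device; }
  a.sync_host();
  return a.host;
}

// Sync the chosen side, mark it as modified, and hand back writable access.
std::vector<double>& choose_side_modify(bool device_side, MockDualView& a) {
  if (device_side) { a.sync_device(); a.modify_device(); return a.device; }
  a.sync_host();
  a.modify_host();
  return a.host;
}

// b += a, executed on the side where a is already up to date.
// Returns true if the "device" side was chosen.
bool add_in_place(MockDualView& a_in, MockDualView& b_in) {
  const bool run_on_device = !a_in.need_sync_device();
  const std::vector<double>& a = choose_side_read(run_on_device, a_in);
  std::vector<double>& b = choose_side_modify(run_on_device, b_in);
  assert(a.size() == b.size());
  for (std::size_t i = 0; i < a.size(); ++i) b[i] += a[i];
  return run_on_device;
}
```

In this mock, the input never triggers a transfer because the side is picked from its state; only the output may need one copy before being modified.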

jbigot commented 1 week ago

We played with that, but as a result, the lambda becomes a template on the type of the view it captures, and IIRC, @pzehner had an issue with that under CUDA.

pzehner commented 1 week ago

> Since Peter brought that up in the slack

I guess you meant Paul...

The approach you came up with is very similar to the "layer" approach we tried (@jbigot: not the one using the templated lambdas). It works well, but it requires implementing an extra layer for parallel_* (which means maintaining this interface) and recreating the various execution policies (which means more code to maintain).

The choose_side_read/_modify strategy you proposed is interesting.
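
To make the maintenance cost of the "layer" approach mentioned above concrete, here is a small Kokkos-free sketch. Everything in it is a hypothetical stand-in (`MockHost`, `MockDevice`, `MockRange`, `MockRange2D`, `mock_parallel_for`, `DynRange`, `DynRange2D`, `dyn_parallel_for`, `launch_count`); it is not the dynk interface, only an illustration of why each policy of the underlying library needs its own space-agnostic twin and its own dispatching wrapper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for two execution spaces.
struct MockHost {};
struct MockDevice {};

// Per-space launch counter, only here so the dispatch can be observed.
template <class Space>
int& launch_count() {
  static int n = 0;
  return n;
}

// Stand-ins for the library's policies, templated on the execution space.
template <class Space> struct MockRange   { std::size_t begin, end; };
template <class Space> struct MockRange2D { std::size_t n0, n1; };

// Stand-ins for the library's parallel_for overloads (run serially here).
template <class Space, class F>
void mock_parallel_for(MockRange<Space> p, F f) {
  ++launch_count<Space>();
  for (std::size_t i = p.begin; i < p.end; ++i) f(i);
}
template <class Space, class F>
void mock_parallel_for(MockRange2D<Space> p, F f) {
  ++launch_count<Space>();
  for (std::size_t i = 0; i < p.n0; ++i)
    for (std::size_t j = 0; j < p.n1; ++j) f(i, j);
}

// The extra layer: space-agnostic policies that must mirror the library's ones...
struct DynRange   { std::size_t begin, end; };
struct DynRange2D { std::size_t n0, n1; };

// ...and one runtime-dispatching wrapper per policy.
template <class F>
void dyn_parallel_for(bool on_device, DynRange p, F f) {
  assert(p.begin <= p.end);
  if (on_device) mock_parallel_for(MockRange<MockDevice>{p.begin, p.end}, f);
  else           mock_parallel_for(MockRange<MockHost>{p.begin, p.end}, f);
}
template <class F>
void dyn_parallel_for(bool on_device, DynRange2D p, F f) {
  if (on_device) mock_parallel_for(MockRange2D<MockDevice>{p.n0, p.n1}, f);
  else           mock_parallel_for(MockRange2D<MockHost>{p.n0, p.n1}, f);
}
```

Every additional policy or parallel pattern would add another pair like `DynRange2D` and its wrapper, which is the interface that then has to be kept in step with the underlying library.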