LLNL / RAJA

RAJA Performance Portability Layer (C++)
BSD 3-Clause "New" or "Revised" License

Need array of reduction variables #246

Open tepperly opened 7 years ago

tepperly commented 7 years ago

I have data in a 3-D array indexed by time and two spatial dimensions. When running, I need to periodically report data reduced from this 3-D array. The problem definition can specify an arbitrary number of reference regions where separate statistics need to be calculated. Each reference region defines a subset of the spatial domain where the end user wants a separate set of summary statistics calculated. The number of reference regions is not known at compile time, so I would like to make an array of ReduceSum<> variables at runtime. This doesn't seem possible because ReduceSum has an explicit constructor, and I can't figure a way to make a container (array, vector, ...) of ReduceSum variables.

tepperly commented 7 years ago

Would placement new work? I could allocate a "large enough" hunk of memory and then individually construct the ReduceSum objects with a placement new.

willkill07 commented 7 years ago

Something like this class I just rolled up appears to work:

template <typename ReduceType>
struct ReduceArray {
  ReduceType* data;
  unsigned int size;

  template <typename T>
  explicit ReduceArray (unsigned int count, T value)
    : data (reinterpret_cast<ReduceType*>(malloc(sizeof(ReduceType) * count))),
      size (count) {
    for (unsigned int i = 0; i < count; ++i) {
      new (&data[i]) ReduceType(value); // placement new: memory is uninitialized
    }
  }

  ReduceArray (ReduceArray const & other)
    : data (reinterpret_cast<ReduceType*>(malloc(sizeof(ReduceType) * other.size))),
      size (other.size) {
    for (unsigned int i = 0; i < size; ++i) {
      new (&data[i]) ReduceType(other.data[i]);
    }
  }

  ~ReduceArray () {
    for (unsigned int i = 0; i < size; ++i) {
      data[i].~ReduceType();
    }
    free(data);
  }

  ReduceType& operator[] (int i) const {
    return data[i];
  }
};

Example:

#include <vector>
#include <algorithm>
#include <numeric>
#include <iostream>
#include <utility>
#include <cstdlib>

#include <RAJA/RAJA.hpp>

using exec_pol = RAJA::omp_parallel_for_exec;
using red_pol = RAJA::omp_reduce;

int main() {
  int reducerCount;
  int N;
  std::cin >> N >> reducerCount;
  ReduceArray<RAJA::ReduceSum<red_pol, double>> reducers(reducerCount, 0.0);
  double* a = new double[N];
  std::iota(a, a + N, 0);
  std::transform(a, a + N, a, [=](double v) { return v / N; });
  RAJA::forall<exec_pol>(0, N, [=] (int i) {
    int reducerId = i % reducerCount;
    reducers[reducerId] += a[i];
  });
  for (int i = 0; i < reducerCount; ++i)
    std::cout << reducers[i].get() << std::endl;
  delete[] a;
  return EXIT_SUCCESS;
}

Output with parameters 1000000 10 yields the expected result across various numbers of OpenMP threads (1-20):

49999.5
49999.6
49999.7
49999.8
49999.9
50000
50000.1
50000.2
50000.3
50000.4
willkill07 commented 7 years ago

Ultimately, you'd want to use placement new / delete instead of my malloc/free monstrosity, but I think you get the idea

willkill07 commented 7 years ago

Here's a better implementation of ReduceArray which may allow me to sleep better at night:


#include <memory>

template <typename ReduceType>
class ReduceArray {
  static ReduceType* alloc(unsigned int count) {
    return static_cast<ReduceType*>(::operator new(sizeof(ReduceType) * count));
  }

public:

  template <typename T>
  explicit ReduceArray(unsigned int count, T value) noexcept
  : data(alloc(count)), size(count) {
    for (unsigned int i = 0; i < size; ++i)
      new (std::addressof(data[i])) ReduceType(value);
  }

  ReduceArray(ReduceArray const& other) noexcept
  : data(alloc(other.size)), size(other.size) {
    for (unsigned int i = 0; i < size; ++i)
      new (std::addressof(data[i])) ReduceType(other.data[i]);
  }

  ~ReduceArray() noexcept {
    for (unsigned int i = 0; i < size; ++i)
      data[i].~ReduceType();
  }

  ReduceType& operator[] (unsigned int i) const noexcept {
    return data[i];
  }

private:
  struct Deleter {
    void operator()(ReduceType* p) const noexcept {
      ::operator delete(p); // storage came from ::operator new, not new[]
    }
  };
  std::unique_ptr<ReduceType[], Deleter> data;
  unsigned int size;
};
ajkunen commented 7 years ago

This is a common pattern we see in transport. A large number of our kernels are dimensionally reductive: most of the time they take some N-dimensional array and produce an (N-1)-dimensional array.

Sometimes the N-1 dimensions are embarrassingly parallel, and other times they are not.

There are a couple of things that make this hard, and one of the biggest is GPUs.

For example: if I have N-1 independent reductions, how do I map this to CUDA? I can have N-1 subsets of threads doing their own reductions, but I don't want each thread involved in N-1 reductions.

Here is an example kernel in pseudo code: (which happens to be a common pattern in ARDRA)

Given phi[zone][group] and w[group] produce spectrum[group]:

foreach zone: foreach group: spectrum[group] += phi[zone][group] * w[group]

We might have 10k zones, and 1k groups… so it would be nice to have ~100k threads for a CUDA kernel, but you don’t want each thread participating in 1k reductions. Also, you don’t want to launch 1k kernels each with their own reduction.

Does anyone have any ideas of how to engineer this? I’m not sure this is exactly what Tom wanted, but I think it would be generally useful.

-Adam


willkill07 commented 7 years ago

For example: If I have N-1 independent reductions, how do I map this to CUDA? I can have N-1 subsets of threads doing their own reductions, but I don’t want each thread involved in N-1 reductions. We might have 10k zones, and 1k groups… so it would be nice to have ~100k threads for a CUDA kernel, but you don’t want each thread participating in 1k reductions. Also, you don’t want to launch 1k kernels each with their own reduction.

Well, it also depends what architecture you are targeting. Pascal and newer have really good double-precision atomic addition, so interleaving ~100 reducers at a time wouldn't be that bad of a strategy. It will be far from peak performance, but you get the bandwidth increase over CPU.

Granted, you wouldn't want all 1K reducers going through at once. forallN+tile helps tile the problem small enough to fit well, but I doubt it would scale as well if you're performing a different reduction operation (like minloc).

Does anyone have any ideas of how to engineer this? I’m not sure this is exactly what Tom wanted, but I think it would be generally useful.

I think expanding ReduceTypes to include reduction arrays would be a good first step. This strategy would map well to both CPUs and GPUs and allow the user to specify how big the reduction arrays are. The machinery is already in place to tile for a smaller problem size accordingly with forallN.

DavidPoliakoff commented 7 years ago

It'd be worth thinking about how this might play with any reduction refactoring we're thinking of doing. If we do move to one of the things @trws and @willkill07 were talking about, a model where we don't just add into a captured object blindly, how would that look with reduction arrays?

tepperly commented 7 years ago

Thanks for the suggestions. It also works to make an array type that I use as the reduction type (the type passed into ReduceSum). Ultimately, I would like a solution that's valid for GPU and OpenMP (general purpose cores).

gardner48 commented 6 years ago

I have a similar case to Adam's (I'm computing the dot product of a vector with multiple other vectors, i.e., dotprods[j] += x[i] * Y[j][i]). Have there been any changes since this thread started that would be helpful for handling reduction arrays on GPUs?

DavidPoliakoff commented 6 years ago

Hey David (@gardner48 ),

We've done a lot of work in a construct called "Kernel", which we're testing these kinds of things in. I'll ping @ajkunen and @jonesholger, as they've done the most with it, but I believe that's where a lot of our reduction array work is happening.

artv3 commented 6 years ago

Hi @gardner48, in addition to what @DavidPoliakoff said, I would also check out the RAJA atomics. RAJA atomics may be used to accomplish the same goal. Example 10 illustrates a binning example.

gardner48 commented 6 years ago

Thanks David (@DavidPoliakoff) and Arturo (@artv3).

David, is the "Kernel" construct something that's in the current version of RAJA or is it in development/testing branch for a future release?

Arturo, are you suggesting to use an atomicAdd for each element in the dot products or is there a way to get partial reductions from each thread block and then use an atomicAdd to update the global value?

artv3 commented 6 years ago

Hey David (@gardner48),

I was thinking of using atomic adds on dotprods (i.e., RAJA::atomic::atomicAdd(&dotprods[id], x[i] * Y[j][i]);). I hear the CUDA atomics are pretty fast now.

rhornung67 commented 6 years ago

@gardner48 if you want to use the 'kernel' stuff, I suggest using the 'release/0.6.0rc3' branch. We are preparing this for an upcoming release with applications using it in anger now. There is some basic documentation at https://readthedocs.org/projects/raja/ (go to the release-0.6.0rc3 version) and I'm working on filling in more.

gardner48 commented 6 years ago

Thanks @artv3 that's what I thought you meant. In my case i is much larger than j and id = j. So, for now at least, I'm guessing it's better to compute the dot products separately using a scalar reduction variable and I'll try out the 'kernel' stuff from the 0.6.0rc3 branch @rhornung67 pointed to.