tepperly opened this issue 7 years ago
Would placement new work? I could allocate a "large enough" hunk of memory and then individually construct the ReduceSum objects with a placement new.
Something like this class I just rolled up appears to work:
```cpp
template <typename ReduceType>
struct ReduceArray {
  ReduceType* data;
  unsigned int size;

  template <typename T>
  explicit ReduceArray(unsigned int count, T value)
    : data(reinterpret_cast<ReduceType*>(malloc(sizeof(ReduceType) * count))),
      size(count) {
    // The malloc'd storage is uninitialized, so construct in place
    // with placement new rather than assigning.
    for (unsigned int i = 0; i < count; ++i)
      new (&data[i]) ReduceType(value);
  }

  ReduceArray(ReduceArray const& other)
    : data(reinterpret_cast<ReduceType*>(malloc(sizeof(ReduceType) * other.size))),
      size(other.size) {
    for (unsigned int i = 0; i < size; ++i)
      new (&data[i]) ReduceType(other.data[i]);
  }

  ~ReduceArray() {
    for (unsigned int i = 0; i < size; ++i)
      data[i].~ReduceType();
    free(data);
  }

  ReduceType& operator[](unsigned int i) const {
    return data[i];
  }
};
```
Example:
```cpp
#include <vector>
#include <algorithm>
#include <numeric>
#include <iostream>
#include <utility>
#include <cstdlib>

#include <RAJA/RAJA.hpp>

using exec_pol = RAJA::omp_parallel_for_exec;
using red_pol  = RAJA::omp_reduce;

int main() {
  int reducerCount;
  int N;
  std::cin >> N >> reducerCount;

  ReduceArray<RAJA::ReduceSum<red_pol, double>> reducers(reducerCount, 0.0);

  double* a = new double[N];
  std::iota(a, a + N, 0);
  std::transform(a, a + N, a, [=](double v) { return v / N; });

  RAJA::forall<exec_pol>(0, N, [=](int i) {
    int reducerId = i % reducerCount;
    reducers[reducerId] += a[i];
  });

  for (int i = 0; i < reducerCount; ++i)
    std::cout << reducers[i].get() << std::endl;

  delete[] a;
  return EXIT_SUCCESS;
}
```
Running with input `1000000 10` yields the expected result across various numbers of OpenMP threads (1–20):

```
49999.5
49999.6
49999.7
49999.8
49999.9
50000
50000.1
50000.2
50000.3
50000.4
```
Ultimately, you'd want to use placement new / delete instead of my malloc/free monstrosity, but I think you get the idea.
Here's a better implementation of ReduceArray which may allow me to sleep better at night:
```cpp
#include <memory>
#include <new>

template <typename ReduceType>
class ReduceArray {
  // Raw storage only; elements are constructed below with placement new.
  static ReduceType* alloc(unsigned int count) {
    return static_cast<ReduceType*>(::operator new(sizeof(ReduceType) * count));
  }
public:
  template <typename T>
  explicit ReduceArray(unsigned int count, T value)
    : data(alloc(count)), size(count) {
    for (unsigned int i = 0; i < size; ++i)
      new (std::addressof(data[i])) ReduceType(value);
  }

  ReduceArray(ReduceArray const& other)
    : data(alloc(other.size)), size(other.size) {
    for (unsigned int i = 0; i < size; ++i)
      new (std::addressof(data[i])) ReduceType(other.data[i]);
  }

  ~ReduceArray() noexcept {
    for (unsigned int i = 0; i < size; ++i)
      data[i].~ReduceType();
  }

  ReduceType& operator[](unsigned int i) const noexcept {
    return data[i];
  }

private:
  struct Deleter {
    void operator()(ReduceType* p) const noexcept {
      // The storage came from ::operator new, so release it the same way
      // (plain `delete` here would be undefined behavior).
      ::operator delete(p);
    }
  };
  std::unique_ptr<ReduceType[], Deleter> data;
  unsigned int size;
};
```
This is a common pattern we see in transport…. A large number of our kernels are dimensionally reductive. Most of the time they take some N-dimensional array and produce an N-1-dimensional array.
Sometimes the N-1 dimensions are embarrassingly parallel, and other times they are not.
There are a couple of things that make this hard… and one of the big ones is GPUs.
For example: If I have N-1 independent reductions, how do I map this to CUDA? I can have N-1 subsets of threads doing their own reductions, but I don’t want each thread involved in N-1 reductions.
Here is an example kernel in pseudocode (it happens to be a common pattern in ARDRA). Given `phi[zone][group]` and `w[group]`, produce `spectrum[group]`:

```
foreach zone:
  foreach group:
    spectrum[group] += phi[zone][group] * w[group]
```
We might have 10k zones, and 1k groups… so it would be nice to have ~100k threads for a CUDA kernel, but you don’t want each thread participating in 1k reductions. Also, you don’t want to launch 1k kernels each with their own reduction.
Does anyone have any ideas of how to engineer this? I’m not sure this is exactly what Tom wanted, but I think it would be generally useful.
-Adam
> For example: If I have N-1 independent reductions, how do I map this to CUDA? I can have N-1 subsets of threads doing their own reductions, but I don’t want each thread involved in N-1 reductions. We might have 10k zones, and 1k groups… so it would be nice to have ~100k threads for a CUDA kernel, but you don’t want each thread participating in 1k reductions. Also, you don’t want to launch 1k kernels each with their own reduction.
Well, it also depends what architecture you are targeting. Pascal and newer have really good double-precision atomic addition, so interleaving ~100 reducers at a time wouldn't be that bad of a strategy. It will be far from peak performance, but you get the bandwidth increase over CPU.
Granted, you wouldn't want all 1K reducers going through at once. `forallN` + tiling helps tile the problem small enough to fit well, but I doubt it would scale as well if you're performing a different reduction operation (like minloc).
> Does anyone have any ideas of how to engineer this? I’m not sure this is exactly what Tom wanted, but I think it would be generally useful.
I think expanding ReduceTypes to include reduction arrays would be a good first step. This strategy would map well to both CPUs and GPUs and allow the user to specify how big the reduction arrays are. The machinery is already in place to tile for a smaller problem size accordingly with `forallN`.
It'd be worth thinking about how this might play with any reduction refactoring we're thinking of doing. If we do move to one of the things @trws and @willkill07 were talking about, a model where we don't just add into a captured object blindly, how would that look with reduction arrays?
Thanks for the suggestions. It also works to make an array type that I use as the reduction type (the type passed into ReduceSum). Ultimately, I would like a solution that's valid for GPU and OpenMP (general purpose cores).
I have a similar case to Adam (I'm computing the dot product of a vector with multiple other vectors, i.e., `dotprods[j] += x[i] * Y[j][i]`). Have there been any changes since this thread started that would be helpful for handling reduction arrays on GPUs?
Hey David (@gardner48),
We've done a lot of work in a construct called "Kernel" which we're testing these kinds of things in. I'll ping @ajkunen and @jonesholger, as they've done the most with it, but I believe that's where a lot of our reduction-array work is happening.
Hi @gardner48, In addition to what @DavidPoliakoff said, I would also check out the RAJA atomics, which may be used to accomplish the same goal. Example 10 illustrates a binning example.
Thanks David (@DavidPoliakoff) and Arturo (@artv3).
David, is the "Kernel" construct something that's in the current version of RAJA or is it in development/testing branch for a future release?
Arturo, are you suggesting to use an atomicAdd for each element in the dot products or is there a way to get partial reductions from each thread block and then use an atomicAdd to update the global value?
Hey David (@gardner48),
I was thinking of using atomic adds on `dotprods` (i.e., `RAJA::atomic::atomicAdd`).
@gardner48 if you want to use the 'kernel' stuff, I suggest using the 'release/0.6.0rc3' branch. We are preparing this for an upcoming release with applications using it in anger now. There is some basic documentation at https://readthedocs.org/projects/raja/ (go to the release-0.6.0rc3 version) and I'm working on filling in more.
Thanks @artv3, that's what I thought you meant. In my case `i` is much larger than `j`, and `id = j`. So, for now at least, I'm guessing it's better to compute the dot products separately using a scalar reduction variable, and I'll try out the 'kernel' stuff from the 0.6.0rc3 branch @rhornung67 pointed to.
I have data in a 3-D array indexed by time and two spatial dimensions. When running, I need to periodically report data reduced from this 3-D array. The problem definition can specify an arbitrary number of reference regions where separate statistics need to be calculated. Each reference region defines a subset of the spatial domain where the end user wants a separate set of summary statistics calculated. The number of reference regions is not known at compile time, so I would like to make an array of ReduceSum<> variables at runtime. This doesn't seem possible because ReduceSum has an explicit constructor, and I can't figure out a way to make a container (array, vector, ...) of ReduceSum variables.