Closed Reneg973 closed 8 years ago
One can use sort
and reduce_by_key
algorithms to compute a histogram. Something like this:
#include <iostream>
#include <vexcl/vexcl.hpp>
int main() {
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
std::cout << ctx << std::endl;
size_t n = 1024 * 1024;
vex::vector<double> x(ctx, n);
x = vex::RandomNormal<double>()(vex::element_index(), 0);
vex::vector<int> ibin(ctx, n);
vex::vector<int> ones(ctx, n);
ibin = x / 0.1;
ones = 1;
vex::vector<int> obin;
vex::vector<int> hist;
vex::sort(ibin);
vex::reduce_by_key(ibin, ones, obin, hist);
std::cout << obin << std::endl;
std::cout << hist << std::endl;
}
This is probably not the most efficient way to do this, so patches are welcome :).
Another option (about three times faster on my W9100) is using a custom kernel with atomic operations:
#include <iostream>
#include <vexcl/vexcl.hpp>
int main() {
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
std::cout << ctx << std::endl;
cl_ulong n = 1024 * 1024;
vex::RandomNormal<double> rnd;
vex::Reductor<double, vex::MIN_MAX> minmax(ctx);
vex::vector<double> x(ctx, n);
x = rnd(vex::element_index(), 0);
cl_double2 bnd = minmax(x);
int nbins = static_cast<int>((bnd.s[1] - bnd.s[0]) / 0.1) + 1;
vex::vector<int> hist(ctx, nbins);
hist = 0;
vex::backend::kernel get_hist(ctx.queue(0), VEX_STRINGIZE_SOURCE(
kernel void get_hist(
ulong n, double xmin, global double *x, global int *hist
)
{
for(ulong i = get_global_id(0); i < n; i += get_global_size(0)) {
int bin = (int)((x[i] - xmin) / 0.1);
atomic_add(hist + bin, 1);
}
}
), "get_hist");
get_hist(ctx.queue(0), n, bnd.s[0], x(0), hist(0));
std::cout << hist << std::endl;
}
Unfortunately also not the best way for many samples on very small histogram. I'm thinking about a calculation which first calculates the partial histograms locally or per wavefront, reduce them into global memory which has num_workgroup histograms, and finally merge these histograms in a second kernel. But currently I need to go deeper into OpenCL architecture to find a generic way to do this.
If you are able to fit the histogram into local (shared) memory, then you could also build partial histograms in each workgroup's local array then merge the local arrays into global one with atomic operations. This could be done in a single kernel, and so has a chance to be more effective than two kernels.
Yes, in most cases an 8bit histogram is enough, means 1kB local Memory. But merging all these into global I thought I need a different Input range. That's why I suggested 2 kernels. I'm currently helpless how to join These both algorithm's ranges.
Here is an example of single kernel using local memory. Its another three times faster than the previous attempt:
#include <iostream>
#include <vexcl/vexcl.hpp>
int main() {
vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
std::cout << ctx << std::endl;
cl_ulong n = 1024 * 1024;
vex::RandomNormal<double> rnd;
vex::Reductor<double, vex::MIN_MAX> minmax(ctx);
vex::vector<double> x(ctx, n);
x = rnd(vex::element_index(), 0);
cl_double2 bnd = minmax(x);
double xmin = bnd.s[0];
double xmax = bnd.s[1];
int nbins = 16;
double dx = (xmax - xmin) / nbins;
vex::vector<int> hist(ctx, nbins);
hist = 0;
vex::backend::kernel get_hist(ctx.queue(0), VEX_STRINGIZE_SOURCE(
kernel void get_hist(
ulong n, double xmin, double dx, global double *x,
int nbins, global int *ghist, local int *lhist
)
{
// Init local histogram.
for(int i = get_local_id(0); i < nbins; i += get_local_size(0))
lhist[i] = 0;
barrier(CLK_LOCAL_MEM_FENCE);
// Fill local histogram.
for(ulong i = get_global_id(0); i < n; i += get_global_size(0)) {
int bin = (int)((x[i] - xmin) / dx);
if (bin >= 0 && bin < nbins)
atomic_add(lhist + bin, 1);
}
barrier(CLK_LOCAL_MEM_FENCE);
// Update global histogram.
for(int i = get_local_id(0); i < nbins; i += get_local_size(0))
atomic_add(ghist + i, lhist[i]);
}
), "get_hist", [=](size_t wg_size){ return nbins * sizeof(int); }
);
get_hist(ctx.queue(0), n, xmin, dx, x(0), nbins, hist(0), cl::LocalSpaceArg{nbins * sizeof(int)});
std::cout << hist << std::endl;
}
Well, works fine. And do you know why the example Code for histograms (see Apple or NVidia) always uses 2 steps?
I never saw those examples, sorry.
What about a histogram calculation in vexcl? Would be a nice enhancement.