ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License
702 stars 82 forks source link

Histogram the vexcl way #194

Closed Reneg973 closed 8 years ago

Reneg973 commented 8 years ago

What about a histogram calculation in vexcl? Would be a nice enhancement.

ddemidov commented 8 years ago

One can use sort and reduce_by_key algorithms to compute a histogram. Something like this:

#include <iostream>
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
    std::cout << ctx << std::endl;

    size_t n = 1024 * 1024;

    vex::vector<double> x(ctx, n);
    x = vex::RandomNormal<double>()(vex::element_index(), 0);

    vex::vector<int> ibin(ctx, n);
    vex::vector<int> ones(ctx, n);

    ibin = x / 0.1;
    ones = 1;

    vex::vector<int> obin;
    vex::vector<int> hist;

    vex::sort(ibin);
    vex::reduce_by_key(ibin, ones, obin, hist);

    std::cout << obin << std::endl;
    std::cout << hist << std::endl;
}

This is probably not the most efficient way to do this, so patches are welcome :).

ddemidov commented 8 years ago

Another option (about three times faster on my W9100) is using a custom kernel with atomic operations:

#include <iostream>
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
    std::cout << ctx << std::endl;

    cl_ulong n = 1024 * 1024;

    vex::RandomNormal<double> rnd;
    vex::Reductor<double, vex::MIN_MAX> minmax(ctx);

    vex::vector<double> x(ctx, n);
    x = rnd(vex::element_index(), 0);

    cl_double2 bnd = minmax(x);

    int nbins = static_cast<int>((bnd.s[1] - bnd.s[0]) / 0.1) + 1;

    vex::vector<int> hist(ctx, nbins);
    hist = 0;

    vex::backend::kernel get_hist(ctx.queue(0), VEX_STRINGIZE_SOURCE(
                kernel void get_hist(
                    ulong n, double xmin, global double *x, global int *hist
                    )
                {
                    for(ulong i = get_global_id(0); i < n; i += get_global_size(0)) {
                        int bin = (int)((x[i] - xmin) / 0.1);
                        atomic_add(hist + bin, 1);
                    }
                }
                ), "get_hist");

    get_hist(ctx.queue(0), n, bnd.s[0], x(0), hist(0));

    std::cout << hist << std::endl;
}
Reneg973 commented 8 years ago

Unfortunately also not the best way for many samples on very small histogram. I'm thinking about a calculation which first calculates the partial histograms locally or per wavefront, reduce them into global memory which has num_workgroup histograms, and finally merge these histograms in a second kernel. But currently I need to go deeper into OpenCL architecture to find a generic way to do this.

ddemidov commented 8 years ago

If you are able to fit the histogram into local (shared) memory, then you could also build partial histograms in each workgroup's local array then merge the local arrays into global one with atomic operations. This could be done in a single kernel, and so has a chance to be more effective than two kernels.

Reneg973 commented 8 years ago

Yes, in most cases an 8bit histogram is enough, means 1kB local Memory. But merging all these into global I thought I need a different Input range. That's why I suggested 2 kernels. I'm currently helpless how to join These both algorithm's ranges.

ddemidov commented 8 years ago

Here is an example of single kernel using local memory. Its another three times faster than the previous attempt:

#include <iostream>
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Env && vex::Filter::Count(1));
    std::cout << ctx << std::endl;

    cl_ulong n = 1024 * 1024;

    vex::RandomNormal<double> rnd;
    vex::Reductor<double, vex::MIN_MAX> minmax(ctx);

    vex::vector<double> x(ctx, n);
    x = rnd(vex::element_index(), 0);

    cl_double2 bnd = minmax(x);

    double xmin = bnd.s[0];
    double xmax = bnd.s[1];

    int nbins = 16;
    double dx = (xmax - xmin) / nbins;

    vex::vector<int> hist(ctx, nbins);
    hist = 0;

    vex::backend::kernel get_hist(ctx.queue(0), VEX_STRINGIZE_SOURCE(
                kernel void get_hist(
                    ulong n, double xmin, double dx, global double *x,
                    int nbins, global int *ghist, local int *lhist
                    )
                {
                    // Init local histogram.
                    for(int i = get_local_id(0); i < nbins; i += get_local_size(0))
                        lhist[i] = 0;

                    barrier(CLK_LOCAL_MEM_FENCE);

                    // Fill local histogram.
                    for(ulong i = get_global_id(0); i < n; i += get_global_size(0)) {
                        int bin = (int)((x[i] - xmin) / dx);
                        if (bin >= 0 && bin < nbins)
                            atomic_add(lhist + bin, 1);
                    }

                    barrier(CLK_LOCAL_MEM_FENCE);

                    // Update global histogram.
                    for(int i = get_local_id(0); i < nbins; i += get_local_size(0))
                        atomic_add(ghist + i, lhist[i]);
                }
                ), "get_hist", [=](size_t wg_size){ return nbins * sizeof(int); }
                );

    get_hist(ctx.queue(0), n, xmin, dx, x(0), nbins, hist(0), cl::LocalSpaceArg{nbins * sizeof(int)});

    std::cout << hist << std::endl;
}
Reneg973 commented 8 years ago

Well, works fine. And do you know why the example Code for histograms (see Apple or NVidia) always uses 2 steps?

ddemidov commented 8 years ago

I never saw those examples, sorry.