LLNL / RAJA

RAJA Performance Portability Layer (C++)
BSD 3-Clause "New" or "Revised" License

Use case: container/range reductions #356

wrwilliams opened this issue 6 years ago

wrwilliams commented 6 years ago

I'm looking at using RAJA within Dyninst as a quick way to test and migrate among various forms of parallelism (cilk, openmp, tbb are the three of interest right now, as well as retaining serial capabilities). One of the things I want to parallelize is of the general form:

Worklist current, next;   // "new" is a reserved word in C++
while (!current.empty()) {
  for (auto &x : current) {
    Worklist tmp = do_work(x);
    next += tmp;
  }
  swap(next, current);
  next.clear();   // otherwise already-processed items accumulate
}

where += is a set union on Worklists. This leads to a natural forall-reduce idiom if reductions support anything with an operator+= that takes responsibility for its own thread safety; however, I've not been able to hack out an outside-RAJA extension that works.

Two questions:

1) Is this something that should work without messing with RAJA internals, or am I reading things right that any Reducer is implicitly scalar-only, since there's no way to create a constexpr representing an empty STL container?

2) If I coded up an extension that allows Reducers to use insert(begin, end) as a sum operation, is that something you guys would be interested in adding? My general thinking is that SFINAE should allow a user to specify reducers that only work with certain policies, as long as there are at least two policies that work (a rough sketch follows below).
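To make the SFINAE idea concrete, here is a minimal sketch of the kind of constraint I mean. None of these names are real RAJA internals; the policy tags and the enable_if hook are hypothetical.

#include <set>
#include <type_traits>

// Hypothetical policy tags standing in for RAJA execution policies.
struct seq_reduce_tag {};
struct omp_reduce_tag {};

template <typename Policy> struct is_host_policy : std::false_type {};
template <> struct is_host_policy<seq_reduce_tag> : std::true_type {};
template <> struct is_host_policy<omp_reduce_tag> : std::true_type {};

// A container reducer whose operator+= only participates in overload
// resolution for host-side policies, where an STL container is usable.
template <typename Policy, typename Container>
struct ContainerReducer
{
    Container value;

    template <typename P = Policy,
              typename std::enable_if<is_host_policy<P>::value, int>::type = 0>
    ContainerReducer &operator+=(const Container &c)
    {
        value.insert(c.begin(), c.end());   // set union as the "sum"
        return *this;
    }
};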

rhornung67 commented 6 years ago

@wrwilliams this is definitely something we are interested in supporting. I believe what you are trying to do was possible at some point. CPU execution, including OpenMP, should be straightforward. Making this work on a GPU with CUDA will be more difficult. We recently refactored the reduction internals and we may have introduced a constraint that we did not intend. We have to dig into this further. Do you have a simple example code that illustrates what you are attempting that we could look at?

@trws do you have any thoughts about this?

wrwilliams commented 6 years ago

@rhornung67 Well, after some more work I was able to get something to compile legally, though it wasn't entirely straightforward. Problems:

1) RAJA doesn't respect the notion of compatible types via operator+ in its definition of += for reducers.

2) Sets don't have random-access iterators.

3) I should have been using lambda-by-reference for many of the things I was attempting, rather than lambda-by-value.

Relatively simple snippet of what compiles and works below. If there's a better and more idiomatic way to write this, let me know.

Unrelated note: TBB parallelization correctly spawns one thread per core for me, but OpenMP spawns a single thread under both ICC 17 and GCC 7.2. I'm going to sanity check with a trivial non-RAJA OpenMP test, but is there a known gotcha with respect to RAJA configuration?

struct FrameSet : public std::set<ParseFrame*>
{
    // necessary to allow reducers to initialize with zero
    FrameSet(unsigned int = 0) {}
    // should work for adding a vector as well, but RAJA requires a FrameSet
    template <typename ContainerT>
    FrameSet operator+(const ContainerT& c) const
    {
        FrameSet result(*this);
        result.insert(c.begin(), c.end());
        return result;
    }
};

void
Parser::parse_frames(FrameSet &work, bool recursive)
{
    // using FrameSet throughout for conceptual clarity, except here where we need
    // an explicit vector for random access
    std::vector<ParseFrame*> the_queue(work.begin(), work.end());
    work.clear();
    while(!the_queue.empty())
    {
        std::cout << "Begin iteration, worklist size is "
                  << the_queue.size() << std::endl;   // work was cleared above
        /* Recursive traversal parsing */
        FrameSet all_new_frames;
        RAJA::ReduceSum<RAJA::omp_reduce, FrameSet > reducer(all_new_frames);
        RAJA::forall<RAJA::omp_parallel_for_exec>
        (RAJA::RangeSegment(0, the_queue.size()), [&](RAJA::Index_type i)
        {
            FrameSet new_frames = ProcessOneFrame(the_queue[i], recursive);
            // adding an operator+= that takes a vector<ParseFrame*> also doesn't
            // help the reducer figure out what to do
            reducer += new_frames;
        });
        the_queue.clear();
        all_new_frames = reducer.get();
        std::copy(all_new_frames.begin(), all_new_frames.end(),
            std::back_inserter(the_queue));
    }
...
}

DavidPoliakoff commented 6 years ago

@wrwilliams: I'm 99% sure you aren't targeting CUDA parallelism in Dyninst, so this probably won't impact you, but...

I'd be a little cautious about capture-by-reference if you're hoping to target CUDA parallelism, as CUDA is finicky about requiring everything to be captured by value. Ditto the STL. Anyway, just a caution about something you might inadvertently cut yourself off from.
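For reference, a CUDA-targeted loop has to look roughly like the sketch below. The kernel body, function name, and block size are illustrative, not anything from Dyninst:

#include <cstddef>
#include <cuda_runtime.h>
#include <RAJA/RAJA.hpp>

void double_on_device(std::size_t N)
{
    double *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(double));

    // [=] is required: the device lambda must capture by value, since a
    // by-reference capture would smuggle host addresses into the kernel.
    RAJA::forall<RAJA::cuda_exec<256>>(
        RAJA::RangeSegment(0, static_cast<RAJA::Index_type>(N)),
        [=] RAJA_DEVICE (RAJA::Index_type i) {
            d_data[i] = 2.0 * i;
        });

    cudaFree(d_data);
}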

trws commented 6 years ago

> anything with an operator+= that takes responsibility for its own thread safety

If you mean that the set in question is a thread-safe set, the reducers aren't required: you can use it directly with any host policy we currently have, though you'll probably want to use a pointer or a by-reference lambda.
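For example, something along these lines should work. This is a sketch assuming TBB's concurrent_unordered_set as the thread-safe container; the function and data are made up:

#include <vector>
#include <tbb/concurrent_unordered_set.h>
#include <RAJA/RAJA.hpp>

void collect_doubled(const std::vector<int> &inputs,
                     tbb::concurrent_unordered_set<int> &results)
{
    // The container is thread-safe, so it can be captured by reference
    // and mutated directly from the lambda; no RAJA reducer needed.
    RAJA::forall<RAJA::omp_parallel_for_exec>(
        RAJA::RangeSegment(0, static_cast<RAJA::Index_type>(inputs.size())),
        [&](RAJA::Index_type i) {
            results.insert(inputs[i] * 2);
        });
}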

On 1 Nov 2017, at 13:18, Bill Williams wrote:

> @rhornung67 Well, after some more work I was able to get something to compile legally, though it wasn't entirely straightforward. Problems:

> 1) RAJA doesn't respect the notion of compatible types via operator+ in its definition of += for reducers.

That we can, and really should, fix. An issue on this specifically would be good; cc @willkill and me on it if you would, and I'll make sure something happens to fix it.

> 2) Sets don't have random-access iterators.

At present we don't support anything weaker than random-access iterators, because weaker iterator categories can't be parallelized. If it's sufficiently desirable for serial loops we can look into it, but it hasn't yet been a priority.

> 3) I should have been using lambda-by-reference for many of the things I was attempting, rather than lambda-by-value.

If you want reductions or GPUs to work, then no; otherwise, maybe.

> Relatively simple snippet of what compiles and works below. If there's a better and more idiomatic way to write this, let me know.

> Unrelated note: TBB parallelization correctly spawns one thread per core for me, but OpenMP spawns a single thread under both ICC 17 and GCC 7.2. I'm going to sanity check with a trivial non-RAJA OpenMP test, but is there a known gotcha with respect to RAJA configuration?

There is not. The most likely issue is that you launched with something like srun -n 1, and the Intel and GCC OpenMP runtimes correctly respected the core count requested by Slurm while TBB did not. Otherwise, check OMP_THREAD_LIMIT, OMP_NUM_THREADS, and similar to see what's up.
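A trivial non-RAJA check along these lines will tell you what the OpenMP runtime thinks it has (standard OpenMP API only, nothing RAJA-specific):

#include <cstdio>
#include <omp.h>

int main()
{
    // What the runtime will hand out by default.
    std::printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // Print once from inside an actual parallel region.
        #pragma omp single
        std::printf("threads in region: %d\n", omp_get_num_threads());
    }
    return 0;
}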

The code you have is basically idiomatic at the moment, but since we do support arbitrary random-access iterables, the vector can be passed directly to the forall, and each element will be passed to the lambda, so you don't have to index into it.
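Reusing the names from your snippet above, that would look roughly like this (a sketch of the iterable overload described here; exact signatures may differ by RAJA version):

// The container itself is the iteration space; the lambda receives
// each element instead of an index.
RAJA::forall<RAJA::omp_parallel_for_exec>(the_queue,
    [&](ParseFrame *frame) {
        FrameSet new_frames = ProcessOneFrame(frame, recursive);
        reducer += new_frames;
    });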