STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Avoid fork/join parallelism in implementations of N3989 #1185

Open sithhell opened 10 years ago

sithhell commented 10 years ago

While the algorithms implemented in #1141 are a great start and will lead to easier user programs and a better API for programming with HPX, I think the current implementation lacks support for fine-grained data-dependency control. Almost all algorithms perform some form of barrier (even if the barrier is "asynchronous" in the form of dataflow) and are more or less implemented in a fork/join manner.

I think this needs to be carefully analyzed once the algorithms are implemented: HPX aims to avoid fork/join-style parallelism and barriers, so an implementation removing those barriers needs to be provided.

One barrier can be observed here: https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/parallel/util/partitioner.hpp#L208

K-ballo commented 10 years ago

It should be possible to use when_each on the workitems so that step 2 is done piecewise, as each individual future becomes ready.
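
A minimal sketch of that idea (not the partitioner's actual code), assuming a when_each overload that takes a callback plus a vector of futures and invokes the callback with each future as it becomes ready:

#include <hpx/include/lcos.hpp>    // hpx::future, hpx::when_each (header choice is an assumption)
#include <utility>
#include <vector>

// Instead of joining all workitems before running step 2, attach step 2
// piecewise: the callback fires for every workitem as soon as it is ready.
template <typename R, typename Step2>
auto finish_piecewise(std::vector<hpx::future<R>>&& workitems, Step2 step2)
{
    // assumed behaviour: when_each invokes the callback once per workitem as
    // each becomes ready and returns a future signalling that all callbacks
    // have run
    return hpx::when_each(
        [step2](auto&& f) { step2(f.get()); },    // process this partition now
        std::move(workitems));
}

The partitioner's vector of workitem futures could then be handed to such a helper instead of being joined in one go.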

hkaiser commented 10 years ago

It will not be possible to completely avoid the joining, as this is part of the semantics of the algorithms.

Besides, the parallel algorithms open a perfect migration path from classic OMP loops. Even replacing OMP parallel for loops with parallel::for_each is beneficial, as this alone removes the resource stalling inherent to OMP loops.
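
To make that migration path concrete, a minimal sketch (not code from this thread) of an OMP-style loop next to its rough parallel::for_each counterpart, using the interface spelling used elsewhere in this issue; the header path is an assumption:

#include <hpx/include/parallel_for_each.hpp>    // header path is an assumption
#include <iterator>
#include <vector>

void scale(std::vector<double>& d)
{
    // classic OMP version:
    //   #pragma omp parallel for
    //   for (std::size_t i = 0; i != d.size(); ++i) d[i] *= 2.0;

    // HPX version: worker threads that finish their share early can pick up
    // other HPX work instead of stalling at the loop's implicit barrier
    hpx::parallel::for_each(hpx::parallel::par,
        std::begin(d), std::end(d),
        [](double& x) { x *= 2.0; });
}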

Overall, I think it is not worth changing the interface away from the Standards proposal in order to expose the array of futures representing the parallel partitions to the user. It's always possible to fall back to raw HPX if this kind of fine control is needed.

hkaiser commented 10 years ago

heller: hkaiser: I fully disagree with your answer #1185. I don't like the general tenor of this answer and your proposed resolution. But we can discuss that tomorrow or so.
hkaiser: what is wrong with my arguments?
heller: hkaiser: fork/join always adds unnecessary overhead. The argument that only one core is involved is not quite correct (when_all uses atomics for a reason)
hkaiser: I said one core is used
hkaiser: I don't disagree that fork/join adds overhead, so does attaching continuations etc.
heller: hkaiser: you know, killing fork/join parallelism used to be one of the killer arguments for hpx
hkaiser: we were planning to change the implementation of when_all to be like dataflow, which would remove the atomic, btw
hkaiser: I disagree, fork/join is very strict which allows additional optimizations, but that's not the point here
heller: hkaiser: the proposal was written by people who still believe in fork/join. I personally do not. I believe fork/join will never scale due to the serial portion at the start and the end. And it is IMHO naive to believe that those serial portions go away because things are done asynchronously.
hkaiser: heller: I agree, it's still a step in the right direction as it makes HPX easier to use, which helps adoption
heller: hkaiser: it could also be a disadvantage to us.
heller: Remember what we discussed with Adrian.
hkaiser: HPX is too difficult to use as it is
heller: What makes hpx different?
heller: That's true. And I'd love to see a mechanism that makes it easier to use those techniques we developed for stencil codes.
heller: If we use the parallel algorithms, there won't be any difference to an MPI+omp code except syntax
hkaiser: that's not true
heller: OK. Prove me wrong
hkaiser: it's worse than plain HPX, but better than conventional OMP
heller: I seriously doubt that
hkaiser: as I said, cores do not stall in the end
heller: I don't see that
hkaiser: why not?
hkaiser: the main problem with an OMP loop is not the join, but the stalling of resources towards the end of the loop
heller: Because you model too coarse-grained data dependencies
hkaiser: that does not happen with HPX
hkaiser: no
heller: That's not 100% true
hkaiser: work is still as fine-grained as you want
hkaiser: but 99%
heller: While the work is fine-grained, you still need everything to finish. Same as with omp. Where is the gain?
hkaiser: resources don't stall
hkaiser: cores are used for other things when not needed anymore
hkaiser: doesn't happen in OMP
heller: Which things?
hkaiser: other parallel work
heller: For example?
hkaiser: whatever you have in your code
heller: Which I don't, because I need the result. Not all of it, it's fine if some points plus their neighbors are ready
heller: I still need to wait for everything. Now what?
hkaiser: that does not change if you don't join, you still need to do the work before you can go on
heller: Sure. But I avoid the join, which we showed multiple times is significant.
hkaiser: I agree that if you have no other work the loops don't give you any benefit, though
heller: Especially at a high core count
heller: hkaiser: and that's the case for 80% or more of the codes out there.
hkaiser: heller: I disagree
heller: With what do you disagree?
hkaiser: if you think about it there is other work almost all of the time
hkaiser: be it network, or other auxiliary stuff
heller: Which only happens because we do fine-grained, constraint-based task graphs
hkaiser: heller: what do you suggest, not to implement the standards proposal?
heller: My personality is split...
heller: On the one hand this proposal is great. It allows existing code to be easily adapted/migrated and parallelized
heller: On the other hand, it's exactly that. Nothing more. My fear is that we just get stuck with the old techniques again.
heller: Which would be a shame considering the possibilities we have
hkaiser: heller: nobody forces you to use these algorithms where not appropriate
heller: And in addition, if we go the "one hand" route, the obvious advantages of why to use hpx start to vanish
hkaiser: come on, if we don't provide usable facilities NOBODY will use it
heller: hkaiser: that's the point. I think they are never appropriate :p
heller: hkaiser: but that's only my view... might very well be wrong
hkaiser: 95% of all C++ programmers are not skilled enough to use HPX properly
heller: Absolutely...
heller: But that won't change with the parallel algorithms
Zao: hkaiser: Get it into a popular software package and you don't need them to.
Zao: Or "codes" as the researchers insist on calling them.
heller: Well, it will probably decrease to 90 or 80 percent
heller: hkaiser: I know that my points are weak because I don't have an alternative. However, I think our focus should rather be on finding those alternatives than hunting the past
heller: hkaiser: but sure, a migration path is needed as well.
heller: That doesn't mean I need to like it though :p
heller: I don't think those are in the spirit of hpx as I got to know it
hkaiser: I think the algorithms are very useful in combination with futurization
hkaiser: heller: when we use them not for the actual work, but for building the execution tree in parallel
heller: hkaiser: ok, let's experiment with that

hkaiser commented 10 years ago

I think we can close this ticket, as the facilities implemented by the parallel algorithms (#1141) are completely sufficient to build higher-level tools which take advantage of more fine-grained data-dependency control.

Let's say we have two loops which have a strict sequential requirement (you can execute an iteration in loop 2 only after the same iteration in loop 1 is done):

std::vector<double> d = { ... };
std::for_each(std::begin(d), std::end(d), [](double& d) { d = d+1; });
std::for_each(std::begin(d), std::end(d), [](double& d) { d = 2*d; });

Then this can be parallelized and (partially) overlapped by doing something like:

// some large array of values
std::vector<double> d = { ... };

// calculate grain size, any other criteria than number of cores works as well
std::size_t cores = hpx::get_os_thread_count();
std::size_t chunk_size = (d.size() + cores - 1) / cores;
std::size_t count = d.size();

std::vector<hpx::future<void> > chunks;
for (std::size_t p = 0; p < count; p += chunk_size)
{
    std::vector<double>::iterator s = std::begin(d);
    std::advance(s, p);

    std::vector<double>::iterator e = std::begin(d);
    std::advance(e, std::min(d.size(), p+chunk_size));

    hpx::future<void> f =
        hpx::parallel::for_each(hpx::parallel::task, s, e,
            [](double& d) { d = d+1; });

    chunks.push_back(f.then(
        hpx::util::unwrapped([=]()
        {
            hpx::parallel::for_each(hpx::parallel::par, s, e,
                [](double& d) { d = 2*d; });
        })));
}

// join or build other, additional dependencies...
hpx::wait_all(chunks);

I'd argue that because of the differences between each of the possible use cases (e.g. which algorithms to chain, what inter-loop dependencies are in place, etc.) it would be very difficult to build a more generic solution for the example above.

At the same time, the existing parallel algorithms provide sufficiently fine-grained building blocks to build arbitrary (overlapping) execution sequences while maintaining full control over the grain size used.

hkaiser commented 10 years ago

Thinking about this some more, the code above can be simplified to:

std::vector<double> d = { ... };
std::vector<hpx::future<void> > chunks;

typedef std::vector<double>::iterator iterator;
hpx::parallel::partition_range(std::begin(d), std::end(d),
    [&chunks](iterator b, iterator e)
    {
        hpx::future<void> f =
            hpx::parallel::for_each(hpx::parallel::task, b, e,
                [](double& d) { d = d+1; });

        chunks.push_back(f.then(
            hpx::util::unwrapped([=]()
            {
                hpx::parallel::for_each(hpx::parallel::par, b, e,
                    [](double& d) { d = 2*d; });
            })));
    });

// join or build other, additional dependencies...
hpx::wait_all(chunks);

Thus, a possible resolution for this ticket could be to implement some partitioners (as discussed before).
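
For illustration, the sketched partition_range could be as simple as the following (a hypothetical helper, not an existing HPX facility): it only splits the range into roughly one chunk per core and hands each chunk to the callback, which decides how to process it.

#include <algorithm>
#include <cstddef>
#include <iterator>

#include <hpx/hpx.hpp>    // hpx::get_os_thread_count (header choice is an assumption)

// Hypothetical partition_range in the spirit of the code above: split
// [first, last) into chunks and invoke f(begin, end) synchronously per chunk.
template <typename Iterator, typename F>
void partition_range(Iterator first, Iterator last, F&& f,
    std::size_t chunks = hpx::get_os_thread_count())
{
    std::size_t const size = std::distance(first, last);
    std::size_t const chunk_size = (size + chunks - 1) / chunks;

    while (first != last)
    {
        Iterator end = first;
        std::advance(end, (std::min)(chunk_size,
            static_cast<std::size_t>(std::distance(first, last))));

        f(first, end);    // the callback launches whatever work it needs
        first = end;
    }
}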

sithhell commented 10 years ago

I am not convinced. While the suggested solutions work and are indeed somewhat elegant, they do not need to call the parallel algorithms in the first place. Calling std::for_each instead of hpx::parallel::for_each is just as feasible and leads to the same amount of code and similar complexity (https://gist.github.com/sithhell/365481b3055bf43f72f9).
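
A sketch along these lines illustrates the point (the linked gist is not reproduced here): the chunking and chaining code is essentially the same, only the per-chunk work uses std::for_each launched via hpx::async.

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

#include <hpx/hpx.hpp>    // hpx::async, hpx::future, hpx::wait_all (header choice is an assumption)

void two_phase(std::vector<double>& d, std::size_t chunk_size)
{
    std::vector<hpx::future<void>> chunks;
    for (std::size_t p = 0; p < d.size(); p += chunk_size)
    {
        auto s = std::begin(d) + p;
        auto e = std::begin(d) + (std::min)(d.size(), p + chunk_size);

        // phase 1 on this chunk, phase 2 chained as a continuation
        chunks.push_back(
            hpx::async([s, e] { std::for_each(s, e, [](double& x) { x = x + 1; }); })
                .then([s, e](hpx::future<void>) {
                    std::for_each(s, e, [](double& x) { x = 2 * x; });
                }));
    }
    hpx::wait_all(chunks);
}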

I think the goal of the parallel algorithms, and the intent of this ticket, should be to help ease the programming of such code constructs. While I agree that the sketched partition_range does help here, it doesn't necessarily need the algorithms defined in hpx::parallel.

Ideally, I think something like this would be nice:

std::vector<double> d = { ... };

// returns an object which holds something like a vector of pairs containing:
//    1. a future: signalling the completion of one partition
//    2. a range: marking the begin and end of the partition
auto deps =
    hpx::parallel::for_each(hpx::parallel::task, std::begin(d), std::end(d),
        [](double& d) { d = d+1; });

// for_each might take an additional argument for dependencies, where the
// type must match the passed iterators. The passed functor is applied to
// each partition only when the corresponding dependency is ready.
hpx::parallel::for_each(hpx::parallel::par, std::begin(d), std::end(d),
   [](double& d) { d = 2*d; }, deps);

This particular example might even be rewritten as follows:

std::vector<double> d = { ... };

hpx::parallel::for_each(hpx::parallel::par, std::begin(d), std::end(d)
 , [](double& d) { d = d+1; }
 , [](double& d) { d = 2*d; }
);

This would result in an efficient continuation-style chaining of the passed functors, avoiding another "fork"-style call.
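
A minimal sketch of what such a two-functor overload could do internally (hypothetical, not an existing interface): each partition runs both functors back to back as a single task, so no second fork/join round trip is needed.

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

#include <hpx/hpx.hpp>    // hpx::async, hpx::future (header choice is an assumption)

// Hypothetical internals of a chained for_each: both functors are applied to
// each partition inside one task, avoiding a global join between the "loops".
template <typename Iterator, typename F1, typename F2>
std::vector<hpx::future<void>> chained_for_each(
    Iterator first, Iterator last, F1 f1, F2 f2, std::size_t chunk_size)
{
    std::vector<hpx::future<void>> partitions;
    while (first != last)
    {
        Iterator end = first;
        std::advance(end, (std::min)(chunk_size,
            static_cast<std::size_t>(std::distance(first, last))));

        partitions.push_back(hpx::async([first, end, f1, f2]
        {
            std::for_each(first, end, f1);    // "loop 1" over this partition
            std::for_each(first, end, f2);    // continue directly with "loop 2"
        }));
        first = end;
    }
    return partitions;    // callers can attach further dependencies or wait_all
}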

hkaiser commented 9 years ago

Moving this to the next milestone

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed. Please re-open if necessary.

biddisco commented 5 years ago

This issue looks interesting - should it be kept alive?

hkaiser commented 5 years ago

Sure, let's re-open it...