HPCE / hpce-2017-cw6


OpenMP VS TBB #16

Closed natoucs closed 6 years ago

natoucs commented 6 years ago

I just discovered OpenMP.

https://software.intel.com/en-us/intel-threading-building-blocks-openmp-or-native-threads

Out of curiosity, are there any scenarios where it would be favourable to use it instead of TBB?

m8pple commented 6 years ago

One argument for OpenMP used to be that it "looked" better than TBB: the code could keep the same structure as the sequential code. In comparison, with TBB you used to have to pull all the loop bodies out into function objects, which made the code much harder to write and maintain (see the ParallelAverage example). This made OpenMP "easier" to use, so it was quite popular with "computational-X" people, such as computational chemists and computational physicists - helped too by the fact that OpenMP works in Fortran as well.

However, since C++11 gave us lambdas, that argument has mostly gone away in the C/C++ world, as you can usually turn a sequential C for loop directly into a parallel_for loop that looks pretty much the same.

A remaining advantage of OpenMP is that (in principle) the compiler is able to optimise the code better. To a human, a TBB parallel_for loop looks like:

int z=...;
parallel_for(0, 100, [&](int i){
   x[i] = f[z] * y[i];
});

so we might expect the compiler to automatically hoist the load of f[z] out, into:

int z=...;
int tmp=f[z];
parallel_for(0, 100, [&](int i){
   x[i] = tmp * y[i];
});

But the compiler sees it as:

int z=...;
auto _lambda = [&](int i){
   x[i] = f[z]*y[i];
}; 

some_crazy_library_function_no_idea_what_it_does(0, 100, _lambda);

Because the compiler doesn't know what TBB is going to do with the lambda, it finds it much more difficult to prove that f[z] can safely be pre-calculated.

In comparison, an OpenMP compiler has much more visibility: it knows exactly how parallelism will be introduced at compile-time, so it will find the same optimisation much easier.

TBB has the same problem from the other side: the library has no idea that a parallel_for is going to happen until the moment the tasks get created at run-time. In comparison, OpenMP is able to do static analysis of the program, and knows at compile-time that the parallel_for is going to happen, so in principle it can tune the grain-size/agglomeration at compile-time.

Apart from potential compile-time knowledge about parallelism, TBB is more capable in most ways, and has more parallel design patterns (though recent versions of OpenMP have added more). It also requires no special compiler support, which is really nice.

A lot of it comes down to personal preference and history though - physicists like OpenMP because they like for-loops; computer scientists are more likely to prefer TBB because they handle less regular problems that aren't just for loops.