Optimizing the communication patterns

This PR brings the following improvements:

[optimization] the communication that follows the dtrsm steps (the 4th and 5th steps) is reimplemented using MPI_Scatterv instead of MPI_Isend and MPI_Irecv.
[optimization] communication of pivot rows, which previously used MPI_Put, is reimplemented in two ways:
  1) Using collectives (MPI_Gatherv). The main challenge is that the root rank does not know how many pivot rows it will receive from each rank, so we first do an additional MPI_Gather of the number of pivot rows each rank holds. Another problem is that curPivotOrder[i] gives the position of the i-th pivot row on the target rank, not on the source rank. To resolve this, we prepend curPivotOrder[i] to the i-th row before sending, so that the receiver knows where to copy the row upon arrival.
  2) Using MPI_Isend/MPI_Irecv. This approach overlaps receiving the data with unpacking it (copying each row to its final position). However, a complete overlap requires a temporary local buffer of size Px*v*Nl, which is reasonable for small values of v but infeasible when v=M/Px. If this buffer would require too much memory, we fall back to the collectives implementation from 1).
[optimization] each communication now uses a specialized communicator to reduce unnecessary synchronization. For example, the communication after the tournament pivoting occurs only within layrK, so there is no need to use the full communicator lu_comm.
[optimization] added and optimized some OpenMP regions.
[bugfix] fixed the operator precedence of & and && in the tournament_rounds function.
[bugfix] fixed the profiling regions.
[smallfix] fixed the compilation errors with the Score-P compiler.
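
To illustrate the collectives variant of the pivot-row exchange described above, here is a minimal sketch that simulates the two steps (gathering the per-rank sizes, then gathering the rows) with plain vectors instead of real MPI calls. The helper names (pack_rows, displacements, unpack_rows, simulate_pivot_gather), the three-rank layout, and the choice to store the row index as a double in front of each row are assumptions made for illustration, not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;

// Sender side: prepend the target row index (curPivotOrder[i] in the PR text)
// to each pivot row. Storing the index as a double is an assumption; it is
// exact for any realistic row index.
std::vector<double> pack_rows(const std::vector<int>& target_rows,
                              const std::vector<Row>& rows) {
    assert(target_rows.size() == rows.size());
    std::vector<double> buf;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        buf.push_back(static_cast<double>(target_rows[i]));  // header
        buf.insert(buf.end(), rows[i].begin(), rows[i].end());
    }
    return buf;
}

// Root side: turn per-rank counts into displacements (exclusive prefix sum),
// which is the layout MPI_Gatherv expects in its recvcounts/displs arguments.
std::vector<int> displacements(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    for (std::size_t r = 1; r < counts.size(); ++r)
        displs[r] = displs[r - 1] + counts[r - 1];
    return displs;
}

// Root side: read each header and copy the row to the position it names.
void unpack_rows(const std::vector<double>& recv, std::size_t row_len,
                 std::vector<Row>& dest) {
    const std::size_t stride = row_len + 1;  // header + row values
    assert(recv.size() % stride == 0);
    for (std::size_t off = 0; off < recv.size(); off += stride) {
        const std::size_t target = static_cast<std::size_t>(recv[off]);
        assert(target < dest.size());
        dest[target].assign(recv.data() + off + 1, recv.data() + off + stride);
    }
}

// Simulates three ranks: rank 0 holds two pivot rows, rank 1 none, rank 2 one.
std::vector<Row> simulate_pivot_gather() {
    const std::size_t row_len = 2;
    const std::vector<int> order0 = {2, 0};
    const std::vector<Row> rows0 = {Row{20.0, 21.0}, Row{0.0, 1.0}};
    const std::vector<int> order1;
    const std::vector<Row> rows1;
    const std::vector<int> order2 = {1};
    const std::vector<Row> rows2 = {Row{10.0, 11.0}};
    const std::vector<std::vector<double>> send = {
        pack_rows(order0, rows0), pack_rows(order1, rows1),
        pack_rows(order2, rows2)};

    // Step 1 (stands in for MPI_Gather): the root learns each rank's size.
    std::vector<int> counts;
    for (const auto& s : send) counts.push_back(static_cast<int>(s.size()));
    const std::vector<int> displs = displacements(counts);

    // Step 2 (stands in for MPI_Gatherv): concatenate at the displacements.
    std::vector<double> recv(
        static_cast<std::size_t>(displs.back() + counts.back()));
    for (std::size_t r = 0; r < send.size(); ++r)
        for (std::size_t k = 0; k < send[r].size(); ++k)
            recv[static_cast<std::size_t>(displs[r]) + k] = send[r][k];

    std::vector<Row> dest(3, Row(row_len, -1.0));
    unpack_rows(recv, row_len, dest);
    return dest;
}
```

Because every row carries its own destination index, the root does not need to know which source rank a row came from or in what order rows were packed; a rank that holds no pivot rows simply contributes a count of zero.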

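The fallback between the two pivot-row implementations can be sketched as a simple size check on the Px*v*Nl buffer. The function names, the enum, the memory budget parameter, and the assumption that elements are doubles are all hypothetical; the PR text does not state how the threshold is chosen.

```cpp
#include <cassert>
#include <cstdint>

enum class PivotExchange { PointToPoint, Collectives };

// Size in bytes of the temporary buffer needed for full overlap of
// receiving and unpacking: Px * v * Nl elements (assumed to be doubles).
std::uint64_t overlap_buffer_bytes(std::uint64_t Px, std::uint64_t v,
                                   std::uint64_t Nl) {
    return Px * v * Nl * sizeof(double);
}

// Use MPI_Isend/MPI_Irecv only if the buffer fits in the (hypothetical)
// memory budget; otherwise fall back to the MPI_Gatherv-based variant.
PivotExchange choose_pivot_exchange(std::uint64_t Px, std::uint64_t v,
                                    std::uint64_t Nl,
                                    std::uint64_t budget_bytes) {
    assert(Px > 0 && v > 0 && Nl > 0);
    return overlap_buffer_bytes(Px, v, Nl) <= budget_bytes
               ? PivotExchange::PointToPoint
               : PivotExchange::Collectives;
}
```

For example, with Px=4, v=32 and Nl=1024 the buffer is 1 MiB, which fits comfortably in a 64 MiB budget. With v=M/Px for M=131072 and Nl=32768 it grows to 32 GiB, which is the infeasible case the description mentions.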
eth-cscs / conflux

Optimizing the communication patterns #22