This PR brings the following improvements:
[optimization] the communication that follows the dtrsm steps (steps 4 and 5) is now implemented with MPI_Scatterv instead of MPI_Isend and MPI_Irecv.
[optimization] the communication of pivot rows, which previously used MPI_Put, is reimplemented in two ways:
1) using collectives (MPI_Gatherv). The main challenge here is that the root rank does not know how many pivot rows it will receive from each rank, so an additional MPI_Gather of the per-rank pivot row counts is performed first. Another problem was that curPivotOrder[i] gave the position of the i-th pivot row on the target rank and not on the source rank. To resolve this, we prepend curPivotOrder[i] to the i-th row before sending, so that the receiver knows where to copy the row upon arrival.
2) using MPI_Isend/MPI_Irecv. This approach overlaps receiving the data with unpacking it (copying each row to its final position). However, a complete overlap requires a temporary local buffer of size Px*v*Nl, which is reasonable for small values of v but infeasible when v=M/Px. When this buffer would require too much memory, we fall back to the collectives implementation from 1).
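To illustrate the buffer layout behind approach 1), here is a minimal sketch in plain C with the MPI calls left out. It assumes rows of doubles and stores the prepended index as a double so that a single datatype could be used for the whole message. The function names (pack_pivot_rows, gatherv_layout, unpack_pivot_rows) and this index encoding are hypothetical and not taken from the PR.

```c
#include <stddef.h>
#include <string.h>

/* Sender side: pack n_rows rows of length row_len into buf.
 * Each packed row is [target index, row values...], so it occupies
 * row_len + 1 doubles. Storing the index as a double is an assumption. */
void pack_pivot_rows(const double *rows, const int *pivot_order,
                     int n_rows, int row_len, double *buf)
{
    for (int i = 0; i < n_rows; ++i) {
        double *dst = buf + (size_t)i * (size_t)(row_len + 1);
        dst[0] = (double)pivot_order[i];
        memcpy(dst + 1, rows + (size_t)i * (size_t)row_len,
               (size_t)row_len * sizeof(double));
    }
}

/* Root side: convert the per-rank row counts (as obtained from the extra
 * MPI_Gather) into element counts and displacements of the kind
 * MPI_Gatherv expects. Returns the total number of elements to receive. */
int gatherv_layout(const int *rows_per_rank, int n_ranks, int row_len,
                   int *recvcounts, int *displs)
{
    int offset = 0;
    for (int r = 0; r < n_ranks; ++r) {
        recvcounts[r] = rows_per_rank[r] * (row_len + 1);
        displs[r] = offset;
        offset += recvcounts[r];
    }
    return offset;
}

/* Root side: copy every received row to the position named in its header. */
void unpack_pivot_rows(const double *buf, int n_rows, int row_len,
                       double *pivots)
{
    for (int i = 0; i < n_rows; ++i) {
        const double *src = buf + (size_t)i * (size_t)(row_len + 1);
        int target = (int)src[0];
        memcpy(pivots + (size_t)target * (size_t)row_len, src + 1,
               (size_t)row_len * sizeof(double));
    }
}
```

In an MPI program, each rank would presumably call pack_pivot_rows before MPI_Gatherv, and the root would call gatherv_layout between the MPI_Gather of the counts and the MPI_Gatherv of the rows. Because every row carries its own destination, the root can unpack the rows in arrival order without knowing which rank sent them.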
[optimization] each communication now uses a specialized communicator to reduce unnecessary synchronization. For example, the communication after the tournament pivoting occurs only within layrK, so it does not need the full communicator lu_comm.
[optimization] added and optimized some OpenMP regions.
[bugfix] fixed the operator precedence of & and && in the tournament_rounds function.
[bugfix] fixed the profiling regions.
[smallfix] fixed the compilation errors with the Score-P compiler.
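As background for the & and && bugfix in tournament_rounds: the PR text does not show the original expression, so the following is only a generic illustration of the pitfall. In C and C++, bitwise & binds more tightly than logical &&, so an unparenthesized x & y && z is parsed as (x & y) && z. The two hypothetical helpers below make each grouping explicit and return different results for the same inputs.

```c
/* Grouping that the compiler applies to the unparenthesized x & y && z:
 * the bitwise AND is evaluated first, then the logical AND. */
int bitwise_and_first(int x, int y, int z)
{
    return (x & y) && z;
}

/* The other grouping, which has to be requested with parentheses:
 * the logical AND yields 0 or 1, which is then masked against x. */
int logical_and_first(int x, int y, int z)
{
    return x & (y && z);
}
```

For x = 2, y = 2, z = 1 the first function returns 1 while the second returns 0, so writing the parentheses explicitly avoids relying on the reader remembering the precedence table.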