SGpp / DisCoTec

MPI-based code for distributed HPC simulations with the sparse grid combination technique. Docs: https://discotec.readthedocs.io/
https://sparsegrids.org/
GNU Lesser General Public License v3.0

Fix allreduce error #97

Closed freifrauvonbleifrei closed 1 year ago

freifrauvonbleifrei commented 1 year ago

For some setups, we got MPI_Allreduce truncation errors (though, oddly, not for the weak scaling setup). This PR contains a fix along with a few more sanity checks.

PS: errors like

Abort(203042319) on node 4092 (rank 4092 in comm 0): Fatal error in PMPI_Wait: Other MPI error, error stack:
PMPI_Wait(205)..................: MPI_Wait(request=0x7ffd5b0155cc, status=0x1) failed
MPIR_Wait(105)..................: 
MPIDU_Sched_progress_state(1036): Invalid communicator

can be attributed to a "wrong" default of I_MPI_ADJUST_IBCAST, which does not allow for non-power-of-two numbers of groups. Values that worked for us (on Intel MPI 2019.12) were 1 and 4.
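
For anyone hitting the same error, here is a minimal sketch of how the setting could be applied, assuming a bash-like job script. The launch line is left as a comment because the process count and the executable name (`./combi_example`) are hypothetical placeholders and not taken from this PR.

```shell
# Select a non-default algorithm for the non-blocking broadcast (MPI_Ibcast).
# Per the note above, 1 and 4 worked with Intel MPI 2019.12; other versions may differ.
export I_MPI_ADJUST_IBCAST=1

# Hypothetical launch line; adapt the process count and binary to your setup:
# mpiexec -n 4096 ./combi_example
```

With Intel MPI's launcher, the variable can also be passed per run via `mpiexec -genv I_MPI_ADJUST_IBCAST 1 ...` instead of exporting it in the environment.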