Status: Open. Opened by grondo 2 weeks ago.
An unknown MPI app on Frontier that was working with flux-core v0.55 started failing after an upgrade to v0.63 with the following:
MPICH ERROR [Rank 0] [job id unknown] [Fri Jun 14 18:55:23 2024] [frontier04856] - Abort(1091855) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned -1
PMI2_Abort: (-1) Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned -1
We'll attempt to get more details next week. The user is testing v0.58 now to see if they still see the failure.
The MPI failure above went away after the user went back to flux-core v0.58.
The MPI version is cray-mpich/8.1.23.