flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

pmi: MPI job working in v0.55 fails in v0.63 #6044

Open grondo opened 2 weeks ago

grondo commented 2 weeks ago

An unidentified MPI application on Frontier that worked with flux-core v0.55 started failing after an upgrade to v0.63, with the following error:

MPICH ERROR [Rank 0] [job id unknown] [Fri Jun 14 18:55:23 2024] [frontier04856] - Abort(1091855) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned -1

PMI2_Abort: (-1) Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(441).......:
MPIR_pmi_init(110)...: PMI_Init returned -1

We'll attempt to get more details next week. The user is now testing v0.58 to see whether the failure still occurs.

grondo commented 2 weeks ago

The MPI issue above went away after downgrading to v0.58.

grondo commented 2 weeks ago

MPI version is cray-mpich/8.1.23