flux-framework / flux-test-collective

Holistic system testing and CI for multiple flux-framework projects
GNU Lesser General Public License v3.0

mpi: do node-exclusive scheduling for cray-pals PMI #19

Closed wihobbs closed 7 months ago

wihobbs commented 7 months ago

Problem: as documented in the "CORAL2: Flux on Cray Shasta" page of the Flux docs, two Flux subinstances sharing the same nodes can fail because their cray-pals PMI port numbers overlap. This has been happening more often since the vcpu test was added, though the reason for the increase is unclear.

The solution is to do node-exclusive scheduling at the top level so the jobs run sequentially instead of sharing nodes; a sketch is shown below.
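
For reference, here is a minimal sketch of what node-exclusive scheduling at the top level can look like with the Flux CLI. The script name `mpi-test.sh`, the node count, and the loop values are illustrative assumptions, not the exact change made in this repo:

```
#!/bin/sh
# Submit each compiler/MPI test as its own batch job with exclusive node
# allocation (-x/--exclusive), so no two subinstances share a node and race
# for the same cray-pals PMI port. mpi-test.sh is a hypothetical test driver.
for compiler in cce gcc; do
    flux batch -x -N2 ./mpi-test.sh "$compiler" cray-mpich
done
# Block until all submitted jobs have completed.
flux queue drain
```

With only two nodes in the top-level instance, exclusive allocation forces the batch jobs to run one at a time rather than packing onto shared nodes.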

wihobbs commented 7 months ago

Example error:

```
job ft3eGmd completed:
Running with cce compiler and cray-mpich MPI
f28ZjEqV
f28bDE7q
f28chDQB
f28eBCgX
f3ctHRVy
f3cumQnK
Fri Mar  1 09:11:30 2024: [PE_2]:inet_listen_socket_setup:bind() failed [fd=3, port=11998 err='Address already in use']
f3cumQnL
f3cxjPM1
Fri Mar  1 09:11:30 2024: [PE_2]:_pmi_inet_listen_socket_setup:socket setup failed
f3ctHRVy: completed MPI_Init in 0.373s.  There are 4 tasks
Fri Mar  1 09:11:30 2024: [PE_2]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
f3ctHRVy: completed first barrier in 0.000s
MPICH ERROR [Rank 0] [job id unknown] [Fri Mar  1 09:11:30 2024] [tioga15] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
f3ctHRVy: completed MPI_Finalize in 0.011s
MPIR_Init_thread(170): 
MPID_Init(441).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
Fri Mar  1 09:11:30 2024: [PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=11998 err='Address already in use']
MPIR_Init_thread(170): 
Fri Mar  1 09:11:30 2024: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
MPID_Init(441).......: 
Fri Mar  1 09:11:30 2024: [PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
MPIR_pmi_init(110)...: PMI_Init returned 1
MPICH ERROR [Rank 0] [job id unknown] [Fri Mar  1 09:11:30 2024] [tioga14] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170): 
MPID_Init(441).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170): 
MPID_Init(441).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1
flux-shell[0]: FATAL: doom: rank 2 exited and exit-timeout=30s has expired
MPI VERSION    : CRAY MPICH version 8.1.28.15 (ANL base 3.4a2)
job.exception: type=exec severity=0 rank 2 exited and exit-timeout=30s has expired
MPI BUILD INFO : Wed Nov 15 20:31 2023 (git hash 1cde46f)
MPI VERSION    : CRAY MPICH version 8.1.28.15 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 15 20:31 2023 (git hash 1cde46f)
MPI VERSION    : CRAY MPICH version 8.1.28.15 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 15 20:31 2023 (git hash 1cde46f)
MPI VERSION    : CRAY MPICH version 8.1.28.15 (ANL base 3.4a2)
MPI BUILD INFO : Wed Nov 15 20:31 2023 (git hash 1cde46f)
Hello from local rank 0 (global rank 0) on tioga14 vcpu 60
Hello from local rank 0 (global rank 2) on tioga15 vcpu 60
Hello from local rank 1 (global rank 1) on tioga14 vcpu 61
Mar 01 09:12:26.465533 broker.err[0]: rc2.0: /var/tmp/fluxci/flux-ayLz3h/jobtmp-0-ft3eGmd/script cce cray-mpich Exited (rc=255) 61.9s
Hello from local rank 1 (global rank 3) on tioga15 vcpu 61
```