Closed jli628 closed 2 weeks ago
This error happens from time to time and you should be able to resubmit. It's not related to E3SM -- other apps see the same issue. It has been happening for a while at low frequency. NERSC isn't sure exactly what the issue is, but has some ideas.
Thank you, @ndkeen. I have resubmitted the simulations over ten times, but none of them worked. I am asking the NERSC Help Desk for help. Thank you for the useful information.
Best,
Jianfeng
OK you did not mention that. I can look at your case.
That would be great! The E3SMv2.1 directory is at /pscratch/sd/j/jli628/E3SMv2.1/E3SMv2.1.F2010-CICE.ne30pg2_r0125_oRRS18to6v3.202404262212_noirrigation
I submitted a 30-minute simulation using the debug queue around noontime on Friday. It worked. However, after that, all my E3SMv2.1 and SCREAM simulations crashed with the same MPI error.
Thank you!
I didn't change anything except for some additional outputs in the namelist for the crashed simulations.
Hi, I have also experienced this issue this morning in the debug queue on Perlmutter, but a resubmission seems to work in my case. The error message in e3sm.log is:
233: Mon Apr 29 10:06:46 2024: [PE_233]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=3, pes_this_node=128, timeout=180 secs
165: Mon Apr 29 10:07:46 2024: [PE_165]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=3, pes_this_node=128, timeout=180 secs
165: Mon Apr 29 10:07:46 2024: [PE_165]:_pmi_mmap_init:Failed to setup PMI mmap.
165: Mon Apr 29 10:07:46 2024: [PE_165]:globals_init:_pmi_mmap_init returned -1
165: MPICH ERROR [Rank 0] [job id unknown] [Mon Apr 29 10:07:46 2024] [nid004474] - Abort(1091855) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
165: MPIR_Init_thread(170):
165: MPID_Init(441).......:
165: MPIR_pmi_init(110)...: PMI_Init returned 1
165:
165: aborting job:
165: Fatal error in PMPI_Init: Other MPI error, error stack:
165: MPIR_Init_thread(170):
165: MPID_Init(441).......:
165: MPIR_pmi_init(110)...: PMI_Init returned 1
srun: error: nid004474: task 165: Exited with exit code 255
srun: Terminating StepId=24954605.0
0: slurmstepd: error: *** STEP 24954605.0 ON nid004305 CANCELLED AT 2024-04-29T17:07:47 ***
130: forrtl: error (78): process killed (SIGTERM)
130: Image              PC                Routine            Line        Source
130: libpthread-2.31.s  000014713EA03910  Unknown            Unknown     Unknown
I still believe the error to be intermittent.
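Since the failure is intermittent, one pragmatic option is to check the log for the PMI signature before deciding to resubmit. A minimal sketch (the log filename, the sample line, and the grep pattern are assumptions based on the excerpt above, not an official diagnostic):

```shell
#!/bin/sh
# Sketch: detect the intermittent PMI bootstrap failure in an E3SM log.
# The sample line written below is illustrative, not from a real run.
log=e3sm.log.sample
printf '%s\n' \
  '165: [PE_165]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=3, pes_this_node=128, timeout=180 secs' \
  > "$log"

# If the PMI mmap signature is present, the crash was likely the known
# node-startup issue and a plain resubmission may succeed.
if grep -q '_pmi_mmap' "$log"; then
    echo "PMI bootstrap failure detected: resubmitting may succeed"
fi
rm -f "$log"
```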
One thing you can always do is try a simpler test. For example, with the E3SM repo:
cd cime/scripts
create_test SMS_D.ne4pg2_oQU480.F2010
And consider an even simpler batch job:
#!/bin/csh
#SBATCH --job-name=simple
#SBATCH -q debug
#SBATCH --account=e3sm
#SBATCH --constraint=cpu
#SBATCH --nodes=1
#SBATCH --time=5
date
ls -l $HOME
date
And if that script were named sb.csh, submit it with: sbatch sb.csh
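After submitting, it may help to confirm the job actually ran and look at its output. A sketch of a typical session, assuming default Slurm output naming (slurm-&lt;jobid&gt;.out); this is illustrative, not part of the original suggestion:

```shell
# Submit the minimal batch script; sbatch prints "Submitted batch job <jobid>".
sbatch sb.csh

# Watch your jobs in the queue until it starts and finishes.
squeue --me

# Once complete, inspect the default Slurm output file for the two `date`
# lines and the directory listing the script prints.
cat slurm-*.out
```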
I was able to check out maint-2.1 and run your script.
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/maint21-apr29/E3SMv2.1.F2010-CICE.ne30pg2_r0125_oRRS18to6v3.202404291038_noirrigation
Note that you have REST_N=1 and REST_OPTION=ndays, which is likely not what you want.
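For reference, the restart frequency can be changed with xmlchange from the case directory. A sketch, assuming monthly restarts are what's actually wanted in place of daily (the chosen values are illustrative):

```shell
# Run from the case directory. REST_OPTION sets the restart interval unit
# and REST_N the count; nmonths/1 = one restart per simulated month.
./xmlchange REST_OPTION=nmonths
./xmlchange REST_N=1

# Verify the new values.
./xmlquery REST_OPTION REST_N
```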
Thank you, @ndkeen and @kchong75. It seems the error persists only for me. I will do more tests. Thanks.
@ndkeen The issue is gone as of Monday evening. Both GNU and Intel work now. I don't know what happened. Thank you for your time and help.
Hi,
Did anyone encounter the error below on pm-cpu? It first appeared in my simulations last Friday afternoon. Does anyone know how to solve the problem? Thanks.