E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

MPI error for E3SMv2.1 and SCREAM on pm-cpu #6376

Closed. jli628 closed this issue 2 weeks ago.

jli628 commented 2 weeks ago

Hi,

Did anyone encounter the error below on pm-cpu? The error started appearing in my simulations last Friday afternoon. Does anyone know how to solve the problem? Thanks.

[screenshot of the MPI error message]
ndkeen commented 2 weeks ago

This error happens from time to time and you should be able to resubmit. It's not related to E3SM -- other apps see the same issue. It has been happening for a while at low frequency. NERSC isn't sure exactly what the issue is, but has some ideas.
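
For reference, resubmitting here usually just means rerunning the case submission from the case directory. A minimal sketch, assuming a standard CIME case layout (the path below is a placeholder):

cd /path/to/your/case/directory
./case.submit
squeue -u $USER   # confirm the new job is queued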

jli628 commented 2 weeks ago

Thank you, @ndkeen. I have resubmitted the simulations over ten times, but none of them worked. I am asking the NERSC Help Desk for help. Thank you for the useful information.

Best,

Jianfeng

ndkeen commented 2 weeks ago

OK, you did not mention that. I can look at your case.

jli628 commented 2 weeks ago

That would be great! The E3SMv2.1 directory is at /pscratch/sd/j/jli628/E3SMv2.1/E3SMv2.1.F2010-CICE.ne30pg2_r0125_oRRS18to6v3.202404262212_noirrigation

I submitted a 30-minute simulation using the debug queue around noontime on Friday. It worked. However, after that, all my E3SMv2.1 and SCREAM simulations crashed with the same MPI error.

Thank you!

jli628 commented 2 weeks ago

I didn't change anything except for some additional outputs in the namelist for the crashed simulations.

kchong75 commented 2 weeks ago

Hi, I have also experienced this issue this morning on the debug queue on Perlmutter, but a resubmission seems to work for my case. The error message in e3sm.log is:

233: Mon Apr 29 10:06:46 2024: [PE_233]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=3, pes_this_node=128, timeout=180 secs
165: Mon Apr 29 10:07:46 2024: [PE_165]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=3, pes_this_node=128, timeout=180 secs
165: Mon Apr 29 10:07:46 2024: [PE_165]:_pmi_mmap_init:Failed to setup PMI mmap.
165: Mon Apr 29 10:07:46 2024: [PE_165]:globals_init:_pmi_mmap_init returned -1
165: MPICH ERROR [Rank 0] [job id unknown] [Mon Apr 29 10:07:46 2024] [nid004474] - Abort(1091855) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
165: MPIR_Init_thread(170):
165: MPID_Init(441).......:
165: MPIR_pmi_init(110)...: PMI_Init returned 1
165:
165: aborting job:
165: Fatal error in PMPI_Init: Other MPI error, error stack:
165: MPIR_Init_thread(170):
165: MPID_Init(441).......:
165: MPIR_pmi_init(110)...: PMI_Init returned 1
srun: error: nid004474: task 165: Exited with exit code 255
srun: Terminating StepId=24954605.0
0: slurmstepd: error: *** STEP 24954605.0 ON nid004305 CANCELLED AT 2024-04-29T17:07:47 ***
130: forrtl: error (78): process killed (SIGTERM)
130: Image              PC                Routine            Line        Source
130: libpthread-2.31.s  000014713EA03910  Unknown            Unknown     Unknown

ndkeen commented 2 weeks ago

I still believe the error to be intermittent.

One thing you can always do is try a simpler test. For example, with the E3SM repo:

cd cime/scripts
create_test SMS_D.ne4pg2_oQU480.F2010
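
If there is any doubt about which machine or compiler configuration is being picked up, the same test can be pinned explicitly; a sketch assuming the standard create_test options (machine and compiler names as used for E3SM on Perlmutter):

create_test SMS_D.ne4pg2_oQU480.F2010 --machine pm-cpu --compiler intel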

And consider an even simpler batch job:

#!/bin/csh
#SBATCH --job-name=simple
#SBATCH -q debug
#SBATCH --account=e3sm
#SBATCH --constraint=cpu
#SBATCH --nodes=1
#SBATCH --time=5
date

ls -l $HOME

date

And if that were named sb.csh, submit it with sbatch sb.csh.
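
To check whether even that bare job runs, the usual Slurm commands apply; a short sketch, assuming Slurm's default output file naming (the job id comes from the sbatch output):

sbatch sb.csh
squeue -u $USER          # wait for the job to start and finish
cat slurm-<jobid>.out    # stdout/stderr of the script under the default naming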

I was able to check out maint-2.1 and run your script.

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/maint21-apr29/E3SMv2.1.F2010-CICE.ne30pg2_r0125_oRRS18to6v3.202404291038_noirrigation

Note that you have REST_N=1 and REST_OPTION=ndays, which is likely not what you want.
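
For reference, those restart settings can be changed with xmlchange from the case directory; a sketch assuming, for example, that monthly restarts are the intent:

./xmlchange REST_OPTION=nmonths
./xmlchange REST_N=1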

jli628 commented 2 weeks ago

Thank you, @ndkeen and @kchong75. It seems the error only sticks with me. I will do more tests. Thanks.

jli628 commented 2 weeks ago

@ndkeen The issue went away on Monday evening. Both GNU and Intel work now. I don't know what happened. Thank you for your time and help.