E3SM-Project / scream

Exascale global atmosphere model written in C++ as part of the E3SM project
https://e3sm-project.github.io/scream/

ELM hangs while writing output/restarts due to MPI_bcast flooding #1920

Closed ndkeen closed 2 months ago

ndkeen commented 1 year ago

I occasionally see (various) jobs hang on pm (pm-cpu or pm-gpu) and I've been trying to debug these in general, but this one seems different. If I run a basic ne30 case asking for a restart at the end of the case, it seems to work. But if I ask for 2 restarts in the same job submission, it often hangs on the second restart write. So far it has always hung while trying to write a file such as:

f30cpu.F2010-SCREAMv1.ne30pg2_ne30pg2.bspa.gnu.n022a128x1cXi2.pk1.30d.wr.elm.r.0001-01-11-00000.nc

I also encountered a hang on the first restart attempt, but I think that was after reading in a restart (ie using CONTINUE_RUN=TRUE).

I've tried a few different scenarios, but here is a case where I asked to run for 30 days, with restarts every 5th day.

/global/cfs/cdirs/e3sm/ndk/e3sm_scratch/pm-cpu/bspa/f30cpu.F2010-SCREAMv1.ne30pg2_ne30pg2.bspa.gnu.n022a128x1cXi2.pk1.30d.wr

ambrad commented 1 year ago

Have you tried reproducing this on e.g. Chrysalis?

whannah1 commented 1 year ago

@lee1046 and I have had several hanging runs on pm-gpu with E3SM-MMF. We haven't been able to narrow down the problem much, but it does happen at the end of a run, presumably when restarts are being written. We've been crossing our fingers that simply updating our branch after pscratch comes back up will make the problem go away. It's easily repeatable in our case, and I don't think it matters whether it's a second restart or a continuation of a previous run.

ndkeen commented 1 year ago

I have another case where this happened. This was on pm-gpu at ne120 with 32 nodes. I attempted to run for 1 year, with restarts every month. It hangs while writing the elm.h0.0001-01.nc file for the first month. I need to verify whether this is always where it hangs.

/global/cfs/cdirs/e3sm/ndk/e3sm_scratch/pm-gpu/se29-sep11/f120.F2010-SCREAMv1-noAero.ne120pg2_r0125_oRRS18to6v3.se29-sep11.gnugpu.1y.n032a4xX18888c2.soc96n1.w500.t100.FIo.wr

Here is a glimpse of where the code is while hanging:

#0  0x0000150f92cfa950 in MPIDI_CRAY_Common_lmt_progress () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#1  0x0000150f92cefdb9 in MPIDI_SHMI_progress () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#2  0x0000150f9179b629 in MPIR_Wait_impl.part.0 () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#3  0x0000150f9254e3a6 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#4  0x0000150f92560c11 in MPIC_Sendrecv () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#5  0x0000150f92468a2c in MPIR_Allreduce_intra_recursive_doubling () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#6  0x0000150f90a490b1 in MPIR_Allreduce_intra_auto () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#7  0x0000150f90a49295 in MPIR_Allreduce_impl () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#8  0x0000150f9274dee7 in MPIR_CRAY_Allreduce () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#9  0x0000150f90aa9b81 in PMPI_Allreduce () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#10 0x0000000000f70f3b in flush_output_buffer ()
#11 0x0000000000f62b76 in PIOc_put_vars_tc ()
#12 0x0000000000f639df in PIOc_put_var_tc ()
#13 0x0000000000f15bc8 in __pionfput_mod_MOD_put_var_1d_double ()
#14 0x0000000000b639fa in __ncdio_pio_MOD_ncd_io_1d_double_glob ()
#15 0x00000000005e8e94 in __histfilemod_MOD_htape_timeconst ()
#16 0x0000000000605e1b in __histfilemod_MOD_hist_htapes_wrapup ()
#17 0x00000000005a3f02 in __elm_driver_MOD_elm_drv ()
#18 0x000000000058baea in __lnd_comp_mct_MOD_lnd_run_mct ()
#19 0x00000000004f2b69 in __component_mod_MOD_component_run ()
#20 0x00000000004dcbb0 in __cime_comp_mod_MOD_cime_run ()
#21 0x00000000004b09ae in main ()
lee1046 commented 1 year ago

Not sure if it is related, but I recently had a similar issue with my simulation hanging on Perlmutter GPU. @whannah1 and I narrowed it down to the elm.h0 file causing the hang. I had a conversation with @jayeshkrishna and @Danqing Wu about this issue, and they came up with a workaround: apply a recent Scorpio patch and use PIO_TYPE = 1 (scorpio_classic). This is the Scorpio patch that I applied: https://github.com/E3SM-Project/scorpio/pull/479

ndkeen commented 1 year ago

Thanks @lee1046 -- I do see that, at least in the most recent cases with this issue, it is hanging during the writing of the elm.h0.0001-01.nc file at the end of the month.

It seems pretty odd that the solution is to use PIO1.

Reminder to myself to try the following, which might avoid writing the elm.h0 file:

hist_nhtfrq = -175200     ! Output frequency =  average over 20 year (24*20*365)
hist_mfilt = 1        ! History file has 1 time sample 
ndkeen commented 1 year ago

Update: I have found that we also need to avoid the very first elm.h0 file. After doing this, my pm-gpu cases are getting beyond the hang. Obviously not a great solution.

cat <<EOF >> user_nl_elm
hist_nhtfrq = -999999999  ! Output frequency
hist_mfilt = 1            ! History file has 1 time sample
hist_empty_htapes = .true.
EOF

This "worked" for ne120 and ne256 cases with scream master of Sep 11th.

rljacob commented 1 year ago

Can you run ne120pg2_r0125_oRRS18to6v3.WCYCL1950 on Perlmutter with land output? That case is in our high-res test suite, although it doesn't output elm files.

ndkeen commented 1 year ago

I can run WCYCL1950.ne120pg2_r0125_oRRS18to6v3 for 1 day with no IO on pm-cpu (this is with a custom PE layout, as we don't have many defaults for the machine yet). I can try flipping on some output? And I can run e3sm_developer on pm-cpu, which does write many elm.h0 files, but I think those are all ne30 or smaller.

bishtgautam commented 1 year ago

Based on the description of the Scorpio patch (https://github.com/E3SM-Project/scorpio/pull/479), it seems that ELM is reading a variable for which memory isn't allocated correctly. It is strange that reading a variable and not writing a variable is causing ELM to hang.

@jayeshkrishna @dqwu, Am I correctly understanding the Scorpio patch?

dqwu commented 1 year ago

> Based on the description of the Scorpio patch (E3SM-Project/scorpio#479), it seems that ELM is reading a variable for which memory isn't allocated correctly. It is strange that reading a variable and not writing a variable is causing ELM to hang.
>
> @jayeshkrishna @dqwu, Am I correctly understanding the Scorpio patch?

@bishtgautam That patch is for scorpio_classic. The hanging issue has only been reproduced with scorpio so far.

whannah1 commented 1 year ago

@ndkeen are you using threads on your runs? Have you tried to reproduce with a single threaded case? I only ask because I've been having issues on Summit lately, and at least one of those issues came down to a threading problem that causes variables on certain threads to be prematurely deallocated. Given Gautam's comment I'm wondering if these things are related...

ndkeen commented 1 year ago

I normally run with threads, but trying without, I still see the same hang.

Actually, one possibility could be the different filesystems. In recent weeks, scratch has been unavailable (or degraded) and I've been using CFS, which is GPFS and has some known peculiar properties for writing in parallel. It would be easy to just try a test using scratch, of course, but I haven't been able to get a job to run there for some reason (either the queue or something else is preventing jobs that use scratch from running while it's degraded).

ndkeen commented 1 year ago

I finally tested using PM scratch. I used a newer scream repo as well -- from Sep 29th. Both ne120 and ne256 complete 1 day without a restart, but both hang when I ask for a restart. They appear to be stuck during the elm.r file writing. This is on pm-gpu.

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se32-sep29/f120.F2010-SCREAMv1-noAero.ne120pg2_r0125_oRRS18to6v3.se32-sep29.gnugpu.1d.n048a4x1.so96n4.wr

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se32-sep29/f256.F2010-SCREAMv1-noAero.ne256pg2_r0125_oRRS18to6v3.se32-sep29.gnugpu.1d.n048a4x16ci2.so144n1.wr

And ne30 restart writing still works OK. /pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se32-sep29/f30.F2010-SCREAMv1.ne30pg2_ne30pg2.se32-sep29.gnugpu.1m.n001a4xX14444c2.so108n8.wr

bishtgautam commented 1 year ago

We have the same issue being tracked in two places: here and in https://github.com/E3SM-Project/E3SM/issues/5197.

@ndkeen, this is probably an issue in e3sm master itself, and @dqwu is adding his findings in https://github.com/E3SM-Project/E3SM/issues/5197. Should we work with e3sm master to fix this issue?

ndkeen commented 1 year ago

Note that my cases are using only 48 nodes of pm-gpu. I could probably reproduce with 24 nodes, if that helps debugging.

whannah1 commented 1 year ago

FYI, I'm getting the hang with 32 nodes on pm-gpu with an MMF case at ne30pg2. EDIT: I also just tried a single-node ne4pg2 case and it did not hang... not sure what to make of that.

ndkeen commented 1 year ago

If one theory is that a variable's size is too large when writing out, is there a way to artificially make a variable much larger so we can try to reproduce at ne30?

whannah1 commented 1 year ago

@ndkeen with the MMF you can make the CRM size as large as you want, which will create a lot of data to write to the restart files. Hit me up on slack if you want to try something like that.

ndkeen commented 1 year ago

Trying again with a machine almost identical to pm-cpu, I still see the hang with the ne256 problem. This is using the Oct 10th scream repo and a manual update of the scorpio repo via git pull origin master inside of externals/scorpio on Oct 11th. The last files written look like:

-rw-rw-r--  1 ndk ndk 4070584944 Oct 11 14:26 f256.F2010-SCREAMv1-noAero.ne256pg2_r0125_oRRS18to6v3.se36-oct10.gnu.1d.n048a128x1c8.pk8.wrb.scream.hi.0001-01-01-79200.nc
-rw-rw-r--  1 ndk ndk     174646 Oct 11 14:28 ocn.log.49034.221011-135930
-rw-rw-r--  1 ndk ndk        257 Oct 11 14:28 rpointer.ice
-rw-rw-r--  1 ndk ndk      29804 Oct 11 14:28 ice.log.49034.221011-135930
-rw-rw-r--  1 ndk ndk       9550 Oct 11 14:28 atm.log.49034.221011-135930
-rw-rw-r--  1 ndk ndk 1034115120 Oct 11 14:28 f256.F2010-SCREAMv1-noAero.ne256pg2_r0125_oRRS18to6v3.se36-oct10.gnu.1d.n048a128x1c8.pk8.wrb.cice.r.0001-01-02-00000.nc
-rw-rw-r--  1 ndk ndk      76063 Oct 11 14:28 lnd.log.49034.221011-135930
-rw-rw-r--  1 ndk ndk      67320 Oct 11 14:28 f256.F2010-SCREAMv1-noAero.ne256pg2_r0125_oRRS18to6v3.se36-oct10.gnu.1d.n048a128x1c8.pk8.wrb.elm.rh0.0001-01-02-00000.nc
-rw-rw-r--  1 ndk ndk      89676 Oct 11 14:28 f256.F2010-SCREAMv1-noAero.ne256pg2_r0125_oRRS18to6v3.se36-oct10.gnu.1d.n048a128x1c8.pk8.wrb.elm.r.0001-01-02-00000.nc
wrliugit commented 1 year ago

I ran a SCREAM RRM simulation on pm and hit a similar issue. The simulation generated all atm outputs successfully and died at the end of the simulation when generating restart files. This was the first attempt to write a restart.

The last few files generated are:

-rw-r--r-- 1 wrliu wrliu         257 Oct 22 15:35 rpointer.ice
-rw-r--r-- 1 wrliu wrliu     1517464 Oct 22 15:35 ice.log.3458129.221022-042022
-rw-r--r-- 1 wrliu wrliu  1034115120 Oct 22 15:35 SCREAMv01.conus_ne32x32_pg2.F2010-SCREAM-HR-DYAMOND2.20221018.gre_c.uv.cice.r.2019-11-16-00000.nc
-rw-r--r-- 1 wrliu wrliu       80992 Oct 22 15:35 SCREAMv01.conus_ne32x32_pg2.F2010-SCREAM-HR-DYAMOND2.20221018.gre_c.uv.elm.rh0.2019-11-16-00000.nc
-rw-r--r-- 1 wrliu wrliu      108052 Oct 22 15:35 SCREAMv01.conus_ne32x32_pg2.F2010-SCREAM-HR-DYAMOND2.20221018.gre_c.uv.elm.r.2019-11-16-00000.nc
-rw-r--r-- 1 wrliu wrliu     1204106 Oct 22 15:35 lnd.log.3458129.221022-042022
-rw-r--r-- 1 wrliu wrliu     1553284 Oct 22 15:36 e3sm.log.3458129.221022-042022

The error info in the e3sm.log is

 1536: MPICH ERROR [Rank 1536] [job id 3458129.0] [Sat Oct 22 15:35:58 2022] [nid005474] - Abort(134243855) (rank 1536 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
 1536: MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received -32766 but expected 1
 1536:
 1536: aborting job:
 1536: Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
 1536: MPIR_CRAY_Bcast_Tree(183): message sizes do not match across processes in the collective routine: Received -32766 but expected 1

The last few lines of the lnd.log are:

 Opened file ./SCREAMv01.conus_ne32x32_pg2.F2010-SCREAM-HR-DYAMOND2.20221018.gre_c.uv.elm.r.2019-11-16-00000.nc to write         137
 htape_create : Opening netcdf rhtape ./SCREAMv01.conus_ne32x32_pg2.F2010-SCREAM-HR-DYAMOND2.20221018.gre_c.uv.elm.rh0.2019-11-16-00000.nc
 Opened file ./SCREAMv01.conus_ne32x32_pg2.F2010-SCREAM-HR-DYAMOND2.20221018.gre_c.uv.elm.rh0.2019-11-16-00000.nc to write         138
 htape_create : Successfully defined netcdf restart history file            1
whannah1 commented 1 year ago

@wrliugit @ndkeen I've been doing some very detailed digging into this problem over the last week along with @dqwu and @jayeshkrishna, and I was actually able to zero in on a problem that could be "fixed" by adding an MPI barrier call. I'm still running more tests, and I just had a coupled case fail, but if you want to try my quick fix, open components/elm/src/biogeophys/SurfaceAlbedoType.F90 and add call mpi_barrier(mpicom,ier) as the first call in the "Restart" subroutine, along with a declaration for the error code: integer :: ier. The real fix probably needs to happen inside scorpio, but I'm hoping a variation of this fix can allow us to keep running experiments in the meantime.
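
For reference, here is a minimal compilable sketch of the shape of that workaround (not the actual diff; the module, routine, and communicator names below are placeholders rather than ELM's):

module restart_barrier_sketch
  use mpi
  implicit none
contains
  ! Placeholder for a restart-writing routine like the one described above.
  subroutine write_restart_fields(comm)
    integer, intent(in) :: comm   ! placeholder for the mpicom communicator
    integer :: ier                ! error code for the barrier call
    call mpi_barrier(comm, ier)   ! workaround: synchronize all ranks before any restart I/O
    ! ... the existing restart define/read/write calls would follow here ...
  end subroutine write_restart_fields
end module restart_barrier_sketch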

whannah1 commented 1 year ago

I'm noting this on several issues that describe hanging runs on Perlmutter - I've just verified a fix/workaround suggested by @jayeshkrishna in several different compsets (F2010, WCYCL, MMF). Just need to add these environment variables in config_machines.xml:

<env name="MPICH_COLL_OPT_OFF">1</env>
<env name="MPICH_SHARED_MEM_COLL_OPT">0</env>
PeterCaldwell commented 1 year ago

Oh cool. Do you know what performance impact this workaround has? It isn't switching back to old PIO, is it?

ndkeen commented 1 year ago

I tried turning off MPI collective optimizations via those env vars and submitted a ne120 case for 5 days with a restart. I still see the same hang, which appears to be related to LND files. This was on 43 nodes of pm-cpu.

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/se39-oct28/f120.F2010-SCREAMv1.ne120pg2_r0125_oRRS18to6v3.se39-oct28.gnu.5d.n043a128x1c8.pk8.wr.nocoll

I repeated this experiment using Oct 29th scream repo, with an updated PIO submodule and see same results. And that includes using the MPICH env vars.

I also tried a ne30 case for 5 months, writing restarts every month. The first month was OK, but it hung while writing elm files on the second restart (or elm.h0).

whannah1 commented 1 year ago

@PeterCaldwell Not sure about the performance impact. I'll need to test on another machine that doesn't hang in order to measure this.

@ndkeen that's interesting that you were able to write the first restart in that last case. I've found from my print statements that some of the processes hang at one point and the rest hang at a later point. So it was tricky to figure out whether it's the "r", "rh0", or "h0" files, but the "r" restart files come first, so I was eventually able to show that that file was the real culprit, even though the later parts of the log file were clearly past the creation of the "r" file.

BTW, I don't think you need to run nearly that long. You should be able to reproduce the problem by running a few time steps. Also, it seems ne30 reproduces just as well as ne120.

jayeshkrishna commented 1 year ago

Also, to answer @PeterCaldwell's second question, @whannah1 was able to get successful runs (no hang) with SCORPIO and the MPI library settings (setting the env vars, MPICH_*), so he did not have to use the older version of the I/O library to get successful runs. @dqwu: Can you try the case that @ndkeen is trying out? Try the following env vars for your runs (turning off optimizations & bumping up internal MPI library buffers):

MPICH_GPU_SUPPORT_ENABLED = 0
MPICH_OPTIMIZED_MEMCPY = 0
MPICH_RMA_MAX_PENDING = 128
MPICH_RMA_SHM_ACCUMULATE = 0
MPICH_SHM_PROGRESS_MAX_BATCH_SIZE = 32
MPICH_SMP_SINGLE_COPY_MODE = NONE

MPICH_COLL_OPT_OFF = 1
MPICH_SHARED_MEM_COLL_OPT = 0
MPICH_MPIIO_HINTS_DISPLAY = 1

FI_OFI_RXM_BUFFER_SIZE = 32728
FI_OFI_RXM_SAR_LIMIT = 0
FI_OFI_RXM_RX_SIZE = 8192
FI_OFI_RXM_USE_SRX = 0
FI_VERBS_PREFER_XRC = 0

Also try adding the following env var into the list above if the runs still hang,

MPICH_COLL_SYNC = 1
dqwu commented 1 year ago

@ndkeen Could you please share one of your latest test scripts that has the hanging issue on Perlmutter? Thanks.

ndkeen commented 1 year ago

Here is a reproducer /global/u1/n/ndk/cn-cpu-f120-simplec1.csh

However, I think I have a work-around myself now. Using some libfabric env vars that were already suggested to me by HP to solve a different issue, it appears to be working now.

ndkeen commented 1 year ago

With the following env vars, I can run the ne120 problem (above) and write restarts after 5 days as well as the same ne30 problem for 3 months, writing monthly restarts. So I think this is promising.

setenv FI_CXI_DEFAULT_CQ_SIZE 71680
setenv FI_CXI_CQ_FILL_PERCENT 90
setenv FI_CXI_REQ_BUF_SIZE 12582912
setenv FI_UNIVERSE_SIZE 4096

I'm being told that setenv FI_CXI_CQ_FILL_PERCENT 90 is no longer needed. Will have to test without.

whannah1 commented 1 year ago

In regards to @PeterCaldwell's question about the performance impact of these options, I have some better data. These numbers are from 1-day runs with F2010 at ne30pg2 on 32 nodes:

5.70 sypd    using MPICH_COLL_SYNC=1
7.41 sypd    MPICH_COLL_OPT_OFF=1 & MPICH_SHARED_MEM_COLL_OPT=0
6.59 sypd    patched version of scorpio_classic
PeterCaldwell commented 1 year ago

@whannah1 - so what is the timing for the default configuration without any fixes?

ndkeen commented 1 year ago

Note that Walter's suggestion to use the MPICH vars does NOT fix the issue I'm having. Does it make sense to move discussion about the performance of these vars to a different issue?

whannah1 commented 1 year ago

@PeterCaldwell I can't run the default config because it hangs.

whannah1 commented 1 year ago

@ndkeen it actually seems that these fixes do not work for a much longer MMF run (need to re-verify this though), so it might still be the same set of symptoms.

ndkeen commented 1 year ago

I'd be curious if the libfabric env vars work. Though I've no intuition on how they might work with a GPU run. I'm tempted to say these may be geared toward helping CPU runs.

whannah1 commented 1 year ago

So another confusing observation about this problem - my MMF cases with 32 nodes will run when changing the env flags, but when I bump it up to 64 nodes I get the same hanging behavior... I already asked Jayesh and Danqing about this but they didn't have any ideas about why this would be the case. This could explain why the flags I've been using don't work for Noel?

Anyway, I also have data from Cori to get a sense of how these flags affect the performance, but keep in mind these are still 1-day runs with F2010 on 32 nodes, so there's certainly "noise" in these estimates:

2.10 sypd   default config
1.86 sypd   using MPICH_COLL_SYNC=1
2.30 sypd   MPICH_COLL_OPT_OFF=1 & MPICH_SHARED_MEM_COLL_OPT=0
1.53 sypd   patched version of scorpio_classic

So it seems that the second "fix" doesn't affect the throughput significantly.

dqwu commented 1 year ago

> I ran a SCREAM RRM simulation on pm and hit a similar issue. The simulation generated all atm outputs successfully and died at the end of the simulation when generating restart files. This was the first attempt to write a restart. [...]

@wrliugit It seems that we have two possible workarounds so far.

Workaround 1 (tested with some MMF runs):

setenv MPICH_COLL_SYNC MPI_Bcast

This directly adds a barrier before all MPI_Bcast calls to avoid the errors you encountered or possible hanging issues.

Workaround 2 (suggested by NERSC and Noel):

setenv FI_CXI_DEFAULT_CQ_SIZE 71680
setenv FI_CXI_REQ_BUF_SIZE 12582912
setenv FI_UNIVERSE_SIZE 4096

This might also affect the behavior of MPI_Bcast calls on top of the libfabric framework.

Could you please rerun your failed test with each of them? Thanks. We can see whether both of them make your test pass, and find out which one has less impact on performance.

rljacob commented 1 year ago

Has anyone looked at what ELM is doing in its history write routines? Why is ELM and not another model's output causing this? cc @bishtgautam

whannah1 commented 1 year ago

I've been looking into this with help from @dqwu. One problem that we think we've identified is duplicate calls to define variable attributes in a file, such as the "units" attribute. This happens in the ELM code due to how the restart/history files are handled. This seems to be behind the excessive MPI_Bcast calls that ultimately cause the run to hang. Configuring the run or modifying the code to reduce the number of those calls can fix the hang in certain circumstances. This would still obviously happen on any machine, so it seems there's some problem with the libraries being used on PM. @dqwu can explain it better than I can.

EDIT: at one point we identified a simple way to disable the duplicate attribute definitions, but this wasn't a reliable method for stopping the hangs because other things influence the number of MPI_Bcast calls.

wrliugit commented 1 year ago

@dqwu I moved my simulations to Cori, so I haven't run anything on pm recently, and I will stay on Cori for the rest of this year. But I could try the two methods in my old pm tests to make the debugging easier. My old runs are too long, so I might only run a short test using the same grid and see whether the restart files are OK. I will let you know when it's done. Thanks.

dqwu commented 1 year ago

The MPI standard only requires that MPI_Bcast be called collectively. Whether the call also synchronizes is implementation dependent.

Developers have reported confirmed MPI_Bcast flooding issues, especially with Open MPI.

https://users.open-mpi.narkive.com/uGZNKheP/ompi-program-hangs-in-mpi-bcast

Finally, just as a wild guess, we inserted
'mpi_barrier' calls just before the 'mpi_bcast' calls, and the program
now runs without problems.
...
My only explanation is that some internal resource gets
exhausted because of the large number of 'mpi_bcast' calls in rapid
succession, and the barrier calls force synchronization which allows the
resource to be restored.

https://stackoverflow.com/questions/60880159/mpi-bcast-hanging-sometimes

if you are using Open MPI or its derivative, the root rank might be much faster than
the other ranks and hence flooding them. If adding MPI_Barrier(MPI_COMM_WORLD)
before MPI_Bcast() gets rid of the hang, then you should consider using the coll/sync
module (it will automatically do that for you)
...
flooding can occur when the MPI library makes no control flow, and the root process
calls MPI_Bcast() many times in a row, generating a lot of unexpected messages on
the other ranks and hence causing all kind of problems (memory consumption, slowdown, ...)

For one E3SM MMF case run with 128 tasks on Perlmutter, there is a reproducible hanging issue inside SCORPIO, and there are more than 130K MPI_Bcast calls being made.

Some workarounds are supported by Cray MPI and Open MPI: MPICH_COLL_SYNC (used by Cray MPI) or OMPI_MCA_coll_sync_barrier_before (used by Open MPI) can automatically add an MPI_Barrier before MPI_Bcast.

Alternatively, NERSC has suggested some libfabric env vars (mentioned and tested by Noel), which might affect the behavior of MPI_Bcast calls on top of the libfabric framework. While these env vars seem to avoid the hang for some E3SM cases we have run recently, they are not as reliable as MPICH_COLL_SYNC: they might still hang for some other E3SM cases (not confirmed so far), and they also depend on the specific libfabric library installed at NERSC.
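
To make the flooding pattern concrete, here is a small, self-contained Fortran sketch (not E3SM or SCORPIO code): the root rank issues many tiny MPI_Bcast calls in rapid succession, and an optional barrier before each call, which is effectively what MPICH_COLL_SYNC or Open MPI's coll/sync module does, keeps the root from running far ahead of the slower ranks:

program bcast_flood_sketch
  use mpi
  implicit none
  integer :: ierr, rank, i
  integer, parameter :: ncalls = 100000            ! many tiny broadcasts in a row, as in the SCORPIO inq paths
  logical, parameter :: sync_before_bcast = .true. ! emulate MPICH_COLL_SYNC / Open MPI coll/sync
  real(8) :: val

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  do i = 1, ncalls
     ! Without this barrier, a fast root can flood slower ranks with unexpected messages.
     if (sync_before_bcast) call MPI_Barrier(MPI_COMM_WORLD, ierr)
     if (rank == 0) val = real(i, 8)
     call MPI_Bcast(val, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
  end do

  call MPI_Finalize(ierr)
end program bcast_flood_sketch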

dqwu commented 1 year ago

> Has anyone looked at what ELM is doing in its history write routines? Why is ELM and not another model's output causing this? cc @bishtgautam

Maybe ELM invokes more inq calls inside SCORPIO. For each inq call, the results are obtained on the IO tasks and then broadcast from the IO root to the other tasks. The hang is only reproducible when there are many MPI_Bcast calls in SCORPIO. For a specific case, when we removed some unnecessary MPI_Bcast calls in the SCORPIO code, the hang went away.

ndkeen commented 1 year ago

We think we have a work-around for this issue that also seems to work for other hanging issues on PM. It will need to be upstreamed to scream, of course. https://github.com/E3SM-Project/E3SM/pull/5291

PeterCaldwell commented 1 year ago

I'm reopening this issue because @elynnwu also seems to be experiencing this issue with the Intel compiler using MVAPICH2 MPI... which illustrates that this issue is going to keep showing up every time we try to run with a different MPI implementation. Wouldn't it make more sense, and be straightforward, to add MPI barriers in front of the offending MPI_Bcast calls? @bishtgautam - would this be hard to do?

whannah1 commented 1 year ago

@PeterCaldwell I recall trying to do exactly that and finding that it wasn't a complete solution. I forget the details though, maybe one of my comments above mentions it.

bishtgautam commented 1 year ago

ELM is doing a round-robin domain decomposition. Would having a different decomposition help? I can implement a new domain decomposition that assigns roughly ngrids/NTASKS gridcells to each task. @whannah1 @dqwu Any thoughts?
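
For illustration, here is a small standalone sketch (not ELM's actual decomposition code, and assuming the ngrids/NTASKS mapping means contiguous chunks of gridcells per task) contrasting the two mappings:

program decomp_sketch
  implicit none
  integer, parameter :: ngrids = 10, ntasks = 3
  integer :: g, owner_rr, owner_blk, per_task

  ! Cells per task for the block (ngrids/NTASKS) mapping, rounded up.
  per_task = (ngrids + ntasks - 1) / ntasks
  do g = 1, ngrids
     owner_rr  = mod(g - 1, ntasks)   ! round-robin: cell g goes to rank mod(g-1, ntasks)
     owner_blk = (g - 1) / per_task   ! block: contiguous chunks of ~ngrids/ntasks cells per rank
     print '(a,i3,a,i2,a,i2)', ' gridcell', g, '  round-robin rank', owner_rr, '  block rank', owner_blk
  end do
end program decomp_sketch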

rljacob commented 1 year ago

Yes a different decomposition would definitely help. ngrids/NTASKS would be better.

sarats commented 1 year ago

Re-iterating from the related issue https://github.com/E3SM-Project/E3SM/issues/5554

A suggestion: Add a barrier periodically (every n-steps etc.) in the land driver when performing high-frequency I/O to flush the communication queues/buffers. It localizes and minimizes the sync overhead to just land I/O and allows fine-tuning as needed.

Maybe the I case invokes a lot of pio_inq calls in SCORPIO, which use MPI_Bcast.

Something to follow up on.

bishtgautam commented 1 year ago

@PeterCaldwell, @elynnwu, I have #5690, which adds a new ELM domain decomposition algorithm; it can be activated by the following change:

cat >> user_nl_elm << EOF
domain_decomp_type = 'simple'
EOF

I have only tested the PR with one thread being used. Would you like to test the branch to see if it fixes the issue?

PeterCaldwell commented 1 year ago

Great, thanks Gautam! Do you expect this new decomposition to be much slower than the old one?