E3SM-Project / scream

Fork of E3SM used to develop an exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

MPI hang when using many tasks on pm-cpu #1731

Closed: ndkeen closed this issue 1 year ago

ndkeen commented 2 years ago

With pm-cpu, we see that using 128 MPI ranks per node is the best performer for ne30 at low node counts (not surprising, as there are 128 cores). However, as I increase the number of nodes, I'm unable to run with this PE layout. Note that I also have major performance issues when strong scaling this way (i.e., MPI-packed nodes) using vanilla E3SM F cases, so it could be the same sort of issue. But because my scream runs are timing out at the same location, I wanted to open an issue.

I can run with 1, 2, 4, 8, or 16 nodes using 128x1 (or 128x2), but with 32 nodes I can't. With 43 nodes, I also can't run with 64 MPI ranks per node and need 32. With 85 nodes, I actually need to use 16 MPI ranks per node to see it complete.

It makes me think it is hanging somewhere.

   0: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation
   0: Create Pool
   0: [EAMXX] initialize_atm_procs ... done!
   0: [EAMxx::init] resolution-dependent device memory footprint: 4.414032MB
   0: slurmstepd: error: *** STEP 2315314.0 ON nid004921 CANCELLED AT 2022-06-08T04:28:43 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Noting that I can reproduce this issue with

create_test SMS_Ln24_P4096x1.ne30_ne30.F2010-SCREAMv1.pm-cpu_gnu

which will use 64 MPI ranks per node (the current default).

oksanaguba commented 2 years ago

You probably have already done this, but just as a sanity check, have you tried running standalone Homme first in such layouts? The same goes for scalability: before running scalability tests for E3SM, it would be good to confirm the scalability of Homme. If you did so, were the results good?

ambrad commented 2 years ago

I agree with Oksana.

There are some unscalable operations in EAM, EAMxx, and (to a much smaller extent since late 2018) Homme. It's possible that per-task memory is blowing up during initialization in a way that scales with the total number of tasks. It's also possible that EAMxx is even worse than EAM in this regard. Thus, in addition to running standalone Homme, you could also see if SCREAMv0 fails on 43x128x2.
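
To make the "unscalable operations" point concrete, here is a minimal, hypothetical sketch (not the actual EAM/EAMxx code) of the kind of pattern that makes per-task memory grow with the number of tasks:

    // Illustrative only: a per-rank buffer sized by the total number of ranks.
    // Memory per rank grows linearly with job size, so the aggregate memory
    // across the job grows quadratically.
    #include <mpi.h>
    #include <vector>

    int main (int argc, char** argv) {
      MPI_Init(&argc, &argv);

      int nranks = 0, rank = 0;
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // Every rank gathers one value from every other rank: O(nranks) storage
      // on each rank, O(nranks^2) across the whole job.
      std::vector<double> all_values(nranks);
      double my_value = static_cast<double>(rank);
      MPI_Allgather(&my_value, 1, MPI_DOUBLE,
                    all_values.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
    }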

ndkeen commented 2 years ago

Sure, it would be good to run standalone HOMME on pm-cpu first (among other tests and basic engineering); I just haven't had the luxury of time. I'm actually running into various potentially MPI-related limits on a few machines. I don't want to lump these all together into one issue, as I'm still trying to learn more about it, but I wanted to say I certainly see there could be a connection. I did try to run v0, but I'm running into some issues. I can start documenting the issues I'm seeing, but it sounds like Andrew might already know about some "unscalable operations"? Another data point: I recently ran some tests on Chrysalis and was able to run ne30 with 85 nodes at 64x1. Chrysalis uses OpenMPI vs. MPICH on Cori/PM. There are other MPI implementations on Cori that I can try.

ambrad commented 2 years ago

By "unscalable" ops I mean the ones in EAM's phys/dyn_grid initialization routines. But the Chrysalis data point is quite useful and I think implies that with high probability there's nothing on the app side that is fundamentally the issue.

ndkeen commented 2 years ago

With ne30 problems such as F2010-SCREAMv1-noAero.ne30_ne30, I'm hitting MPI hangs on pm-cpu. It doesn't need to be without SPA. Most cases are OK, but this happens as I increase the node count and number of MPI ranks.

I can see where it's stopped, which is:

#0  0x0000148e664501a9 in MPIDI_SHMI_progress () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#1  0x0000148e64f689d9 in MPIR_Waitall_impl () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#2  0x0000148e64fced71 in MPIR_Waitall () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#3  0x0000148e64fd026e in PMPI_Waitall () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#4  0x0000148e66c7216e in pmpi_waitall__ () from /opt/cray/pe/lib64/libmpifort_gnu_91.so.12
#5  0x0000000000c33890 in ice_boundary::ice_haloupdate2dr8 (array=..., halo=..., fieldloc=1, fieldkind=1, fillvalue=<optimized out>) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/mpi/ice_boundary.F90:1479
#6  0x0000000000c723bc in ice_grid::makemask () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/source/ice_grid.F90:1657
#7  0x0000000000c74493 in ice_grid::latlongrid () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/source/ice_grid.F90:1265
#8  0x0000000000c779e2 in ice_grid::init_grid2 () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/source/ice_grid.F90:338
#9  0x0000000000cc79d8 in cice_initmod::cice_init (mpicom_ice=-1006632930) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/drivers/cpl/CICE_InitMod.F90:109
#10 0x0000000000c0dd29 in ice_comp_mct::ice_init_mct (eclock=..., cdata_i=..., x2i_i=..., i2x_i=..., nlfilename=..., _nlfilename=_nlfilename@entry=6) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/drivers/cpl/ice_comp_mct.F90:242
#11 0x00000000004bed35 in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0xc0d260 <ice_comp_mct::ice_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., _nlfilename=6, 
    _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/component_mod.F90:258
#12 0x00000000004af3b7 in cime_comp_mod::cime_init () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/cime_comp_mod.F90:1462
#13 0x000000000048b5dd in cime_driver () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/cime_driver.F90:122
#14 main (argc=<optimized out>, argv=<optimized out>) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/cime_driver.F90:23
#15 0x0000148e62bcc2bd in __libc_start_main () from /lib64/libc.so.6
#16 0x00000000004a0aea in _start () at ../sysdeps/x86_64/start.S:120

And here is a different place it hangs during init:

#0  ofi_cq_read (cq_fid=0x73d36e0, buf=0x7fff1a637060, count=8) at prov/util/src/util_cq.c:283
#1  0x0000147f23dcce42 in MPIR_Wait_impl.part.0 () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#2  0x0000147f24b7bbc6 in MPIC_Wait () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#3  0x0000147f24b82074 in MPIC_Recv () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#4  0x0000147f24d7d9ea in MPIR_CRAY_Bcast_Tree () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#5  0x0000147f24d7e672 in MPIR_CRAY_Bcast () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#6  0x0000147f231f2915 in MPIR_Bcast () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#7  0x0000147f231f4198 in PMPI_Bcast () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12
#8  0x0000147f25b3da45 in pmpi_bcast__ () from /opt/cray/pe/lib64/libmpifort_gnu_91.so.12
#9  0x0000000000c3a891 in ice_broadcast::broadcast_scalar_int (scalar=0, root_pe=<optimized out>) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/mpi/ice_broadcast.F90:206
#10 0x0000000000c7c8cf in ice_history::init_hist (dt=1800) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/source/ice_history.F90:360
#11 0x0000000000cc76b7 in cice_initmod::cice_init (mpicom_ice=-1006632943) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/drivers/cpl/CICE_InitMod.F90:112
#12 0x0000000000c0dd29 in ice_comp_mct::ice_init_mct (eclock=..., cdata_i=..., x2i_i=..., i2x_i=..., nlfilename=..., _nlfilename=_nlfilename@entry=6) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/components/cice/src/drivers/cpl/ice_comp_mct.F90:242
#13 0x00000000004bed35 in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0xc0d260 <ice_comp_mct::ice_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., _nlfilename=6, 
    _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/component_mod.F90:258
#14 0x00000000004af3b7 in cime_comp_mod::cime_init () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/cime_comp_mod.F90:1462
#15 0x000000000048b5dd in cime_driver () at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/cime_driver.F90:122
#16 main (argc=<optimized out>, argv=<optimized out>) at /global/cfs/cdirs/e3sm/ndk/se09-jun13/driver-mct/main/cime_driver.F90:23
#17 0x0000147f21a972bd in __libc_start_main () from /lib64/libc.so.6
#18 0x00000000004a0aea in _start () at ../sysdeps/x86_64/start.S:120

ndkeen commented 2 years ago

I might have figured this out. I think using too many tasks for ICE can be problematic.

ndkeen commented 2 years ago

I am able to avoid this issue by reducing the number of tasks used for the ICE component. It's not a performance issue. I might leave the issue open, as I don't really like relying on the PE layout being correct. It seems like it shouldn't hang (or at least it should warn or error that there is a potential problem with the number of tasks used for ICE).

AaronDonahue commented 2 years ago

Hasn't this been an issue for quite some time now? I've definitely run into issues when I pushed the limit on the number of tasks used by the ATM component and had ICE complain unless I changed the PE layout to decrease the tasks assigned to ICE. And that was with EAMv0 and EAMv1.

ndkeen commented 2 years ago

You might be thinking of LND. With LND, there is a really annoying limit on the total number of tasks (MPI ranks times threads), but it will not tell you about that until runtime. For ICE, I've not seen an issue with using too many tasks. Certainly, using 5400 tasks for all components of an ne30 F case should be OK.

AaronDonahue commented 2 years ago

For me it was definitely in the ICE module. I can't recall what the limit ended up being, but I remember specifically decreasing the number of tasks assigned to ICE to avoid the issue. Maybe it has been resolved since then; this was a few years back.

ndkeen commented 2 years ago

I still need to use fewer MPI tasks in ICE to avoid the issue and to see better performance. I also found that I will get a non-obvious error when using more than 5400 MPI ranks (for ne30).

I have since run into other situations that hang on pm-cpu, including with vanilla E3SM. I assume these are different; the issue there is likely a network software problem, and we have workarounds via environment variables. Additionally, a yet different issue (also seen with vanilla E3SM) can be worked around by using fewer MPI tasks in CPL (plus another environment variable that is already in E3SM master).

So, just noting that it's possible there will be bug fixes to the network software to address all or most of these issues.

I could close the issue since we know how to run, but I would rather find a better solution, as hangs are really annoying.

These are the specific environment variables that work on pm-cpu and Alvarez. I haven't seen any side effects of using them, but they would be temporary:

      <env name="FI_CXI_RX_MATCH_MODE">software</env>
      <env name="FI_CXI_DEFAULT_CQ_SIZE">71680</env>
      <env name="FI_CXI_CQ_FILL_PERCENT">90</env>
      <env name="FI_CXI_REQ_BUF_SIZE">12582912</env>
      <env name="FI_UNIVERSE_SIZE">4096</env>

bartgol commented 2 years ago

> I still need to use fewer MPI tasks in ICE to avoid the issue and to see better performance. I also found that I will get a non-obvious error when using more than 5400 MPI ranks (for ne30).

With ne30, we have 5400 2d elements. Homme partitions the 2d grid by element, so more than 5400 MPI ranks would mean some rank has 0 elements, which is highly likely to break something.
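
For reference, a small sketch of that arithmetic (the extra rank count below is just an illustrative value): the cubed-sphere grid has 6 faces, each with ne x ne spectral elements.

    // Illustrative arithmetic only: ne30 has 6 * 30 * 30 = 5400 2d elements.
    #include <iostream>

    int main () {
      const int ne = 30;               // ne30 grid
      const int nelem = 6 * ne * ne;   // 6 cube faces * ne^2 elements = 5400
      const int atm_ranks = 5401;      // hypothetical: one rank too many
      if (atm_ranks > nelem) {
        std::cout << "With " << atm_ranks << " ranks and " << nelem
                  << " elements, at least one rank owns 0 elements.\n";
      }
      return 0;
    }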

PeterCaldwell commented 2 years ago

Should we add an error message if users try to run with more MPI ranks than we have elements?

Note that we might want to support overdecomposition in the future; it proved useful in some v0 situations where ranks were idle in dynamics but had work in physics. Definitely a v1 task to add this, though.

ndkeen commented 2 years ago

Would there be a way to use additional MPI ranks for some routines? I think with vanilla E3SM, more than nelem MPI ranks can be used in the physics, and we see a slight improvement.

Regarding an error, I would actually recommend printing a warning and then just using the best possible (maximum) number of ranks, unless we could issue the error at build/submit time, though I'd prefer that it let me do it.

bartgol commented 2 years ago

> Would there be a way to use additional MPI ranks for some routines? I think with vanilla E3SM, more than nelem MPI ranks can be used in the physics, and we see a slight improvement.

That seems complicated. I don't know how E3SM does that, but it seems like it might require quite a bit of work.

> Regarding an error, I would actually recommend printing a warning and then just using the best possible (maximum) number of ranks, unless we could issue the error at build/submit time, though I'd prefer that it let me do it.

We can certainly print an error if the atm comm size is larger than the global number of 2d elements.
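
For what it's worth, a minimal sketch of what such a check might look like, assuming an MPI communicator for the atmosphere and a known global 2d element count (the function and variable names here are illustrative, not EAMxx's actual API):

    // Sketch only: abort with a clear message if the atm communicator has
    // more ranks than the grid has 2d elements.
    #include <mpi.h>
    #include <cstdio>

    void check_atm_comm_size (MPI_Comm atm_comm, int num_global_2d_elems) {
      int comm_size = 0, rank = 0;
      MPI_Comm_size(atm_comm, &comm_size);
      MPI_Comm_rank(atm_comm, &rank);
      if (comm_size > num_global_2d_elems) {
        if (rank == 0) {
          std::fprintf(stderr,
              "Error: %d atm ranks requested, but the grid has only %d 2d "
              "elements, so some ranks would own no elements.\n",
              comm_size, num_global_2d_elems);
        }
        MPI_Abort(atm_comm, 1);
      }
    }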

PeterCaldwell commented 2 years ago

OK, could you just smoosh a warning or error message for this case into some other PR, Luca? I suspect it will just be a one-line change. I don't have a preference between a warning and an error message.

ndkeen commented 1 year ago

Noting that this is still an issue. I was hoping that adding the MPICH variable that places barriers before each bcast (which fixed other hangs on PM) would do the trick, but it did not.

It may be related to ne30_ne30? Maybe we aren't so concerned with this resolution?

These tests still hang:

SMS_Ln24_P4096x1.ne30_ne30.F2010-SCREAMv1.pm-cpu_gnu
SMS_Ln24_P5400x1.ne30_ne30.F2010-SCREAMv1.pm-cpu_gnu

while this test works: SMS_Ln24_P2700x1.ne30_ne30.F2010-SCREAMv1.pm-cpu_gnu

ndkeen commented 1 year ago

This is no longer hanging.