GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

Extreme output performance degradation at higher core counts in GCHP using EFA fabric #739

Closed · WilliamDowns closed this issue 3 years ago

WilliamDowns commented 3 years ago

I commented in #652 that I was encountering perpetual hangs at output time in GCHP using Intel MPI and Amazon's EFA fabric provider on AWS EC2. Consecutive 1-hour runs at c90 on 2 nodes would actually alternate hanging perpetually at output time, crashing at output time, and finishing with only a benign end-of-run crash, all without me modifying the environment submitted through Slurm. These issues were fixed by updating libfabric from 1.11.1 to 1.11.2. However, at higher core counts (288 cores across 8 nodes vs. 72 cores across 2 nodes in my original tests), I'm still running into indefinite hangs at output time using EFA with both OpenMPI and IntelMPI. Setting FI_PROVIDER=tcp fixes this issue (for OpenMPI; I get immediate crashes right now for TCP + Intel MPI on AWS), but is not a long-term fix. I've tried updating to MAPL 2.5 and cherry-picking https://github.com/GEOS-ESM/MAPL/commit/eda17539c040f5953c7e0656c342da4826a613bc and https://github.com/GEOS-ESM/MAPL/commit/bb20beeba61430069bf751ac27d89f540862d796 to no avail.
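For reference, the provider switch is just an environment variable that libfabric reads at MPI startup; roughly what I export in the Slurm-submitted environment (the launch line is illustrative for this setup):

export FI_PROVIDER=tcp   # workaround: force the TCP provider instead of efa
srun -n 288 ./gchp       # otherwise the same job script; not a long-term fix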

The hang seemingly occurs at o_clients%done_collective_stage() in MAPL_HistoryGridComp.F90. If I turn on libfabric debug logs, I get spammed with millions of lines of libfabric:13761:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted. and libfabric:13761:efa:ep_ctrl:rxr_ep_alloc_tx_entry():479<warn> TX entries exhausted. at this call, with these warnings continuing to be printed in OpenMPI every few seconds (I cancelled my job after 45 minutes, compared to 7 minutes to completion for TCP runs) but stopping indefinitely after one burst for Intel MPI.
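For reference, libfabric's debug logging is controlled with its standard environment variables; a sketch:

export FI_LOG_LEVEL=debug   # libfabric log verbosity; the "TX entries exhausted" lines are warn-level
export FI_LOG_PROV=efa      # optional: restrict logging to the EFA provider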

I plan to open an issue on the libfabric Github page, but I was wondering if anyone had any suggestions on further additions to MAPL post-2.5 I could try out that might affect this problem, or any suggestions on environment variables to test.

weiyuan-jiang commented 3 years ago

Does that mean a memory issue? If so, I suggest you split the biggest file into several smaller ones with fewer variables in each. And please use more servers, i.e., instead of [1,1,1], use [3]. The oserver will then distribute the files among different nodes so the output files are not concentrated on one node.

mathomp4 commented 3 years ago

One coincidence: I was helping out Dan Duffy, who's trying to run our FV3 Standalone on AWS using Intel MPI, and I found an issue by one @WilliamDowns that pointed me to MPIR_CVAR_CH4_OFI_ENABLE_RMA=0, which is letting him run.
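For reference, that flag is an MPICH CH4 tuning variable set in the job environment; roughly:

export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0   # disable native OFI RMA in the CH4 netmod; RMA then falls back to the active-message path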

Now my great fear is that somehow this flag is causing this issue in MAPL (we are running FV3 without any History or checkpointing because... it's not working well, or at all, on AWS for some reason).

I can say that on Discover we've found we (sometimes) need to pass in FI_PSM2_ flags to work around certain bugs, but that is Omni-Path. I can't see anything in fi_efa(7) that might directly help EFA here. 😦

WilliamDowns commented 3 years ago

Luckily I've tried toggling MPIR_CVAR_CH4_OFI_ENABLE_RMA on and off and it doesn't change anything for me anymore (possibly because I'm now running a very up-to-date libfabric / Intel-MPI 2019).

mathomp4 commented 3 years ago

> Luckily I've tried toggling MPIR_CVAR_CH4_OFI_ENABLE_RMA on and off and it doesn't change anything for me anymore (possibly because I'm now running a very up-to-date libfabric / Intel-MPI 2019).

Well, I can tell you it's still needed with Intel MPI 2021.1 from oneAPI for our tests!

WilliamDowns commented 3 years ago

I tried increasing the number of oservers (tested a range of values all the way up to 24) and increasing and decreasing npes_backend_pernode, without success (I actually seemed to run into hangs at very high server counts, where runtime was over 3 hours).

weiyuan-jiang commented 3 years ago

It will hang if the frontend npes exceed the model npes. Can you show me your CapOptions setup?

WilliamDowns commented 3 years ago

Here are my current CapOptions (which yield the same error I've been getting):

   type :: MAPL_CapOptions

      integer :: comm
      logical :: use_comm_world = .true.
      character(:), allocatable :: egress_file
      character(:), allocatable :: cap_rc_file
      type (ESMF_LogKind_Flag) :: esmf_logging_mode = ESMF_LOGKIND_NONE
      integer :: npes_model = 288
      ! only one of the next two options can have nonzero values
      integer, allocatable :: npes_input_server(:)
      integer, allocatable :: nodes_input_server(:)
      ! only one of the next two options can have nonzero values
      integer, allocatable :: npes_output_server(:)
      integer, allocatable :: nodes_output_server(:)
      ! whether or not the nodes are padded with idle processes when mod(model total npes, npes per node) /= 0
      logical              :: isolate_nodes = .true.
      ! whether or not to copy the data before isend to the oserver;
      ! it is faster, but demands more memory, when true
      logical              :: fast_oclient  = .true.
      ! server groups
      integer :: n_iserver_group = 1
      integer :: n_oserver_group = 1
      ! ensemble options
      integer :: n_members = 1
      character(:), allocatable :: ensemble_subdir_prefix
      ! logging options
      character(:), allocatable :: logging_config
      character(:), allocatable :: oserver_type
      integer :: npes_backend_pernode = 2
   end type MAPL_CapOptions

   interface MAPL_CapOptions
      module procedure new_CapOptions
   end interface

contains

   function new_CapOptions(unusable, cap_rc_file, egress_file, ensemble_subdir_prefix, esmf_logging_mode, rc) result (cap_options)
      type (MAPL_CapOptions) :: cap_options
      class (KeywordEnforcer), optional, intent(in) :: unusable
      character(*), optional, intent(in) :: cap_rc_file
      character(*), optional, intent(in) :: egress_file
      character(*), optional, intent(in) :: ensemble_subdir_prefix
      type(ESMF_LogKind_Flag), optional, intent(in) :: esmf_logging_mode

      integer, optional, intent(out) :: rc

      _UNUSED_DUMMY(unusable)

      cap_options%cap_rc_file = 'CAP.rc'
      cap_options%egress_file = 'EGRESS'
      cap_options%oserver_type= 'multigroup'
      !cap_options%oserver_type= 'single'
      cap_options%ensemble_subdir_prefix = 'mem'

      cap_options%npes_input_server  =[0]
      cap_options%nodes_input_server =[0]
      cap_options%npes_output_server =[0]
      !cap_options%nodes_output_server=[8]
      !cap_options%nodes_output_server=[1,1,1,1,1,1,1,1]
      cap_options%nodes_output_server=[4,4]

      if (present(cap_rc_file)) cap_options%cap_rc_file = cap_rc_file
      if (present(egress_file)) cap_options%egress_file = egress_file
      if (present(ensemble_subdir_prefix)) cap_options%ensemble_subdir_prefix = ensemble_subdir_prefix
      if (present(esmf_logging_mode)) cap_options%esmf_logging_mode = esmf_logging_mode

      _RETURN(_SUCCESS)

Anything commented out is a setting I've tried (I've also tried several other combinations of values for nodes_output_server, npes_backend_pernode, and n_oserver_group).

I will also note that the SpeciesConc History output file (the largest) writes up to 15.9GB, then stops for about 30 seconds, then I receive the segfault / HDF5 error (and receive an HDF error when trying to open the file either through Python or using ncdump). The size of a successfully saved version of the file is 16.2GB. Again, this file isn't the only issue since disabling this diagnostic and increasing the number of other History diagnostics also results in a crash, but thought it was interesting.
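For reference, the check above amounts to something like this (paths and filenames are hypothetical for this run):

ls -lh OutputDir/GEOSChem.SpeciesConc.*.nc4    # truncated file stops at 15.9GB; a complete one is 16.2GB
ncdump -h OutputDir/GEOSChem.SpeciesConc.*.nc4 # fails with an HDF error on the truncated file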

weiyuan-jiang commented 3 years ago

Maybe the size of 16 GB is too big? To make sure that the sum of the smaller collections doesn't exceed the limit, try npes_backend_pernode = 1. (I am not sure which version you have; the assertion that npes_backend_pernode >= 2 is wrong, so just comment that line out.)

WilliamDowns commented 3 years ago

I set npes_backend_pernode=1 and received an error when starting the run:

pe=00504 FAIL at line=00107    ServerManager.F90                        <captain-soldier need at lease two beckend>
pe=00504 FAIL at line=00193    MAPL_Cap.F90                             <status=1>
pe=00504 FAIL at line=00152    MAPL_Cap.F90                             <status=1>
pe=00504 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00504 FAIL at line=00030    GCHPctm.F90                              <status=1>

Also, I did a run with all HDF5 debug flags enabled. Here's the end of the output + error message:

H5Dclose(dset=360287970189639688 (dset)) = SUCCEED;
H5Dclose(dset=360287970189640008 (dset)) = SUCCEED;
H5Gclose(group=144115188075855872 (group)) = SUCCEED;
H5Fclose(file=72057594037927936 (file)) = FAIL;
H5Fget_obj_count(file=72057594037927936 (file), types=31) = 1;
H5Fget_obj_count(file=72057594037927936 (file), types=31) = 1;
H5Fget_obj_ids(file=72057594037927936 (file), types=1, max_objs=1, oid_list=0x0x90175b0) = 1;
H5Iget_name(id=72057594037927936 (file), name=0x0x7fffd5aac390, size=1024)gchp_bigdebug: H5Groot.c:96: H5G_rootof: Assertion `f->shared' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x2b8d7360d3ff in ???
#1  0x2b8d7360d387 in ???
#2  0x2b8d7360ea77 in ???
#3  0x2b8d736061a5 in ???
#4  0x2b8d73606251 in ???
#5  0x2b8d6dedc512 in H5G_rootof
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5Groot.c:96
#6  0x2b8d6dedd0e2 in H5G_root_loc
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5Groot.c:375
#7  0x2b8d6decec98 in H5G_loc_real
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5Gloc.c:162
#8  0x2b8d6e166d95 in H5VL__native_object_get
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5VLnative_object.c:149
#9  0x2b8d6e13c27b in H5VL__object_get
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5VLcallback.c:5438
#10  0x2b8d6e14eb41 in H5VL_object_get
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5VLcallback.c:5474
#11  0x2b8d6df283bd in H5Iget_name
        at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5I.c:2205
#12  0x2b8d72a4d68d in ???
#13  0x2b8d72a4d89b in ???
#14  0x2b8d72a4d91a in ???
#15  0x2b8d72a4f03d in ???
#16  0x2b8d72a4ed5e in ???
#17  0x2b8d72a4ef27 in ???
#18  0x2b8d72a4f503 in ???
#19  0x2b8d729f3937 in ???
#20  0x187aff0 in __pfio_netcdf4_fileformattermod_MOD_close
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/NetCDF4_FileFormatter.F90:272
#21  0x18e9e95 in __pfio_historycollectionmod_MOD_clear
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/HistoryCollection.F90:107
#22  0x18a4e7f in __pfio_serverthreadmod_MOD_clear_hist_collections
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:929
#23  0x1898cf3 in __pfio_multigroupservermod_MOD_start_back
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MultiGroupServer.F90:689
#24  0x189d1cc in __pfio_multigroupservermod_MOD_start
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MultiGroupServer.F90:180
#25  0x16fdac8 in __mapl_servermanager_MOD_initialize
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/ServerManager.F90:246
#26  0xdf993f in __mapl_capmod_MOD_initialize_io_clients_servers
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:192
#27  0xdf95aa in __mapl_capmod_MOD_run_ensemble
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:151
#28  0xdf9702 in __mapl_capmod_MOD_run
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:134
#29  0x42e07b in gchpctm_main
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:30
#30  0x42c918 in main
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:15
weiyuan-jiang commented 3 years ago

You need to comment out that line, which is obsolete (line 107 of ServerManager.F90).

WilliamDowns commented 3 years ago

Wow, I think I fixed the problem. During troubleshooting of the original hang I opened this issue for, I linked my run output directory to a shared volume on my cluster. It turns out this volume is only 20GB; I'm not sure how this happened, given that my cluster configuration settings indicate I requested 200GB for this volume. Sending output to the standard volume works perfectly. I'm still required to enable the oserver, but a single node or two is sufficient for writing our standard set of History diagnostics (still testing to figure out how many nodes / servers I need to use if I want to write all History). Thank you very much for your help troubleshooting.
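For anyone hitting something similar, the mismatch is easy to spot with standard disk tools (the mount point below is a placeholder for my setup):

df -h /shared             # the linked volume reported only 20G total, not the 200G I requested
du -sh /shared/OutputDir  # how much of it the run had already filled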

WilliamDowns commented 3 years ago

I've done a lot of testing with different MPI versions / implementations and different libfabric versions to try to pin down why / when a dedicated oserver is required when running GCHP on AWS EC2.

I confirmed that hangs/crashes with the EFA provider without extra oserver nodes still occur if I disable History but still write a checkpoint file. Disabling WRITE_RESTART_BY_OSERVER fixes this checkpoint hang (but the hang occurs at History write time as well if I re-enable History with this change).
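For reference, I toggle that flag in the run directory's resource file; a sketch, assuming it lives in GCHP.rc (the file name is an assumption and may differ by version):

sed -i 's/^WRITE_RESTART_BY_OSERVER: *YES/WRITE_RESTART_BY_OSERVER: NO/' GCHP.rc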

I also confirmed that using Intel compilers vs. GNU compilers makes no difference in the success of runs (tested with Intel oneAPI 2021.2 compilers and gcc 9.3.0). I've tried using Intel MPI 2021.2 but have been failing at initialization with the following error:

pe=00159 FAIL at line=00310    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00159 FAIL at line=00932    MAPL_CapGridComp.F90                     <status=1>
pe=00159 FAIL at line=00245    MAPL_Cap.F90                             <status=1>
pe=00159 FAIL at line=00211    MAPL_Cap.F90                             <status=1>
pe=00159 FAIL at line=00154    MAPL_Cap.F90                             <status=1>
pe=00159 FAIL at line=00129    MAPL_Cap.F90                             <status=1>
pe=00159 FAIL at line=00030    GCHPctm.F90                              <status=1>
pe=00181 FAIL at line=00310    MAPL_CapGridComp.F90                     <something impossible happened>
pe=00181 FAIL at line=00932    MAPL_CapGridComp.F90                     <status=1>
pe=00181 FAIL at line=00245    MAPL_Cap.F90                             <status=1>
pe=00181 FAIL at line=00211    MAPL_Cap.F90                             <status=1>
Abort(0) on node 159 (rank 159 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 159

This happens because it correctly detects that my CoresPerNode=36 but, for some reason, thinks that my npes=1. I have not tried Intel MPI 2021.1.

All of my runs below used Intel MPI 2019 Update 10 or OpenMPI 4.1.0, 288 cores (8 full nodes), and c90 or c180 for running the model unless otherwise noted.

OpenMPI

When using OpenMPI (version 4.1.0) and recent versions of libfabric, there is no crash/hang when using the TCP provider and no extra oserver. If I use the EFA provider with the same setup, a crash occurs at output time:

libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfalibfabric:13540:efa:cq:efa_cq_readerr():82<warn> Work completion status: remote invalid RD request
libfabric:13540:efa:cq:rxr_cq_handle_cq_error():345<warn> fi_cq_readerr: err: Input/output error (5), prov_err: unknown error (15)
libfabric:13540:efa:cq:rxr_cq_handle_rx_error():142<warn> rxr_cq_handle_rx_error: err: 5, prov_err: Unknown error -15 (15)
[compute-dy-c5n18xlarge-1][[36097,1],0][btl_ofi_context.c:443:mca_btl_ofi_context_progress] fi_cq_readerr: (provider err_code = 15)

[compute-dy-c5n18xlarge-1][[36097,1],0][btl_ofi_component.c:238:mca_btl_ofi_exit] BTL OFI will now abort.

Adding extra oserver nodes for use with OpenMPI + EFA + recent libfabric results in a successful run, but adds a significant amount of write time compared to just using TCP. I tested with between 3 and 8 nodes for the oserver and got anywhere between 10 and 30 extra minutes of runtime vs. TCP, with time decreasing as more nodes were added.

If I roll libfabric back to version 1.9.0 (dating from 2019; the current version is 1.12.1), no oserver is necessary when using OpenMPI with EFA, and performance similar to TCP is achieved in my short c90/c180 test runs. Libfabric versions 1.10.0 or greater produce the same outcomes (failures / extra oserver required) as version 1.12.1, and I think 1.9.1 does as well. I encounter no run issues with any libfabric version if I do not explicitly build OpenMPI with libfabric support.
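For reference, "building OpenMPI with libfabric support" here means pointing OpenMPI's configure at a libfabric install; a sketch with placeholder paths (--with-ofi is, as I understand it, the relevant OpenMPI 4.x configure option):

./configure --prefix=$HOME/sw/openmpi-4.1.0 --with-ofi=$HOME/sw/libfabric-1.9.0
make -j 8 && make install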

Intel MPI

For Intel MPI, any version newer than 2019 Update 4 hangs or crashes at output write time with EFA enabled and no extra oserver nodes. I have been unable to get EFA working with Intel MPI 2019 Update 4 itself, which is unfortunate given that this was the version used in Jiawei Zhuang's GCHP-on-the-cloud paper. Jiawei used GCHP 12.3.2, which I believe predated the I/O server in MAPL. Using older versions of Intel MPI + libfabric yielded no change in performance or run outcome, except that Intel MPI 2019U4 with EFA, or libfabric <1.9.1 with EFA, resulted in a libfabric crash at initialization that I've been unable to work around.

Extra oserver nodes are necessary at c90+ when using Intel MPI with EFA to avoid hangs / crashes. When an oserver is used with EFA, Intel MPI performs much better with output than OpenMPI does (e.g. 9min total runtime for Intel MPI + EFA at c180 with default diagnostics and 1 oserver node vs. 30min for OpenMPI with 3 oserver nodes). 1 oserver node is sufficient for c90 with all diagnostics for Intel MPI + EFA and for c180 with default diagnostics. At least 4 oserver nodes are required for c180 with all diagnostics enabled.

Extra oserver nodes are not necessary for TCP with Intel MPI at c90, but TCP has identical oserver requirements as EFA for Intel MPI at c180 (1 node for default diagnostics, 4 nodes for all diagnostics). Note that this differs from OpenMPI where extra oserver nodes are never required with TCP.

Intel MPI + EFA does not require an extra oserver node at c24 on 288 cores. At c90 or c180 on 288 cores with no extra oserver node, a hang occurs at output. I mentioned in my original post that this hang occurs somewhere in call o_clients%done_collective_stage() in MAPL_HistoryGridComp.F90. From further tracing, it seems like this occurs in MPI_win_fence. The stack traces below are all from run setups that actually crash rather than hang, and many of them explicitly crash during calls to MPI_win_fence.

When I bump down to 216 cores at c90 or c180, a crash occurs at output with the following stack trace (this is with libfabric 1.12.1):

libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TXforrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
gchp               0000000002EFB4CA  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AE561D9B630  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002AE560E6673F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002AE560E5CB5E  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002AE560E5BB90  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002AE560E7A937  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002AE560A7FC1F  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002AE56101EA24  MPI_Win_fence         Unknown  Unknown
libmpifort.so.12.  00002AE5606085DD  pmpi_win_fence_       Unknown  Unknown
gchp               0000000001C50BFE  Unknown               Unknown  Unknown
gchp               0000000001B528DE  Unknown               Unknown  Unknown
gchp               0000000001BB967E  Unknown               Unknown  Unknown
gchp               0000000001B38438  Unknown               Unknown  Unknown
gchp               0000000001C55424  Unknown               Unknown  Unknown
gchp               0000000001B49999  Unknown               Unknown  Unknown
gchp               0000000001BCD13C  Unknown               Unknown  Unknown
gchp               0000000001BD6CBB  Unknown               Unknown  Unknown
gchp               000000000101A279  mapl_historygridc        3571  MAPL_HistoryGridComp.F90
gchp               000000000249000E  Unknown               Unknown  Unknown
gchp               00000000024940FB  Unknown               Unknown  Unknown
gchp               00000000024A6CA2  Unknown               Unknown  Unknown
gchp               000000000249166A  Unknown               Unknown  Unknown
gchp               0000000001EEFF7D  Unknown               Unknown  Unknown
gchp               0000000002443371  Unknown               Unknown  Unknown
gchp               000000000164191C  mapl_genericmod_m        1834  MAPL_Generic.F90
gchp               000000000249000E  Unknown               Unknown  Unknown
gchp               00000000024940FB  Unknown               Unknown  Unknown
gchp               00000000024A6CA2  Unknown               Unknown  Unknown
gchp               000000000249166A  Unknown               Unknown  Unknown
gchp               0000000001EEFF7D  Unknown               Unknown  Unknown
gchp               0000000002443371  Unknown               Unknown  Unknown
gchp               0000000001006482  mapl_capgridcompm        1244  MAPL_CapGridComp.F90
gchp               0000000001005A11  mapl_capgridcompm        1170  MAPL_CapGridComp.F90
gchp               0000000001005700  mapl_capgridcompm        1117  MAPL_CapGridComp.F90
gchp               000000000100513C  mapl_capgridcompm         809  MAPL_CapGridComp.F90
gchp               000000000249000E  Unknown               Unknown  Unknown
gchp               00000000024940FB  Unknown               Unknown  Unknown
gchp               00000000024A6CA2  Unknown               Unknown  Unknown
gchp               000000000249166A  Unknown               Unknown  Unknown
gchp               0000000001EEFF7D  Unknown               Unknown  Unknown
gchp               0000000002443371  Unknown               Unknown  Unknown
gchp               000000000100971F  mapl_capgridcompm         948  MAPL_CapGridComp.F90
gchp               0000000000FFF6DE  mapl_capmod_mp_ru         246  MAPL_Cap.F90
gchp               0000000000FFEED5  mapl_capmod_mp_ru         211  MAPL_Cap.F90
gchp               0000000000FFDADD  mapl_capmod_mp_ru         154  MAPL_Cap.F90
gchp               0000000000FFD1DF  mapl_capmod_mp_ru         129  MAPL_Cap.F90
gchp               000000000053D4BF  MAIN__                     30  GCHPctm.F90
gchp               000000000053C2A2  Unknown               Unknown  Unknown
libc-2.17.so       00002AE56306C555  __libc_start_main     Unknown  Unknown
gchp               000000000053C1A9  Unknown               Unknown  Unknown

The uppermost intelligible line number is at call o_Clients%done_collective_stage() in MAPL_HistoryGridComp. This crash also occurs at 288 cores if I force I_MPI_FABRICS=ofi (rather than its default shm:ofi), with some extra error output (removed repeats) after the stack trace:

libfabric:11185:efa:cq:efa_cq_readerr():82<warn> Work completion status: remote invalid RD request
libfabric:11185:efa:cq:rxr_cq_handle_cq_error():345<warn> fi_cq_readerr: err: Input/output error (5), prov_err: unknown error (15)
libfabric:11185:efa:cq:rxr_cq_handle_tx_error():221<warn> rxr_cq_handle_tx_error: err: 5, prov_err: Unknown error -15 (15)
Abort(807550351) on node 252 (rank 252 in comm 0): Fatal error in PMPI_Win_fence: Other MPI error, error stack:
PMPI_Win_fence(124)............: MPI_Win_fence(assert=0, win=0xa0000004) failed
MPID_Win_fence(264)............:
MPIDIG_mpi_win_fence(489)......:
MPIDI_Progress_test(185).......:
MPIDI_OFI_handle_cq_error(1042): OFI poll failed (ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Input/output error)

If I force Intel MPI 2019U10 to use its internal libfabric version and tell it to use EFA (and reset I_MPI_FABRICS=shm:ofi), I get a more detailed segfault crash message at output time:

#0  0x2b1adee3462f in ???
#1  0x2b1addeff73f in MPIDI_OFI_handle_lmt_ack
        at ../../src/mpid/ch4/netmod/include/../ofi/ofi_am_events.h:442
#2  0x2b1addeff73f in am_recv_event
        at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:709
#3  0x2b1addef5b5d in MPIDI_OFI_dispatch_function
        at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:830
#4  0x2b1addef4b8f in MPIDI_OFI_handle_cq_entries
        at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:957
#5  0x2b1addf13936 in MPIDI_OFI_progress
        at ../../src/mpid/ch4/netmod/ofi/ofi_progress.c:40
#6  0x2b1addb18c1e in MPIDI_Progress_test
        at ../../src/mpid/ch4/src/ch4_progress.c:181
#7  0x2b1addb18c1e in MPID_Progress_test
        at ../../src/mpid/ch4/src/ch4_progress.c:236
#8  0x2b1ade0b7a23 in MPIDIG_mpi_win_fence
        at ../../src/mpid/ch4/src/ch4r_win.h:489
#9  0x2b1ade0b7a23 in MPID_Win_fence
        at ../../src/mpid/ch4/src/ch4_win.h:257
#10  0x2b1ade0b7a23 in PMPI_Win_fence
        at ../../src/mpi/rma/win_fence.c:108
#11  0x2b1add4815dc in pmpi_win_fence_
        at ../../src/binding/fortran/mpif_h/win_fencef.c:269
#12  0x191ecab in __pfio_rdmareferencemod_MOD_fence
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/RDMAReference.F90:159
#13  0x188406c in __pfio_baseservermod_MOD_receive_output_data
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/BaseServer.F90:75
#14  0x18ab31d in __pfio_serverthreadmod_MOD_handle_done_collective_stage
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:980
#15  0x18ab31d in __pfio_serverthreadmod_MOD_handle_done_collective_stage
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:961
#16  0x1922594 in __pfio_messagevisitormod_MOD_handle
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MessageVisitor.F90:269
#17  0x1878971 in __pfio_abstractmessagemod_MOD_dispatch
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/AbstractMessage.F90:115
#18  0x187d5dc in __pfio_simplesocketmod_MOD_send
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/SimpleSocket.F90:105
#19  0x18aded7 in __pfio_clientthreadmod_MOD_done_collective_stage
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ClientThread.F90:429
#20  0x18b2424 in __pfio_clientmanagermod_MOD_done_collective_stage
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ClientManager.F90:381
#21  0xe1d437 in run
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/History/MAPL_HistoryGridComp.F90:3571
#22  0x1fcfaed in ???
#23  0x1fcfd93 in ???
#24  0x1cf6580 in ???
#25  0x1cc67a4 in ???
#26  0x1fcdeb8 in ???
#27  0x1b5651c in ???
#28  0x1f76eaf in ???
#29  0x138d76f in mapl_genericwrapper
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/MAPL_Generic.F90:1860
#30  0x1fcfaed in ???
#31  0x1fcfd93 in ???
#32  0x1cf6580 in ???
#33  0x1cc67a4 in ???
#34  0x1fcdeb8 in ???
#35  0x1b5651c in ???
#36  0x1f76eaf in ???
#37  0xdf8e07 in last_phase
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1256
#38  0xdf8e07 in __mapl_capgridcompmod_MOD_step
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1167
#39  0xdf95e0 in run_mapl_gridcomp
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1116
#40  0xdf95e0 in run_gc
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:815
#41  0x1fcfaed in ???
#42  0x1fcfd93 in ???
#43  0x1cf6580 in ???
#44  0x1cc67a4 in ???
#45  0x1fcdeb8 in ???
#46  0x1b5651c in ???
#47  0x1f76eaf in ???
#48  0xdf69ed in __mapl_capgridcompmod_MOD_run
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:952
#49  0xdf38b6 in __mapl_capmod_MOD_run_model
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:246
#50  0xdf313f in __mapl_capmod_MOD_run_member
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:211
#51  0xdf335e in __mapl_capmod_MOD_run_ensemble
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:151
#52  0xdf3442 in __mapl_capmod_MOD_run
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:134
#53  0x427dbb in gchpctm_main
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:30
#54  0x426658 in main
        at /shared/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:15
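For reference, the Intel MPI fabric/provider combinations described above are selected purely through environment variables; a sketch (variable names are the standard Intel MPI / libfabric ones, with values as described in this thread):

export I_MPI_FABRICS=shm:ofi          # default: shared memory intra-node, OFI inter-node
# export I_MPI_FABRICS=ofi            # force OFI for all communication (the 288-core crash case above)
export FI_PROVIDER=efa                # or tcp
export I_MPI_OFI_LIBRARY_INTERNAL=1   # use the libfabric bundled with Intel MPI instead of the external build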

The crash message for Intel MPI at c180 with TCP (removing many repeated messages) is:

libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:16158:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:16158:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (17) not found in wait list - 0x60190f0
libfabric:19854:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:19854:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (66) not found in wait list - 0x617d0f0
libfabric:16151:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:16151:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (66) not found in wait list - 0x61a80f0
libfabric:13201:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
srun: error: compute-dy-c5n18xlarge-3: task 74: Killed
libfabric:16148:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:16148:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (63) not found in wait list - 0x68500f0
libfabric:16158:tcp:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:16158:tcp:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:16158:tcp:ep_ctrl:tcpx_cm_send_req():384<warn> connection failure
libfabric:16158:ofi_rxm:ep_ctrl:rxm_eq_readerr():83<warn> fi_eq_readerr: err: Unknown error -111 (-111), prov_err: Success (0)
libfabric:16158:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1288<warn> Unknown event: 4224956323
WilliamDowns commented 3 years ago

I'm able to reproduce the hang in Intel MPI with EFA (seemingly in MPI_Win_Fence, results in the TX entries exhausted message being spammed) by running the following test program across multiple nodes with 1 process per node:

test.f (slightly modified from an issue on the Intel forums):

      program test

      use, intrinsic :: iso_fortran_env
      use, intrinsic :: iso_c_binding

      implicit none
      include 'mpif.h'
      character(8) :: arg
      real(real64), pointer, contiguous :: vec(:)
      real(real64) :: rdum
      integer      :: Mrank,Msize,Nrank,Nsize,Mcomm,Ncomm,Msplit,Merr
      integer      :: i,win_vec,disp,n,j
      integer(kind=mpi_address_kind) :: size
      type(c_ptr)  :: ptr

      nullify(vec)

      call mpi_init(Merr)
      call mpi_comm_size(mpi_comm_world,Msize,Merr)
      call mpi_comm_rank(mpi_comm_world,Mrank,Merr)
      Mcomm = mpi_comm_world

      call mpi_comm_split_type(Mcomm,mpi_comm_type_shared,0,
     &  mpi_info_null,Ncomm,Merr)
      call mpi_comm_size(Ncomm,Nsize,Merr)
      call mpi_comm_rank(Ncomm,Nrank,Merr)

      if(Nrank==0) then ; size = 1
      else              ; size = 0
      endif

      disp = 1
      call mpi_win_allocate(size,disp,mpi_info_null,Ncomm,ptr,
     &  win_vec,Merr)

      call c_f_pointer(ptr,vec,(/10/))
      call mpi_win_fence(0,win_vec,Merr)

      if(Mrank==0) then
        call getarg(1,arg)
        read(arg,*) n
      endif

      if(Nrank==0) vec = 0

      do j = 1,n
        do i = 1,10
          rdum = i*1d0
          call mpi_accumulate(rdum,1,MPI_DOUBLE_PRECISION,0,
     &      (i-1)*8_MPI_ADDRESS_KIND,1,MPI_DOUBLE_PRECISION,
     &      MPI_SUM,win_vec,Merr)
        enddo
      enddo

      if(Mrank==0) write(*,*) vec

      call mpi_win_fence(0,win_vec,Merr)
      write(*,*) 'Passed final fence'
      nullify(vec)
      call mpi_win_free(win_vec,Merr)
      write(*,*) 'Passed win_free'
      call mpi_comm_free(Ncomm,Merr)
      write(*,*) 'Passed comm_free'
      call mpi_finalize(Merr)
      write(*,*) 'Passed finalize'

      end

Using more than 1 process per node results in a successful completion. Using TCP results in a successful completion but with spam of libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool.

My run script:

#!/bin/bash
#SBATCH --ntasks=2 --nodes=2

#assorted module loads 

export FI_PROVIDER=efa   # exported so the MPI processes pick up the EFA provider
mpif90 test.f
mpirun -np 2 ./a.out 8491

Running for any fewer than 8491 iterations completes successfully and in only a few seconds (but outputs many lines of libfabric:6558:efa:ep_ctrl:rxr_ep_alloc_tx_entry():429<warn> TX entries exhausted. before finishing). Running for >=8491 iterations results in a hang.
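In other words, with the same script and only the provider and iteration count varied:

export FI_PROVIDER=efa
mpirun -np 2 ./a.out 8490   # completes in a few seconds, after many "TX entries exhausted" warnings
mpirun -np 2 ./a.out 8491   # hangs
export FI_PROVIDER=tcp
mpirun -np 2 ./a.out 8491   # completes, but spams "Ran out of buffers from Eager buffer pool"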

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

stale[bot] commented 3 years ago

Closing due to inactivity

LiamBindle commented 3 years ago

Re-opening. @laestrada and I are going to pick this up.

LiamBindle commented 3 years ago

I'm going to close this issue in favor of https://github.com/GEOS-ESM/MAPL/issues/1184. It has more up-to-date info on the outstanding issue.