Closed WilliamDowns closed 3 years ago
Does that mean a memory issue? If so, I suggest you divide the biggest file into several smaller ones with fewer variables in each. And please use more servers, i.e., do not use [1,1,1]; instead, use [3]. The oserver will distribute the files among different nodes so the output files are not concentrated on one node.
One coincidence: I was helping out Dan Duffy, who's trying to run our FV3 Standalone on AWS using Intel MPI, and I found an issue by one @WilliamDowns that pointed me to MPIR_CVAR_CH4_OFI_ENABLE_RMA=0, which is letting him run.
Now my great fear is that somehow this flag is causing this issue in MAPL (we are running FV3 without any History or checkpointing because... it's not working well/at all on AWS for some reason).
I can say that on Discover, we've found we (sometimes) need to pass in FI_PSM2_ flags to work around certain bugs. But that is Omni-Path, and I can't see anything in fi_efa(7) that might directly help EFA here. 😦
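For reference, here is a minimal sketch of how that MPIR_CVAR_CH4_OFI_ENABLE_RMA workaround is typically applied; the job size, module handling, and executable name below are placeholders rather than details from the actual setup:

#!/bin/bash
#SBATCH --ntasks=288 --nodes=8
#assorted module loads

# Disable the CH4/OFI native RMA path so one-sided MPI traffic avoids the
# provider's RMA implementation (the workaround referenced above).
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0

mpirun -np 288 ./gchp   # placeholder executable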
Luckily I've tried toggling MPIR_CVAR_CH4_OFI_ENABLE_RMA on and off and it doesn't change anything for me anymore (possibly because I'm now running a very up-to-date libfabric / Intel-MPI 2019).
Well, I can tell you it's still needed with Intel MPI 2021.1 from oneAPI for our tests!
I tried increasing the number of oservers (tested a range of values all the way up to 24) and increased and decreased npes_backend_pernode, without success (at very high server counts the run actually seemed to hang, with runtime exceeding 3 hours).
It will hang if the frontend npes exceed the npes of the model. Can you show me your CapOptions setup?
Here are my current CapOptions (which yield the same error I've been getting):
type :: MAPL_CapOptions
   integer :: comm
   logical :: use_comm_world = .true.
   character(:), allocatable :: egress_file
   character(:), allocatable :: cap_rc_file
   type (ESMF_LogKind_Flag) :: esmf_logging_mode = ESMF_LOGKIND_NONE
   integer :: npes_model = 288
   ! only one of the next two options can have nonzero values
   integer, allocatable :: npes_input_server(:)
   integer, allocatable :: nodes_input_server(:)
   ! only one of the next two options can have nonzero values
   integer, allocatable :: npes_output_server(:)
   integer, allocatable :: nodes_output_server(:)
   ! whether or not the nodes are padding with idle when mod(model total npes , each node npes) /=0
   logical :: isolate_nodes = .true.
   ! whether or not copy the data before isend to the oserver
   ! it is faster but demands more memory if it is true
   logical :: fast_oclient = .true.
   ! server groups
   integer :: n_iserver_group = 1
   integer :: n_oserver_group = 1
   ! ensemble options
   integer :: n_members = 1
   character(:), allocatable :: ensemble_subdir_prefix
   ! logging options
   character(:), allocatable :: logging_config
   character(:), allocatable :: oserver_type
   integer :: npes_backend_pernode = 2
end type MAPL_CapOptions

interface MAPL_CapOptions
   module procedure new_CapOptions
end interface

contains

function new_CapOptions(unusable, cap_rc_file, egress_file, ensemble_subdir_prefix, esmf_logging_mode, rc) result (cap_options)
   type (MAPL_CapOptions) :: cap_options
   class (KeywordEnforcer), optional, intent(in) :: unusable
   character(*), optional, intent(in) :: cap_rc_file
   character(*), optional, intent(in) :: egress_file
   character(*), optional, intent(in) :: ensemble_subdir_prefix
   type(ESMF_LogKind_Flag), optional, intent(in) :: esmf_logging_mode
   integer, optional, intent(out) :: rc

   _UNUSED_DUMMY(unusable)

   cap_options%cap_rc_file = 'CAP.rc'
   cap_options%egress_file = 'EGRESS'
   cap_options%oserver_type = 'multigroup'
   !cap_options%oserver_type = 'single'
   cap_options%ensemble_subdir_prefix = 'mem'
   cap_options%npes_input_server  = [0]
   cap_options%nodes_input_server = [0]
   cap_options%npes_output_server = [0]
   !cap_options%nodes_output_server = [8]
   !cap_options%nodes_output_server = [1,1,1,1,1,1,1,1]
   cap_options%nodes_output_server = [4,4]

   if (present(cap_rc_file)) cap_options%cap_rc_file = cap_rc_file
   if (present(egress_file)) cap_options%egress_file = egress_file
   if (present(ensemble_subdir_prefix)) cap_options%ensemble_subdir_prefix = ensemble_subdir_prefix
   if (present(esmf_logging_mode)) cap_options%esmf_logging_mode = esmf_logging_mode

   _RETURN(_SUCCESS)
Anything commented out is a setting I've tried (I've also tried several other combinations of values for nodes_output_server, npes_backend_pernode, and n_oserver_group).
I will also note that the SpeciesConc History output file (the largest) writes up to 15.9 GB, then stops for about 30 seconds, and then I receive the segfault / HDF5 error (I also get an HDF error when trying to open the file through either Python or ncdump). The size of a successfully saved version of the file is 16.2 GB. Again, this file isn't the only issue, since disabling this diagnostic and increasing the number of other History diagnostics also results in a crash, but I thought it was interesting.
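A quick way to confirm the file really is truncated rather than merely slow to close (the filename below is a placeholder for the actual SpeciesConc collection file):

f=GCHP.SpeciesConc.20190701_0000z.nc4   # placeholder name
ls -lh "$f"      # on-disk size vs. the ~16.2 GB of a successful run
ncdump -h "$f"   # fails with an HDF error if the write was cut off
h5ls "$f"        # same check via the HDF5 command-line tools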
Maybe the 16 GB size is too big? To make sure the sum of the smaller collections doesn't exceed the limit, try npes_backend_pernode = 1. (I'm not sure which version you have, but the assertion npes_backend_pernode >= 2 is wrong; just comment that line out.)
I set npes_backend_pernode=1 and received an error when starting the run:
pe=00504 FAIL at line=00107 ServerManager.F90 <captain-soldier need at lease two beckend>
pe=00504 FAIL at line=00193 MAPL_Cap.F90 <status=1>
pe=00504 FAIL at line=00152 MAPL_Cap.F90 <status=1>
pe=00504 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00504 FAIL at line=00030 GCHPctm.F90 <status=1>
Also, I did a run with all HDF5 debug flags enabled. Here's the end of the output + error message:
H5Dclose(dset=360287970189639688 (dset)) = SUCCEED;
H5Dclose(dset=360287970189640008 (dset)) = SUCCEED;
H5Gclose(group=144115188075855872 (group)) = SUCCEED;
H5Fclose(file=72057594037927936 (file)) = FAIL;
H5Fget_obj_count(file=72057594037927936 (file), types=31) = 1;
H5Fget_obj_count(file=72057594037927936 (file), types=31) = 1;
H5Fget_obj_ids(file=72057594037927936 (file), types=1, max_objs=1, oid_list=0x0x90175b0) = 1;
H5Iget_name(id=72057594037927936 (file), name=0x0x7fffd5aac390, size=1024)gchp_bigdebug: H5Groot.c:96: H5G_rootof: Assertion `f->shared' failed.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x2b8d7360d3ff in ???
#1 0x2b8d7360d387 in ???
#2 0x2b8d7360ea77 in ???
#3 0x2b8d736061a5 in ???
#4 0x2b8d73606251 in ???
#5 0x2b8d6dedc512 in H5G_rootof
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5Groot.c:96
#6 0x2b8d6dedd0e2 in H5G_root_loc
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5Groot.c:375
#7 0x2b8d6decec98 in H5G_loc_real
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5Gloc.c:162
#8 0x2b8d6e166d95 in H5VL__native_object_get
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5VLnative_object.c:149
#9 0x2b8d6e13c27b in H5VL__object_get
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5VLcallback.c:5438
#10 0x2b8d6e14eb41 in H5VL_object_get
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5VLcallback.c:5474
#11 0x2b8d6df283bd in H5Iget_name
at /tmp/centos/spack-stage/spack-stage-hdf5-develop-a27xz7itza4l2htmlhfkyreobuw372sq/spack-src/src/H5I.c:2205
#12 0x2b8d72a4d68d in ???
#13 0x2b8d72a4d89b in ???
#14 0x2b8d72a4d91a in ???
#15 0x2b8d72a4f03d in ???
#16 0x2b8d72a4ed5e in ???
#17 0x2b8d72a4ef27 in ???
#18 0x2b8d72a4f503 in ???
#19 0x2b8d729f3937 in ???
#20 0x187aff0 in __pfio_netcdf4_fileformattermod_MOD_close
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/NetCDF4_FileFormatter.F90:272
#21 0x18e9e95 in __pfio_historycollectionmod_MOD_clear
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/HistoryCollection.F90:107
#22 0x18a4e7f in __pfio_serverthreadmod_MOD_clear_hist_collections
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:929
#23 0x1898cf3 in __pfio_multigroupservermod_MOD_start_back
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MultiGroupServer.F90:689
#24 0x189d1cc in __pfio_multigroupservermod_MOD_start
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MultiGroupServer.F90:180
#25 0x16fdac8 in __mapl_servermanager_MOD_initialize
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/ServerManager.F90:246
#26 0xdf993f in __mapl_capmod_MOD_initialize_io_clients_servers
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:192
#27 0xdf95aa in __mapl_capmod_MOD_run_ensemble
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:151
#28 0xdf9702 in __mapl_capmod_MOD_run
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:134
#29 0x42e07b in gchpctm_main
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:30
#30 0x42c918 in main
at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:15
You need to comment out that assertion at line=00107 of ServerManager.F90; it is obsolete.
Wow, I think I fixed the problem. During troubleshooting of the original hang I opened this issue for, I linked my run output directory to a shared volume on my cluster. It turns out this volume is only 20GB. I'm not sure how this happened given my cluster configuration settings indicate I requested this volume to be 200GB. Sending output to the standard volume works perfectly. I'm still required to enable the oserver, but one or two nodes are sufficient for writing our standard set of History diagnostics (still testing to figure out how many nodes / servers I need to use if I want to write all History). Thank you very much for your help troubleshooting.
I've done a lot of testing with different MPI versions / implementations and different libfabric versions to try to pin down why / when a dedicated oserver is required when running GCHP on AWS EC2.
I confirmed that hangs/crashes with the EFA provider and no extra oserver nodes still occur if I disable History but still write a checkpoint file. Disabling WRITE_RESTART_BY_OSERVER fixes this checkpoint hang (but the hang occurs at History write time as well if I re-enable History with this change).
I also confirmed that using Intel compilers vs. GNU compilers makes no difference in the success of runs (tested with Intel oneapi 2021.2 compilers and gcc 9.3.0). I've tried using Intel MPI 2021.2 but have been failing at initialization with the following error:
pe=00159 FAIL at line=00310 MAPL_CapGridComp.F90 <something impossible happened>
pe=00159 FAIL at line=00932 MAPL_CapGridComp.F90 <status=1>
pe=00159 FAIL at line=00245 MAPL_Cap.F90 <status=1>
pe=00159 FAIL at line=00211 MAPL_Cap.F90 <status=1>
pe=00159 FAIL at line=00154 MAPL_Cap.F90 <status=1>
pe=00159 FAIL at line=00129 MAPL_Cap.F90 <status=1>
pe=00159 FAIL at line=00030 GCHPctm.F90 <status=1>
pe=00181 FAIL at line=00310 MAPL_CapGridComp.F90 <something impossible happened>
pe=00181 FAIL at line=00932 MAPL_CapGridComp.F90 <status=1>
pe=00181 FAIL at line=00245 MAPL_Cap.F90 <status=1>
pe=00181 FAIL at line=00211 MAPL_Cap.F90 <status=1>
Abort(0) on node 159 (rank 159 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 159
which is because, for some reason, it correctly detects that my CoresPerNode=36 but thinks that my npes=1. I have not tried Intel MPI 2021.1.
All of my runs below used Intel MPI 2019 Update 10 or OpenMPI 4.1.0, 288 cores (8 full nodes), and c90 or c180 for running the model unless otherwise noted.
When using OpenMPI (version 4.1.0) and recent versions of libfabric, there is no crash/hang when using the TCP provider and no extra oserver. If I use the EFA provider with the same setup, a crash occurs at output time:
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:12402:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfalibfabric:13540:efa:cq:efa_cq_readerr():82<warn> Work completion status: remote invalid RD request
libfabric:13540:efa:cq:rxr_cq_handle_cq_error():345<warn> fi_cq_readerr: err: Input/output error (5), prov_err: unknown error (15)
libfabric:13540:efa:cq:rxr_cq_handle_rx_error():142<warn> rxr_cq_handle_rx_error: err: 5, prov_err: Unknown error -15 (15)
[compute-dy-c5n18xlarge-1][[36097,1],0][btl_ofi_context.c:443:mca_btl_ofi_context_progress] fi_cq_readerr: (provider err_code = 15)
[compute-dy-c5n18xlarge-1][[36097,1],0][btl_ofi_component.c:238:mca_btl_ofi_exit] BTL OFI will now abort.
Adding extra oserver nodes for use with OpenMPI + EFA + recent libfabric results in a successful run, but adds a significant amount of write time compared to just using TCP. I tested with between 3 and 8 nodes for the oserver and got anywhere between 10 and 30 extra minutes of runtime vs. TCP, with time decreasing as more nodes were added.
If I roll libfabric back to version 1.9.0 (dating from 2019; the current version is 1.12.1), no oserver is necessary when using OpenMPI with EFA, and my short c90/c180 test runs perform similarly to TCP. Libfabric versions 1.10.0 or greater produce the same outcomes (failures / requiring an extra oserver) as 1.12.1, and I think 1.9.1 does as well. With any libfabric version, I encounter no run issues if I do not explicitly build OpenMPI with libfabric support.
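Roughly how these OpenMPI runs were switched between providers and libfabric builds; the paths and executable name are illustrative:

# point OpenMPI's OFI components at a particular libfabric build
export LD_LIBRARY_PATH=/path/to/libfabric-1.9.0/lib:$LD_LIBRARY_PATH

# pick the fabric provider libfabric should use
export FI_PROVIDER=efa    # hangs/crashes at output with libfabric >= 1.10 unless extra oserver nodes are added
#export FI_PROVIDER=tcp   # completes without a dedicated oserver

mpirun -np 288 ./gchp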
For Intel MPI, any version of Intel MPI >2019 Update 4 hangs or crashes at output write time with EFA enabled without extra oserver nodes. I have been unable to get EFA working with Intel MPI for Intel MPI 2019 Update 4, which is unfortunate given this was the version used in Jiawei Zhuang's GCHP-on-the-cloud paper. Jiawei used GCHP 12.3.2, which I believe predated the I/O server in MAPL. Using older versions of Intel MPI + libfabric yielded no changes in performance / run outcome in my runs, except using Intel MPI 2019U4 with EFA or using libfabric <1.9.1 with EFA which resulted in a libfabric crash at initialization that I've been unable to work around.
Extra oserver nodes are necessary at c90+ when using Intel MPI with EFA to avoid hangs / crashes. When an oserver is used with EFA, Intel MPI performs much better with output than OpenMPI does (e.g. 9min total runtime for Intel MPI + EFA at c180 with default diagnostics and 1 oserver node vs. 30min for OpenMPI with 3 oserver nodes). 1 oserver node is sufficient for c90 with all diagnostics for Intel MPI + EFA and for c180 with default diagnostics. At least 4 oserver nodes are required for c180 with all diagnostics enabled.
Extra oserver nodes are not necessary for TCP with Intel MPI at c90, but TCP has identical oserver requirements as EFA for Intel MPI at c180 (1 node for default diagnostics, 4 nodes for all diagnostics). Note that this differs from OpenMPI where extra oserver nodes are never required with TCP.
Intel MPI + EFA does not require an extra oserver node at c24 on 288 cores. At c90 or c180 on 288 cores with no extra oserver node, a hang occurs at output. I mentioned in my original post that this hang occurs somewhere in call o_clients%done_collective_stage() in MAPL_HistoryGridComp.F90. From further tracing, it seems like this occurs in MPI_win_fence. The stack traces below are all from run setups that actually crash rather than hang, and many of them explicitly crash during calls to MPI_win_fence.
When I bump down to 216 cores at c90 or c180, a crash occurs at output with the following stack trace (this is with libfabric 1.12.1):
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted.
libfabric:26056:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TXforrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
gchp 0000000002EFB4CA Unknown Unknown Unknown
libpthread-2.17.s 00002AE561D9B630 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AE560E6673F Unknown Unknown Unknown
libmpi.so.12.0.0 00002AE560E5CB5E Unknown Unknown Unknown
libmpi.so.12.0.0 00002AE560E5BB90 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AE560E7A937 Unknown Unknown Unknown
libmpi.so.12.0.0 00002AE560A7FC1F Unknown Unknown Unknown
libmpi.so.12.0.0 00002AE56101EA24 MPI_Win_fence Unknown Unknown
libmpifort.so.12. 00002AE5606085DD pmpi_win_fence_ Unknown Unknown
gchp 0000000001C50BFE Unknown Unknown Unknown
gchp 0000000001B528DE Unknown Unknown Unknown
gchp 0000000001BB967E Unknown Unknown Unknown
gchp 0000000001B38438 Unknown Unknown Unknown
gchp 0000000001C55424 Unknown Unknown Unknown
gchp 0000000001B49999 Unknown Unknown Unknown
gchp 0000000001BCD13C Unknown Unknown Unknown
gchp 0000000001BD6CBB Unknown Unknown Unknown
gchp 000000000101A279 mapl_historygridc 3571 MAPL_HistoryGridComp.F90
gchp 000000000249000E Unknown Unknown Unknown
gchp 00000000024940FB Unknown Unknown Unknown
gchp 00000000024A6CA2 Unknown Unknown Unknown
gchp 000000000249166A Unknown Unknown Unknown
gchp 0000000001EEFF7D Unknown Unknown Unknown
gchp 0000000002443371 Unknown Unknown Unknown
gchp 000000000164191C mapl_genericmod_m 1834 MAPL_Generic.F90
gchp 000000000249000E Unknown Unknown Unknown
gchp 00000000024940FB Unknown Unknown Unknown
gchp 00000000024A6CA2 Unknown Unknown Unknown
gchp 000000000249166A Unknown Unknown Unknown
gchp 0000000001EEFF7D Unknown Unknown Unknown
gchp 0000000002443371 Unknown Unknown Unknown
gchp 0000000001006482 mapl_capgridcompm 1244 MAPL_CapGridComp.F90
gchp 0000000001005A11 mapl_capgridcompm 1170 MAPL_CapGridComp.F90
gchp 0000000001005700 mapl_capgridcompm 1117 MAPL_CapGridComp.F90
gchp 000000000100513C mapl_capgridcompm 809 MAPL_CapGridComp.F90
gchp 000000000249000E Unknown Unknown Unknown
gchp 00000000024940FB Unknown Unknown Unknown
gchp 00000000024A6CA2 Unknown Unknown Unknown
gchp 000000000249166A Unknown Unknown Unknown
gchp 0000000001EEFF7D Unknown Unknown Unknown
gchp 0000000002443371 Unknown Unknown Unknown
gchp 000000000100971F mapl_capgridcompm 948 MAPL_CapGridComp.F90
gchp 0000000000FFF6DE mapl_capmod_mp_ru 246 MAPL_Cap.F90
gchp 0000000000FFEED5 mapl_capmod_mp_ru 211 MAPL_Cap.F90
gchp 0000000000FFDADD mapl_capmod_mp_ru 154 MAPL_Cap.F90
gchp 0000000000FFD1DF mapl_capmod_mp_ru 129 MAPL_Cap.F90
gchp 000000000053D4BF MAIN__ 30 GCHPctm.F90
gchp 000000000053C2A2 Unknown Unknown Unknown
libc-2.17.so 00002AE56306C555 __libc_start_main Unknown Unknown
gchp 000000000053C1A9 Unknown Unknown Unknown
The uppermost intelligible line number is at call o_Clients%done_collective_stage() in MAPL_HistoryGridComp. This crash also occurs at 288 cores if I force I_MPI_FABRICS=ofi (rather than its default shm:ofi), with some extra error output (removed repeats) after the stack trace:
libfabric:11185:efa:cq:efa_cq_readerr():82<warn> Work completion status: remote invalid RD request
libfabric:11185:efa:cq:rxr_cq_handle_cq_error():345<warn> fi_cq_readerr: err: Input/output error (5), prov_err: unknown error (15)
libfabric:11185:efa:cq:rxr_cq_handle_tx_error():221<warn> rxr_cq_handle_tx_error: err: 5, prov_err: Unknown error -15 (15)
Abort(807550351) on node 252 (rank 252 in comm 0): Fatal error in PMPI_Win_fence: Other MPI error, error stack:
PMPI_Win_fence(124)............: MPI_Win_fence(assert=0, win=0xa0000004) failed
MPID_Win_fence(264)............:
MPIDIG_mpi_win_fence(489)......:
MPIDI_Progress_test(185).......:
MPIDI_OFI_handle_cq_error(1042): OFI poll failed (ofi_events.c:1042:MPIDI_OFI_handle_cq_error:Input/output error)
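For completeness, these are the Intel MPI knobs being toggled in the tests above and below; treat the exact combination as a sketch rather than a recommended configuration:

# intra-node vs. inter-node fabric selection (shm:ofi is the Intel MPI default)
export I_MPI_FABRICS=shm:ofi    # forcing plain ofi also crashes at 288 cores, as shown above

# use Intel MPI's bundled libfabric rather than an external build, and request EFA
export I_MPI_OFI_LIBRARY_INTERNAL=1
export I_MPI_OFI_PROVIDER=efa   # roughly equivalent to FI_PROVIDER=efa for the internal libfabric

mpirun -np 288 ./gchp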
If I force Intel MPI 2019U10 to use its internal libfabric version and tell it to use EFA (and reset I_MPI_FABRICS=shm:ofi), I get a more detailed segfault crash message at output time:
#0 0x2b1adee3462f in ???
#1 0x2b1addeff73f in MPIDI_OFI_handle_lmt_ack
at ../../src/mpid/ch4/netmod/include/../ofi/ofi_am_events.h:442
#2 0x2b1addeff73f in am_recv_event
at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:709
#3 0x2b1addef5b5d in MPIDI_OFI_dispatch_function
at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:830
#4 0x2b1addef4b8f in MPIDI_OFI_handle_cq_entries
at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:957
#5 0x2b1addf13936 in MPIDI_OFI_progress
at ../../src/mpid/ch4/netmod/ofi/ofi_progress.c:40
#6 0x2b1addb18c1e in MPIDI_Progress_test
at ../../src/mpid/ch4/src/ch4_progress.c:181
#7 0x2b1addb18c1e in MPID_Progress_test
at ../../src/mpid/ch4/src/ch4_progress.c:236
#8 0x2b1ade0b7a23 in MPIDIG_mpi_win_fence
at ../../src/mpid/ch4/src/ch4r_win.h:489
#9 0x2b1ade0b7a23 in MPID_Win_fence
at ../../src/mpid/ch4/src/ch4_win.h:257
#10 0x2b1ade0b7a23 in PMPI_Win_fence
at ../../src/mpi/rma/win_fence.c:108
#11 0x2b1add4815dc in pmpi_win_fence_
at ../../src/binding/fortran/mpif_h/win_fencef.c:269
#12 0x191ecab in __pfio_rdmareferencemod_MOD_fence
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/RDMAReference.F90:159
#13 0x188406c in __pfio_baseservermod_MOD_receive_output_data
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/BaseServer.F90:75
#14 0x18ab31d in __pfio_serverthreadmod_MOD_handle_done_collective_stage
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:980
#15 0x18ab31d in __pfio_serverthreadmod_MOD_handle_done_collective_stage
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:961
#16 0x1922594 in __pfio_messagevisitormod_MOD_handle
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MessageVisitor.F90:269
#17 0x1878971 in __pfio_abstractmessagemod_MOD_dispatch
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/AbstractMessage.F90:115
#18 0x187d5dc in __pfio_simplesocketmod_MOD_send
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/SimpleSocket.F90:105
#19 0x18aded7 in __pfio_clientthreadmod_MOD_done_collective_stage
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ClientThread.F90:429
#20 0x18b2424 in __pfio_clientmanagermod_MOD_done_collective_stage
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ClientManager.F90:381
#21 0xe1d437 in run
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/History/MAPL_HistoryGridComp.F90:3571
#22 0x1fcfaed in ???
#23 0x1fcfd93 in ???
#24 0x1cf6580 in ???
#25 0x1cc67a4 in ???
#26 0x1fcdeb8 in ???
#27 0x1b5651c in ???
#28 0x1f76eaf in ???
#29 0x138d76f in mapl_genericwrapper
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/MAPL_Generic.F90:1860
#30 0x1fcfaed in ???
#31 0x1fcfd93 in ???
#32 0x1cf6580 in ???
#33 0x1cc67a4 in ???
#34 0x1fcdeb8 in ???
#35 0x1b5651c in ???
#36 0x1f76eaf in ???
#37 0xdf8e07 in last_phase
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1256
#38 0xdf8e07 in __mapl_capgridcompmod_MOD_step
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1167
#39 0xdf95e0 in run_mapl_gridcomp
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:1116
#40 0xdf95e0 in run_gc
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:815
#41 0x1fcfaed in ???
#42 0x1fcfd93 in ???
#43 0x1cf6580 in ???
#44 0x1cc67a4 in ???
#45 0x1fcdeb8 in ???
#46 0x1b5651c in ???
#47 0x1f76eaf in ???
#48 0xdf69ed in __mapl_capgridcompmod_MOD_run
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_CapGridComp.F90:952
#49 0xdf38b6 in __mapl_capmod_MOD_run_model
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:246
#50 0xdf313f in __mapl_capmod_MOD_run_member
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:211
#51 0xdf335e in __mapl_capmod_MOD_run_ensemble
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:151
#52 0xdf3442 in __mapl_capmod_MOD_run
at /shared/gchp_fullchem_intelmpi/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:134
#53 0x427dbb in gchpctm_main
at /shared/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:30
#54 0x426658 in main
at /shared/gchp_fullchem_intelmpi/CodeDir/src/GCHPctm.F90:15
The crash message for Intel MPI at c180 with TCP (removing many repeated messages) is:
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool
libfabric:16158:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:16158:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (17) not found in wait list - 0x60190f0
libfabric:19854:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:19854:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (66) not found in wait list - 0x617d0f0
libfabric:16151:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:16151:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (66) not found in wait list - 0x61a80f0
libfabric:13201:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
srun: error: compute-dy-c5n18xlarge-3: task 74: Killed
libfabric:16148:ofi_rxm:ep_ctrl:rxm_conn_handle_notify():1165<info> notify event 1
libfabric:16148:tcp:fabric:ofi_wait_del_fd():218<info> Given fd (63) not found in wait list - 0x68500f0
libfabric:16158:tcp:core:ofi_check_rx_attr():786<info> Tx only caps ignored in Rx caps
libfabric:16158:tcp:core:ofi_check_tx_attr():884<info> Rx only caps ignored in Tx caps
libfabric:16158:tcp:ep_ctrl:tcpx_cm_send_req():384<warn> connection failure
libfabric:16158:ofi_rxm:ep_ctrl:rxm_eq_readerr():83<warn> fi_eq_readerr: err: Unknown error -111 (-111), prov_err: Success (0)
libfabric:16158:ofi_rxm:ep_ctrl:rxm_conn_handle_event():1288<warn> Unknown event: 4224956323
I'm able to reproduce the hang in Intel MPI with EFA (seemingly in MPI_Win_fence; it results in the TX entries exhausted message being spammed) by running the following test program, test.f (slightly modified from this issue on the Intel forums), across multiple nodes with 1 process per node:
      program test
      use, intrinsic :: iso_fortran_env
      use, intrinsic :: iso_c_binding
      implicit none
      include 'mpif.h'
      character(8) :: arg
      real(real64), pointer, contiguous :: vec(:)
      real(real64) :: rdum
      integer :: Mrank,Msize,Nrank,Nsize,Mcomm,Ncomm,Msplit,Merr
      integer :: i,win_vec,disp,n,j
      integer(kind=mpi_address_kind) :: size
      type(c_ptr) :: ptr

      nullify(vec)
      call mpi_init(Merr)
      call mpi_comm_size(mpi_comm_world,Msize,Merr)
      call mpi_comm_rank(mpi_comm_world,Mrank,Merr)
      Mcomm = mpi_comm_world
      call mpi_comm_split_type(Mcomm,mpi_comm_type_shared,0,
     &                         mpi_info_null,Ncomm,Merr)
      call mpi_comm_size(Ncomm,Nsize,Merr)
      call mpi_comm_rank(Ncomm,Nrank,Merr)

      if(Nrank==0) then ; size = 1
      else ; size = 0
      endif
      disp = 1
      call mpi_win_allocate(size,disp,mpi_info_null,Ncomm,ptr,
     &                      win_vec,Merr)
      call c_f_pointer(ptr,vec,(/10/))
      call mpi_win_fence(0,win_vec,Merr)

      if(Mrank==0) then
         call getarg(1,arg)
         read(arg,*) n
      endif
      if(Nrank==0) vec = 0

      do j = 1,n
         do i = 1,10
            rdum = i*1d0
            call mpi_accumulate(rdum,1,MPI_DOUBLE_PRECISION,0,
     &           (i-1)*8_MPI_ADDRESS_KIND,1,MPI_DOUBLE_PRECISION,
     &           MPI_SUM,win_vec,Merr)
         enddo
      enddo
      if(Mrank==0) write(*,*) vec

      call mpi_win_fence(0,win_vec,Merr)
      write(*,*) 'Passed final fence'
      nullify(vec)
      call mpi_win_free(win_vec,Merr)
      write(*,*) 'Passed win_free'
      call mpi_comm_free(Ncomm,Merr)
      write(*,*) 'Passed comm_free'
      call mpi_finalize(Merr)
      write(*,*) 'Passed finalize'
      end
Using more than 1 process per node results in a successful completion. Using TCP also results in a successful completion, but with spam of libfabric:19827:ofi_rxm:ep_data:rxm_ep_emulate_inject():1433<warn> Ran out of buffers from Eager buffer pool.
My run script:
#!/bin/bash
#SBATCH --ntasks=2 --nodes=2
#assorted module loads
FI_PROVIDER=efa
mpif90 test.f
mpirun -np 2 ./a.out 8491
Running for any fewer than 8491 iterations completes successfully and in only a few seconds (but outputs many lines of libfabric:6558:efa:ep_ctrl:rxr_ep_alloc_tx_entry():429<warn> TX entries exhausted. before finishing). Running for >=8491 iterations results in a hang.
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.
Closing due to inactivity
Re-opening. @laestrada and I are going to pick this up.
I'm going to close this issue in favor of https://github.com/GEOS-ESM/MAPL/issues/1184. It has more up-to-date info on the outstanding issue.
I commented in #652 that I was encountering perpetual hangs at output time in GCHP using Intel MPI and Amazon's EFA fabric provider on AWS EC2. Consecutive 1-hour runs at c90 on 2 nodes would actually alternate between hanging perpetually at output time, crashing at output time, and finishing with only a benign end-of-run crash, all without me modifying the environment submitted through Slurm. These issues were fixed by updating libfabric from 1.11.1 to 1.11.2. However, at higher core counts (288 cores across 8 nodes vs. 72 cores across 2 nodes in my original tests), I'm still running into indefinite hangs at output time using EFA with both OpenMPI and Intel MPI. Setting FI_PROVIDER=tcp fixes this issue (for OpenMPI; I get immediate crashes right now for TCP + Intel MPI on AWS), but is not a long-term fix. I've tried updating to MAPL 2.5 and cherry-picking https://github.com/GEOS-ESM/MAPL/commit/eda17539c040f5953c7e0656c342da4826a613bc and https://github.com/GEOS-ESM/MAPL/commit/bb20beeba61430069bf751ac27d89f540862d796 to no avail. The hang seemingly occurs at o_clients%done_collective_stage() in MAPL_HistoryGridComp.F90. If I turn on libfabric debug logs, I get spammed with millions of lines of libfabric:13761:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted. and libfabric:13761:efa:ep_ctrl:rxr_ep_alloc_tx_entry():479<warn> TX entries exhausted. at this call, with these warnings continuing to be printed in OpenMPI every few seconds (I cancelled my job after 45 minutes, compared to 7 minutes to completion for TCP runs) but stopping indefinitely after one burst for Intel MPI. I plan to open an issue on the libfabric Github page, but I was wondering if anyone had any suggestions on further additions to MAPL post-2.5 I could try out that might affect this problem, or any suggestions on environment variables to test.
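For reference, a sketch of the environment settings referred to above (the executable name and process count are illustrative):

# work around the hang by forcing the TCP provider (works for OpenMPI here;
# Intel MPI + TCP currently crashes immediately on this setup)
export FI_PROVIDER=tcp

# turn on libfabric's own logging to see the "TX entries exhausted" warnings
export FI_LOG_LEVEL=warn

mpirun -np 288 ./gchp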
at this call, with these warnings continuing to be printed in OpenMPI every few seconds (I cancelled my job after 45 minutes, compared to 7 minutes to completion for TCP runs) but stopping indefinitely after one burst for Intel MPI.I plan to open an issue on the libfabric Github page, but I was wondering if anyone had any suggestions on further additions to MAPL post-2.5 I could try out that might affect this problem, or any suggestions on environment variables to test.