GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF).
https://geos-esm.github.io/MAPL/
Apache License 2.0

Extreme output performance degradation at higher core counts in GCHP using EFA fabric #739

Closed: WilliamDowns closed this issue 2 years ago

WilliamDowns commented 3 years ago

I commented in #652 that I was encountering perpetual hangs at output time in GCHP using Intel MPI and Amazon's EFA fabric provider on AWS EC2. Consecutive 1-hour runs at c90 on 2 nodes would alternate between hanging perpetually at output time, crashing at output time, and finishing with only a benign end-of-run crash, all without me modifying the environment submitted through Slurm. These issues were fixed by updating libfabric from 1.11.1 to 1.11.2. However, at higher core counts (288 cores across 8 nodes vs. 72 cores across 2 nodes in my original tests), I'm still running into indefinite hangs at output time using EFA with both OpenMPI and Intel MPI. Setting FI_PROVIDER=tcp fixes this issue (for OpenMPI; I get immediate crashes right now for TCP + Intel MPI on AWS), but is not a long-term fix. I've tried updating to MAPL 2.5 and cherry-picking https://github.com/GEOS-ESM/MAPL/commit/eda17539c040f5953c7e0656c342da4826a613bc and https://github.com/GEOS-ESM/MAPL/commit/bb20beeba61430069bf751ac27d89f540862d796, to no avail.

The hang seemingly occurs at o_clients%done_collective_stage() in MAPL_HistoryGridComp.F90. If I turn on libfabric debug logs, I get spammed with millions of lines of libfabric:13761:efa:ep_ctrl:rxr_rma_alloc_tx_entry():139<warn> TX entries exhausted. and libfabric:13761:efa:ep_ctrl:rxr_ep_alloc_tx_entry():479<warn> TX entries exhausted. at this call, with these warnings continuing to be printed in OpenMPI every few seconds (I cancelled my job after 45 minutes, compared to 7 minutes to completion for TCP runs) but stopping indefinitely after one burst for Intel MPI.

I plan to open an issue on the libfabric Github page, but I was wondering if anyone had any suggestions on further additions to MAPL post-2.5 I could try out that might affect this problem, or any suggestions on environment variables to test.
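For reference, the toggles I'm referring to are roughly the following (illustrative shell lines; FI_PROVIDER selects the libfabric provider and FI_LOG_LEVEL controls libfabric's log verbosity, and exact behavior depends on the libfabric build):

export FI_PROVIDER=tcp     # force the TCP provider instead of EFA
export FI_LOG_LEVEL=warn   # libfabric logging; the "TX entries exhausted" messages appear at warn level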

tclune commented 3 years ago

@WilliamDowns I'm sorry that you are still encountering this issue. The fact that it persists regardless of MPI flavor certainly suggests that the problem is in MAPL. Unfortunately, to make progress we need to reproduce on our end which means we need to learn how to build/run your version of GCHP in an environment that shows the problem. (E.g., maybe even running your model on our system may not do it.) We can pursue this, but probably not with enough priority to be useful for you.

I've added @weiyuan-jiang to the ticket. He's the most knowledgeable person about how this is meant to work. Perhaps he has some thoughts about variations to try that might better hint at the problem.

One thought I have is to try running the now-working c90 case on more nodes but with the same number of processes (i.e., don't fully populate the nodes). If we see a dependence on the number of nodes, that is a hint that the problem is still in the MPI/fabric rather than on our end. As I said above, this seems less likely now, but ... it's the only concrete thought I have off the top of my head.
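For example, an undersubscribed layout for the 72-rank case might look something like this (illustrative Slurm directives only; the executable and log names are placeholders):

#SBATCH -N 4
#SBATCH --ntasks-per-node=18

mpirun -np 72 ./gchp >> ${log}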

WilliamDowns commented 3 years ago

I'll definitely try out increasing the node count while maintaining the core count, and I'll also try more intermediate core counts. I completely understand that this is probably not reproducible on your end given the apparent reliance on Amazon's proprietary fabric. Hopefully libfabric / AWS developers can shine more light if needed.

weiyuan-jiang commented 3 years ago

@WilliamDowns Before fighting the MPI/fabric, I suggest you update MAPL. We now have at least two options: 1) use only one node as the output server, or 2) use the MultiGroupServer, which does not use shared memory. Here are examples from two of my runs on 40-core Skylake nodes:

1) $GEOSBIN/esma_mpirun -np 160 ./GEOSgcm.x --npes_model 96 --nodes_output_server 1 --logging_config 'logging.yaml'

2) $GEOSBIN/esma_mpirun -np 200 ./GEOSgcm.x --npes_model 96 --nodes_output_server 2 --oserver_type multigroup --npes_backend_pernode 6 --logging_config 'logging.yaml'

tclune commented 3 years ago

@weiyuan-jiang This user is not running GEOS and is not using command line options.

@WilliamDowns The options that are mentioned are in CapOptions and are presumably instead set by some initialization layer in GCHP. The analogs to the above should be more-or-less obvious.
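As a rough sketch (assuming the CapOptions component names mirror the command-line flags), the second command above would correspond to populating something like:

cap_options%npes_model = 96
cap_options%nodes_output_server = [2]
cap_options%oserver_type = 'multigroup'
cap_options%npes_backend_pernode = 6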

tclune commented 3 years ago

I also don't think that GCHP is using the "external" server at all, so the proposed change is nontrivial. (Though perhaps useful from a throughput perspective.)

WilliamDowns commented 3 years ago

So I've been playing around with different core and node counts / distributions: 72 cores distributed across 8 nodes runs fine, and I was able to consistently run successfully on up to 144 cores regardless of node count. At 192 and 216 cores, the model runs successfully without hanging in about half of my runs but hangs in the other half (without me changing anything about the run configuration). I have not succeeded in running without a hang at core counts beyond 216 (tested 240 and 288). I've also experimented with setting CapOptions in MAPL_CapOptions.F90, but if I change any of the npes/nodes_input/output_server variables to non-zero I get an infinite hang during initialization. Also, setting I_MPI_FABRICS="ofi" (using EFA for both intra- and inter-node MPI traffic, as opposed to the default of shm for intra-node and EFA for inter-node) yields a crash at output write:

#0  0x2b2447f8062f in ???
#1  0x2b2446eb1219 in MPIDIG_put_target_msg_cb
        at ../../src/mpid/ch4/src/ch4r_rma_target_callbacks.c:1501
#2  0x2b2447253a94 in MPIDI_OFI_handle_short_am
        at ../../src/mpid/ch4/netmod/include/../ofi/ofi_am_events.h:112
#3  0x2b2447253a94 in am_recv_event
        at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:695
#4  0x2b2447249b5d in MPIDI_OFI_dispatch_function
        at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:830
#5  0x2b2447248b8f in MPIDI_OFI_handle_cq_entries
        at ../../src/mpid/ch4/netmod/ofi/ofi_events.c:957
#6  0x2b2447267936 in MPIDI_OFI_progress
        at ../../src/mpid/ch4/netmod/ofi/ofi_progress.c:40
#7  0x2b2446e6cc1e in MPIDI_Progress_test
        at ../../src/mpid/ch4/src/ch4_progress.c:181
#8  0x2b2446e6cc1e in MPID_Progress_test
        at ../../src/mpid/ch4/src/ch4_progress.c:236
#9  0x2b244740bcf5 in MPIDIG_mpi_win_fence
        at ../../src/mpid/ch4/src/ch4r_win.h:489
#10  0x2b244740bcf5 in MPIDI_NM_mpi_win_fence
        at ../../src/mpid/ch4/netmod/include/../ofi/ofi_win.h:223
#11  0x2b244740bcf5 in MPID_Win_fence
        at ../../src/mpid/ch4/src/ch4_win.h:259
#12  0x2b244740bcf5 in PMPI_Win_fence
        at ../../src/mpi/rma/win_fence.c:108
#13  0x2b24467d55dc in pmpi_win_fence_
        at ../../src/binding/fortran/mpif_h/win_fencef.c:269
#14  0x2b2443636241 in __pfio_rdmareferencemod_MOD_fence
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/RDMAReference.F90:159
#15  0x2b244348a1fa in __pfio_baseservermod_MOD_receive_output_data
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/BaseServer.F90:75
#16  0x2b24434b84b6 in __pfio_serverthreadmod_MOD_handle_done_collective_stage
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ServerThread.F90:972
#17  0x2b244342a647 in __pfio_messagevisitormod_MOD_handle
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/MessageVisitor.F90:269
#18  0x2b2443429d92 in __pfio_abstractmessagemod_MOD_dispatch
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/AbstractMessage.F90:115
#19  0x2b2443470008 in __pfio_simplesocketmod_MOD_send
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/SimpleSocket.F90:105
#20  0x2b24434d8725 in __pfio_clientthreadmod_MOD_done_collective_stage
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ClientThread.F90:429
#21  0x2b24434ed671 in __pfio_clientmanagermod_MOD_done_collective_stage
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/pfio/ClientManager.F90:381
#22  0x2b2442267557 in run
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/MAPL_HistoryGridComp.F90:3577

@lizziel FYI

tclune commented 3 years ago

Changing the o-server options requires also requesting more MPI processes. But I would have hoped that we issue an error message rather than hanging. I'll ask @weiyuan-jiang to check on whether we can trap that error in a more useful manner.

The biggest signal here still seems to be that the problems happen for both OpenMPI and Intel MPI, which strongly indicates a problem on the MAPL side. I suppose it is possible that both flavors are exercising the same bug in some lower-level layer supporting the fabric, but that would be a first.

At the same time, of course, GEOS is running such cases without issue using those flavors of MPI. Very frustrating. (More so for you I am sure.)

mathomp4 commented 3 years ago

Query: Are you disabling hyperthreading on your compute nodes? I know that people here testing on AWS (with various programs and benchmarks) found that they had to disable hyperthreading on the AWS c5/c5n (for example) nodes otherwise performance was not as good as expected.

Note I am trying to get back in the AWS game but GCHP is quite ahead of me. It might be a while until I can test at this scale! (That is, I'll need some serious help from the AWS gurus at NCCS.)

WilliamDowns commented 3 years ago

Hyperthreading is in fact disabled.

GCHP on other non-EFA systems has been working perfectly fine at least. This isn't the first issue we've had with running GCHP using EFA specifically, though previous issues were specific to different MPI providers and were resolved by later fixes to Intel MPI / OpenMPI / Libfabric.

> Changing the o-server options requires also requesting more MPI processes. But I would have hoped that we issue an error message rather than hanging. I'll ask @weiyuan-jiang to check on whether we can trap that error in a more useful manner.

To clarify, do you mean I need to request more cores for the job in general or that I need to edit something in MAPL_CapOptions to balance the number of model processes with the number of output processes? Other variables I've fiddled with besides the ones I mentioned were oserver_type, npes_output_backend, and n_i/oserver_group, which did result in some errors depending on what I was enabling/disabling without altering other variables.

WilliamDowns commented 3 years ago

Libfabric just released 1.12.0 officially two days ago (I had previously been using their development branch after 1.11.2 wasn't working), so I'll try updating to that new release and see if it solves anything.

tclune commented 3 years ago

Yes - you need to request additional processes on the mpirun command. The total number of processes needs to be the number of processes for the app plus the number for I/O. The comms are split by node, so it is ok to have extra processes for the app that are not used.

A concrete example may help. Suppose for simplicity that there are 10 cores per node and the app requires 45 processes. You therefore need 5 nodes for the model and say 1 extra node for the I/O. You would request 60 processes. 45 would be used for the model, 10 for I/O, and the extra 5 cores on the model side will be idled.
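The same bookkeeping in shell arithmetic (a sketch with hypothetical variable names):

coresPerNode=10
npesModel=45
oserverNodes=1
modelNodes=$(( (npesModel + coresPerNode - 1) / coresPerNode ))   # ceiling division: 5 nodes for the model
totalRanks=$(( (modelNodes + oserverNodes) * coresPerNode ))      # 60 ranks to request in total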

The actual parameters for managing the I/O server have changed a bit since I was involved in the design phase. @weiyuan-jiang can comment on how to set the options for a realistic run. (Be sure to specify the MAPL version, as this aspect has probably evolved some more.)

weiyuan-jiang commented 3 years ago

Before I jump in, I would like to know how GCHP is set up and run. I believe the problem is caused by the shared memory across nodes. If the newer fabric is not working, we should set up an oserver (MultiGroupServer) for GCHP.

WilliamDowns commented 3 years ago

Libfabric 1.12.0 did not solve the issue.

Relevant run configuration files (the mostly default contents of which I'll paste below) include a CAP.rc file:

ROOT_NAME: GCHP
ROOT_CF: GCHP.rc
HIST_CF: HISTORY.rc

BEG_DATE:     20160701 000000
END_DATE:     20160701 002000
JOB_SGMT:     00000000 002000

HEARTBEAT_DT:  600

MAPL_ENABLE_TIMERS: YES
MAPL_ENABLE_MEMUTILS: YES
PRINTSPEC: 0  # (0: OFF, 1: IMPORT & EXPORT, 2: IMPORT, 3: EXPORT)
USE_SHMEM: 0
REVERSE_TIME: 0

and a GCHP.rc file:

# Atmospheric Model Configuration Parameters
# ------------------------------------------
NX: 6
NY: 48

GCHP.GRID_TYPE: Cubed-Sphere
GCHP.GRIDNAME: PE90x540-CF
GCHP.NF: 6
GCHP.IM_WORLD: 90
GCHP.IM: 90
GCHP.JM: 540
GCHP.LM: 72

# For stretched grid
#GCHP.STRETCH_FACTOR: 2.0
#GCHP.TARGET_LON: 242.0
#GCHP.TARGET_LAT: 37.0

# For FV advection do not use grid comp name prefix
IM: 90
JM: 540
LM: 72

GEOSChem_CTM: 1

AdvCore_Advection: 1
        DYCORE: OFF
  HEARTBEAT_DT: 600

SOLAR_DT:    600
IRRAD_DT:    600
RUN_DT:      600
GCHPchem_DT: 1200
RRTMG_DT:    10800
DYNAMICS_DT: 600

SOLARAvrg: 0
IRRADAvrg: 0

GCHPchem_REFERENCE_TIME: 001000

# Print Resource Parameters (0: Non-Default values, 1: ALL values)
#-----------------------------------------------------------------
PRINTRC: 0

# Set the number of parallel I/O processes to use when
# RESTART_TYPE and or CHECKPOINT_TYPE are set to pbinary or pnc4
#---------------------------------------------------------------
PARALLEL_READFORCING: 0
NUM_READERS: 1
NUM_WRITERS: 1

# Active observer when desired
# ----------------------------
BKG_FREQUENCY: 0

# Settings for production of restart files
#---------------------------------------------------------------
# Record frequency (HHMMSS) : Frequency of restart file write
#                             Can exceed 24 hours (e.g. 1680000 for 7 days)
# Record ref date (YYYYMMDD): Reference date; set to before sim start date
# Record ref time (HHMMSS)  : Reference time
#RECORD_FREQUENCY: 1680000
#RECORD_REF_DATE: 20000101
#RECORD_REF_TIME: 000000

# Chemistry/AEROSOL Model Restart Files
# Enter +none for GCHPchem_INTERNAL_RESTART_FILE to not use an initial restart file
# -------------------------------------
GCHPchem_INTERNAL_RESTART_FILE:     +initial_GEOSChem_rst.c90_fullchem.nc
GCHPchem_INTERNAL_RESTART_TYPE:     pnc4
GCHPchem_INTERNAL_CHECKPOINT_FILE:  gcchem_internal_checkpoint
GCHPchem_INTERNAL_CHECKPOINT_TYPE:  pnc4

DYN_INTERNAL_RESTART_FILE:    -fvcore_internal_rst
DYN_INTERNAL_RESTART_TYPE:    pbinary
DYN_INTERNAL_CHECKPOINT_FILE: -fvcore_internal_checkpoint
DYN_INTERNAL_CHECKPOINT_TYPE: pbinary
DYN_INTERNAL_HEADER:          1

RUN_PHASES:           1

#
# %%% HEMCO configuration file %%%
#
HEMCO_CONFIG:         HEMCO_Config.rc

#
# %%% Log file names for redirecting stdout %%%
#
STDOUT_LOGFILE:       PET%%%%%.GEOSCHEMchem.log
STDOUT_LOGLUN:        700

#
# %%% Memory debug print level (integer 0 to 3; 0=none, 3=highest)
#
MEMORY_DEBUG_LEVEL:   0

#
# %%% Option to write restart files via o-server
#
WRITE_RESTART_BY_OSERVER: YES

I'm using Slurm to submit runs (usually with srun, but can use mpirun). Relevant pieces of our standard submission scripts look like:

#SBATCH -n 288
#SBATCH -N 8
##SBATCH --exclusive
#SBATCH -t 0-4:00
#SBATCH --mem=MaxMemPerNode
##SBATCH --mem=110000
#SBATCH --mail-type=ALL

# Define GEOS-Chem log file
log="gchp.log"

# Sync all config files with settings in runConfig.sh
source runConfig.sh > ${log}
if [[ $? == 0 ]]; then

    gchp_env=$(readlink -f gchp.env)

    source ${gchp_env} >> ${log}

    # Use SLURM to distribute tasks across nodes
    NX=$( grep NX GCHP.rc | awk '{print $2}' )
    NY=$( grep NY GCHP.rc | awk '{print $2}' )
    coreCount=$(( ${NX} * ${NY} ))
    planeCount=$(( ${coreCount} / ${SLURM_NNODES} ))
    if [[ $(( ${coreCount} % ${SLURM_NNODES} )) -gt 0 ]]; then
        planeCount=$(( ${planeCount} + 1 ))
    fi

    which mpirun >> ${log}
    echo $MPI_ROOT >> ${log}
    # Start the simulation
    #time srun -n ${coreCount} -N ${SLURM_NNODES} -m plane=${planeCount} --mpi=pmi2 ./gchp >> ${log}
    mpirun -np 288 ./gchp >> ${log}

For tweaking different CapOptions, I've been changing values directly in MAPL_CapOptions.F90 and recompiling. My MAPL is v2.5 + https://github.com/GEOS-ESM/MAPL/commit/eda17539c040f5953c7e0656c342da4826a613bc and https://github.com/GEOS-ESM/MAPL/commit/bb20beeba61430069bf751ac27d89f540862d796.

Let me know if there's any other info you need (also @lizziel feel free to chime in with anything I missed).

weiyuan-jiang commented 3 years ago

How can I check out the GCHP code?

WilliamDowns commented 3 years ago

Source code is available from https://github.com/geoschem/GCHP. Compiling and run directory creation instructions are available at https://gchp.readthedocs.io/en/latest/index.html. I'll push a branch that includes the extra updates I included (MAPL 2.4->2.5 + those two commits I mentioned) momentarily.

weiyuan-jiang commented 3 years ago

It seems to me that GCHP should be able to run as GEOSgcm does with the FLAP options. Would you please compile and run this program on your system and send the printout back to me?

mpif90 test.F90
mpirun -np 100 ./a.out

test.F90:

program main
   use mpi
   implicit none
   integer :: node_comm, node_npes, node_rank
   integer :: rank, npes, ierror

   call MPI_init(ierror)

   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
   call MPI_Comm_size(MPI_COMM_WORLD, npes, ierror)

   if (rank == 0) print *, "running ", npes, "processes"

   call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, node_comm, ierror)
   call MPI_Comm_size(node_comm, node_npes, ierror)
   call MPI_Comm_rank(node_comm, node_rank, ierror)

   if (node_rank == 0) print *, "node cores :", node_npes

   call MPI_Finalize(ierror)

end program

Here is the printout from a run on 40-core Skylake nodes at NCCS:

mpirun -np 100 ./a.out
srun: cluster configuration lacks support for cpu binding
 running          100 processes
 node cores :          40
 node cores :          40
 node cores :          20

WilliamDowns commented 3 years ago

On 36-core AWS EC2 c5n.18xlarge nodes:

running          100 processes
 node cores :          34
 node cores :          33
 node cores :          33

Also, the branch I mentioned above is now available on the GCHP repo as feature/oserver_tweaks.

Thank you very much for your assistance. Let me know what more you need me to do / provide.

tclune commented 3 years ago

@weiyuan-jiang GCHP does not want to use FLAP. We've already accommodated them there. They do use the Options (via CapOptions) but sidestepping FLAP itself entirely. Just wanted to be clear about that.

weiyuan-jiang commented 3 years ago

Then we probably should provide an alternative for command line interface input. How about an input.nml file? Or a YAML file?

tclune commented 3 years ago

They don't want a command line interface. Otherwise they would use FLAP. All that FLAP is doing is populating CapOptions. GCHP populates those options via a different mechanism.

weiyuan-jiang commented 3 years ago

We can test where this issue comes from. The MAPL version you have now must use shared memory across nodes if the oserver uses multiple nodes. Note that without any configuration, the oserver overlaps with the application (for example, it spans all 8 nodes in the example you posted). We can try a single-node oserver, though. Here is how to configure it (using your example):

1) Acquire 9 nodes:

#SBATCH -n 324
#SBATCH -N 9

2) mpirun -np 324 ./gchp >> ${log}

3) Fill in (hard-coded for now?) the MAPL_CapOptions components:

npes_model = 288
nodes_output_server = [1]
oserver_type = 'single'

I would like to see the error message if it fails to run.

WilliamDowns commented 3 years ago

So after testing and getting some errors I discovered that I had previously been glossing over an extra step in our run configuration process: we use a script to automatically set some run variables, and I had been inputting the total number of cores and nodes (in this case 324/9) there. This was setting NX and NY in GCHP.rc to take up all of the 324 cores. Changing those settings to use 288/8 instead and using your edits yielded a successful (and fast) run with EFA enabled for the first time!

WilliamDowns commented 3 years ago

I also confirmed that this runs successfully in the version of MAPL in our main branch of GCHP (2.2.7?) if I additionally set npes_output_server=[36].

weiyuan-jiang commented 3 years ago

That is encouraging. NX x NY is npes_model, and the total is npes_model + npes_oserver. However, the total will exceed that sum when npes_model does not fully occupy its nodes. Let's take an example to do the calculation. On 36-core nodes, if NX = 10 and NY = 60, the model takes 17 nodes. So for a 1-node oserver, the total would be 17*36 + 36 = 648. The npes_model is still 600, and the extra cores will be "wasted". If you are still confused, please let me know.

Now let's get back to your example. You can set up multiple independent oservers: nodes_output_server = [1,1,1] will give you three oservers. Of course, you need to increase the resources to 11 nodes.
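Concretely, with your 36-core nodes that would be something like (a sketch based on the numbers above; 288 model cores on 8 nodes plus 3 oserver nodes of 36 cores each):

#SBATCH -n 396
#SBATCH -N 11

mpirun -np 396 ./gchp >> ${log}

together with nodes_output_server = [1,1,1] in the CapOptions.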

If those oservers cannot keep up with your high-core-count runs, a newer version of MAPL (at least v2.6.2) that does not use shared memory across nodes is necessary. I can help set up the CapOptions for that too.

WilliamDowns commented 3 years ago

Thanks for the explanation, that makes sense. I'm trying out higher core counts / resolutions, and a 1-node oserver is giving me a segfault / HDF5 error at c180 on ~1100 cores. I'm trying to use 3 nodes as you described, but if I only change nodes_output_server=[1,1,1] (plus adjusting npes_model) I get:

Starting pFIO output server on 1 nodes
Starting pFIO output server on 1 nodes

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1080 PID 22696 RUNNING AT compute-dy-c5n18xlarge-31
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Error termination. Backtrace:
At line 217 of file /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/ServerManager.F90
Fortran runtime error: Index '2' of dimension 1 of array 'npes_out' above upper bound of 1

Error termination. Backtrace:
#0  0x2ad0ab7500c9 in __mapl_servermanager_MOD_initialize
        at /home/centos/gchp_fullchem_intelmpi/CodeDir/src/MAPL/base/ServerManager.F90:217

Are there other settings I need to change to run with multiple oservers?

weiyuan-jiang commented 3 years ago

Can you post line 217 for me?

WilliamDowns commented 3 years ago
215           if (rank == 0 .and. nodes_out(i) /=0 ) then
216              write(*,'(A,I0,A)')"Starting pFIO output server on ",nodes_out(i)," nodes"
217           else if (rank==0 .and. npes_out(i) /=0 ) then
218             write(*,'(A,I0,A)')"Starting pFIO output server on ",npes_out(i)," pes"
219           end if
weiyuan-jiang commented 3 years ago

I did find a bug here:

if (rank == 0 .and. nodes_out(1) /= 0) then
   write(*,'(A,I0,A)') "Starting pFIO output server on ", nodes_out(i), " nodes"
else if (rank == 0 .and. npes_out(1) /= 0) then
   write(*,'(A,I0,A)') "Starting pFIO output server on ", npes_out(i), " pes"
end if

Note the changes are in the conditions: (i) -> (1). The write lines still use nodes_out(i) and npes_out(i). The same applies to the iserver.

But that does not seem to solve your case. I need to know line 217 anyway.

WilliamDowns commented 3 years ago

Yeah, unfortunately I made that change, and now the model prints "Starting pFIO output server on 1 nodes" once and then hangs.

weiyuan-jiang commented 3 years ago

In my test.F90, would you please change the line to

if (node_rank == 0) print*, "node root rank :", rank

I want to see how the ranks are arranged on AWS.

weiyuan-jiang commented 3 years ago

Also, please post your setup and run script here. Thanks.

WilliamDowns commented 3 years ago
 running          100 processes
 node root rank:           0
 node root rank:          67
 node root rank:          34
weiyuan-jiang commented 3 years ago

The rank arrangement looks right to me. I am still confused as to why you cannot start multiple oservers. Can you still run the 1-node oserver with the high-core-count run?

WilliamDowns commented 3 years ago

I just realized that for my 3-oserver run I had forgotten to increase the process count in my mpirun call from 324 to 396. Fixing that makes multiple-oserver c90 runs work properly. My apologies for the confusion.

The c180 runs (tested with multiple oservers and 1 oserver) progress all the way to the checkpoint writing stage (past writing output, where the main hang I opened this issue for was happening), at which point it fails with an HDF5 error (I've removed some of the stack here):

 Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint
 Large pool oserver is chosen, nwriting and server size :           1          36

There are 1 HDF5 objects open!

Report: open objects on 72057594037927936

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1116 PID 11314 RUNNING AT compute-dy-c5n18xlarge-32
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2b2cd22a262f in ???
#1  0x2b2ccfe571b0 in H5G_rootof
        at /tmp/centos/spack-stage/spack-stage-hdf5-1.10.7-hzcvzrmwqhczeq2ib5h6pjgfpkxqmtkc/spack-src/src/H5Groot.c:120
#2  0x2b2ccfe57a16 in H5G_root_loc
        at /tmp/centos/spack-stage/spack-stage-hdf5-1.10.7-hzcvzrmwqhczeq2ib5h6pjgfpkxqmtkc/spack-src/src/H5Groot.c:388
#3  0x2b2ccfe4e5cb in H5G_loc
        at /tmp/centos/spack-stage/spack-stage-hdf5-1.10.7-hzcvzrmwqhczeq2ib5h6pjgfpkxqmtkc/spack-src/src/H5Gloc.c:165
#4  0x2b2ccfe85443 in H5Iget_name
        at /tmp/centos/spack-stage/spack-stage-hdf5-1.10.7-hzcvzrmwqhczeq2ib5h6pjgfpkxqmtkc/spack-src/src/H5I.c:2016
#5  0x2b2cd2b79759 in ???
#6  0x2b2cd2b79967 in ???
#7  0x2b2cd2b799e6 in ???
#8  0x2b2cd2b7b109 in ???
#9  0x2b2cd2b7ae2a in ???
#10  0x2b2cd2b7aff3 in ???
#11  0x2b2cd2b7b5cf in ???
#12  0x2b2cd2b25d06 in ???
#13  0x184b380 in __pfio_netcdf4_fileformattermod_MOD_close
        at /home/centos/GCHP/src/MAPL/pfio/NetCDF4_FileFormatter.F90:260
#14  0x18a1e7d in __pfio_historycollectionmod_MOD_clear
        at /home/centos/GCHP/src/MAPL/pfio/HistoryCollection.F90:105
etc.

That happens for my c180 test with either 1116 or 288 model cores (36-core/1-node oserver for each test). I'm rebuilding netCDF-Fortran and ESMF with explicit support for Parallel-netCDF to try and solve that issue.

I'm working on moving several configuration / environment files to post here.

WilliamDowns commented 3 years ago

Unfortunately rebuilding with Parallel-netCDF libraries didn't solve that seg fault issue. I'll see if I can make sure it's not an out-of-memory issue.

weiyuan-jiang commented 3 years ago

If it is a memory issue, I suggest you use the new version of MAPL, which can use multiple nodes as the oserver without shared memory.

tclune commented 3 years ago

@WilliamDowns How much flexibility do you have to actually bring in new versions of MAPL? We ordinarily coordinate such things with @lizziel .

WilliamDowns commented 3 years ago

I would prefer not to need to attempt a full update of MAPL if it can be avoided. The update to 2.5 was already available to me through a series of pending PRs, and I cherry-picked a few more recent commits as I mentioned. Would it be at all possible to get the non-shared memory oserver functionality through a series of individual commits?

Still looking into the cause of the HDF5 crash.

lizziel commented 3 years ago

I'm updating all of the GMAO libraries we use (13!) this week. I hope for it to be working very soon.

WilliamDowns commented 3 years ago

I actually had a "successful" run earlier today (output + checkpoints + timers at the end), but still with an error (no segfault though) appearing during final timers (after successful checkpoint write) indicating the HDF5 object was unable to be properly closed. I verified the checkpoint file is actually openable and all variables in it readable. I repeated this run 4 more times and ended up with 2 "successes" and 3 failures (segfault before checkpoint finishes writing). I also tried updating HDF5 from 1.10.7 to 1.12.0. Still trying to properly diagnose memory usage of these runs.

WilliamDowns commented 3 years ago

Unfortunately updating to MAPL 2.6.4 did not fix the HDF5 crash issue for Intel MPI on AWS at high resolutions. However, updating to 2.6.4 did allow me to try out OpenMPI with the oserver enabled for the first time, which does work (as opposed to Intel MPI). I can do a 1-hour c180 run on 288 model cores and get a total runtime of 24 minutes if using the EFA fabric vs. 11 minutes with TCP. When using Intel MPI, I nearly always get an HDF5 error when writing a checkpoint file (after 7 to 10 minutes). The rare successful runs take ~11 minutes, regardless of using TCP or EFA fabrics.

weiyuan-jiang commented 3 years ago

I am glad to know you have upgraded to v2.6.4. Now we have a chance to use a different server type. What we need to do is change two parameters in CapOptions:

oserver_type = 'multigroup'
npes_backend_pernode = 5

With the above configuration, we no longer use shared memory.

WilliamDowns commented 3 years ago

With those options (plus npes_model = 288, cap_options%nodes_output_server=[1,1,1], 11 nodes / 396 cores total), I encounter an apparent hang (but no crash after 25+ minutes) on the fabric / MPI combinations I've tested so far (Intel MPI + EFA and OpenMPI + TCP). This is the last output:

Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_FILE:gcchem_internal_checkpoint
 Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint
 Large pool oserver is chosen, nwriting and server size :           1          31
weiyuan-jiang commented 3 years ago

Can you try:

npes_model = 288
nodes_output_server = [3]
oserver_type = 'multigroup'
npes_backend_pernode = 5

weiyuan-jiang commented 3 years ago

If there is no crash, you can use Ctrl-C to stop the program and see where it hangs.

WilliamDowns commented 3 years ago

Unfortunately that change yields an HDF5 crash as before with Intel MPI (completes fine with OpenMPI).

weiyuan-jiang commented 3 years ago

Are you using the HDF5 libs built by @mathomp4 in the baselibs?

tclune commented 3 years ago

No. GCHP does not use baselibs.

WilliamDowns commented 3 years ago

To confirm this isn't a universal issue with my HDF5 setup (built through Spack), I built the same set of libraries / MAPL updates on Harvard's cluster and completed a successful run with Intel MPI. On AWS I also tried setting WRITE_RESTART_BY_OSERVER to NO and still get the same HDF5 error.

WilliamDowns commented 3 years ago

It turns out the stack trace is actually occurring while writing model output diagnostics rather than while writing the checkpoint file. My current crashing Intel MPI setup has been outputting a corrupted file every time for our largest default diagnostic (species concentration). If I disable the species concentration diagnostic, the model completes without error. If I disable all diagnostics except for species concentration, the model completes without error. If I enable all diagnostics but disable species concentration, the model crashes.
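(For reference, "disabling" a diagnostic here just means commenting its collection out of the COLLECTIONS list at the top of HISTORY.rc, roughly like the sketch below; SpeciesConc is the default GCHP collection name, and the other entries are just examples.)

COLLECTIONS: 'Emissions',
             #'SpeciesConc',
             'StateMet',
::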