GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF).
https://geos-esm.github.io/MAPL/
Apache License 2.0

Error writing checkpoints at high core counts #548

Closed lizziel closed 4 years ago

lizziel commented 4 years ago

I've been running GCHPctm with MAPL 2.2.7 for various grid resolutions and core counts on the Harvard Cannon cluster. I am encountering an error while writing checkpoint files when running with high core counts, in my case 1440 cores. The error is in UCX, so not MAPL specifically, but it is specific to the MAPL checkpoint files:

srun: error: holy2a01303: task 0: Killed
[holy2a19302:73246:0:73246]    ud_iface.c:746  Fatal: transport error: Endpoint timeout
==== backtrace ====
[holy2a19304:31734:0:31734]    ud_iface.c:746  Fatal: transport error: Endpoint timeout
==== backtrace ====
    0  /usr/lib64/libucs.so.0(ucs_fatal_error_message+0x68) [0x2b64849e1318]
    1  /usr/lib64/libucs.so.0(+0x17495) [0x2b64849e1495]
    2  /usr/lib64/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x121) [0x2b648b0267c1]
    3  /usr/lib64/ucx/libuct_ib.so.0(+0x1d902) [0x2b648b02a902]
    4  /usr/lib64/libucp.so.0(ucp_worker_progress+0x5a) [0x2b64843679ea]
    5  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi.so.40(mca_pml_ucx_send+0x107) [0x2b6481f48727]
    6  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi.so.40(MPI_Gatherv+0xf0) [0x2b6481e354c0]
    7  /n/helmod/apps/centos7/Comp/gcc/9.3.0-fasrc01/openmpi/4.0.2-fasrc01/lib64/libmpi_mpifh.so.40(pmpi_gatherv__+0xad) [0x2b648196212d]
    8  /n/holyscratch01/jacob_lab/elundgren/testruns/GCHPctm/13.0.0-alpha.10/scalability/gfortran93/gchp_standard_c180_1440core/./geos() [0x13d93e8]
   etc
===================
Program received signal SIGABRT: Process abort signal.

My libraries are as follows (plus UCX 1.6.0):

  1) git/2.17.0-fasrc01      7) zlib/1.2.11-fasrc02
  2) gmp/6.1.2-fasrc01       8) szip/2.1.1-fasrc01
  3) mpfr/3.1.5-fasrc01      9) hdf5/1.10.6-fasrc03
  4) mpc/1.0.3-fasrc06      10) netcdf/4.7.3-fasrc03
  5) gcc/9.3.0-fasrc01      11) netcdf-fortran/4.5.2-fasrc04
  6) openmpi/4.0.2-fasrc01  12) cmake/3.16.1-fasrc01

My run is at c180 with NX=16 and NY=90. I am using 24 cores per node across 60 nodes, reserving the full 128 GB of memory on each node. Originally I hit this error at the very start of the run because I had periodic checkpoints configured (RECORD_* in GCHP.rc), which caused a checkpoint to be written at the beginning of the run. I turned that off and the run then made it to the end and successfully wrote History files, but hit the same issue again when writing the checkpoint file.
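
For reference, the layout and periodic-checkpoint settings in question look roughly like the GCHP.rc excerpt below. This is illustrative, not my exact file, and the RECORD_* value formats are from memory; commenting those lines out is what stopped the checkpoint write at the start of the run.

NX: 16                      # core layout, NX * NY = 1440 (NY a multiple of 6)
NY: 90

#RECORD_FREQUENCY: 024000   # periodic mid-run checkpoints (now disabled)
#RECORD_REF_DATE:  20160701
#RECORD_REF_TIME:  000000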

@LiamBindle also encountered this problem on a separate compute cluster with c360 using 1200 cores.

Have you seen this before?

LiamBindle commented 4 years ago

I received a similar error running a C360 sim on 1200 cores. The error message I got was

 ExtData Run_: Calculating derived fields
 ExtData Run_: End
 Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
 Using parallel NetCDF for file: gcchem_internal_checkpoint.20160701_0000z.nc4
[compute1-exec-78:49   :0:49]    ud_iface.c:747  Fatal: transport error: Endpoint timeout
==== backtrace ====
    0  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_message+0x60) [0x7fb40badbaa0]
    1  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_format+0xde) [0x7fb40badbc0e]
    2  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x4d355) [0x7fb4035d2355]
    3  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x10b) [0x7fb4035d246b]
    4  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x5bb90) [0x7fb4035e0b90]
    5  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fb40bf3cdba]
    6  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_progress+0x17) [0x7fb40da757d7]
    7  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libopen-pal.so.40(opal_progress+0x2b) [0x7fb40a51a3ab]
    8  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_send+0x275) [0x7fb40da77645]
    9  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(PMPI_Gatherv+0x190) [0x7fb40d95f830]
   10  /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi_mpifh.so.40(MPI_Gatherv_f08+0xab) [0x7fb40dfb1d9b]
   11  /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x136e57e]
   12  /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x1380cec]
    .
    .
    .

My libraries are

bash-4.2$ spack find --loaded
==> 14 installed packages
-- linux-centos7-skylake_avx512 / gcc@8 -------------------------
esmf@8.0.0  hdf5@1.10.6  hwloc@1.11.11  libnl@3.3.0  libpciaccess@0.13.5  libxml2@2.9.9  lsf@10.1  netcdf-c@4.7.3  netcdf-fortran@4.5.2  numactl@2.0.12  openmpi@3.1.5  rdma-core@20  ucx@1.6.1  zlib@1.2.11

In my run I have NX=10, NY=120. I used 30 cores per node across 40 nodes with 300 GB of memory per node. Let me know if there's any more information that I can provide.

mathomp4 commented 4 years ago

A couple of things. First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?

Second, as a test can you see if adding:

WRITE_RESTART_BY_OSERVER: YES

to AGCM.rc (or your equivalent) does anything? I set it when I run GEOS with Open MPI but that's actually for a performance reason, not a 'things go crash' reason. (Or, conversely, if you run with that set to YES, can you try it with NO.)
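
(For anyone unfamiliar with the mechanism: MCA parameters can be passed either as OMPI_MCA_* environment variables or with --mca on the mpirun line. The two forms below are equivalent; the parameter and core count are just examples.)

export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpirun -np 1440 ./geos

mpirun -np 1440 --mca btl_vader_single_copy_mechanism none ./geos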

LiamBindle commented 4 years ago

For me,

> First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?

My only OMPI MCA setting is:

bash-4.2$ env | grep OMPI_
OMPI_MCA_btl_vader_single_copy_mechanism=none

I must admit, I'm not familiar with these settings. Our sysadmin set this and I've used it blindly.

For the second point, I'll give WRITE_RESTART_BY_OSERVER a try and report back once the job has run!

lizziel commented 4 years ago

I've got a rerun in the queue with the WRITE_RESTART_BY_OSERVER setting and should have results tomorrow.

I actually use srun rather than mpirun. I have nothing containing OMPI in my environment. Should I? My Open MPI build settings are summarized in this file in case it's useful: ompi_info.txt
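
For completeness, the two launch styles in question are simply the following (simplified; the real job script has more options):

srun -n 1440 ./geos
mpirun -np 1440 ./geos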

mathomp4 commented 4 years ago

It looks pretty standard. You built with UCX instead of verbs, which I think is the current preferred method for InfiniBand. I will note I often have issues with srun on discover, but maybe Cannon is different. I tend to stick with mpirun because I'm old. I suppose you could try that, but in the end I imagine Open MPI does the same thing.

The OMPI_MCA_btl_vader_single_copy_mechanism setting is one I've seen before when using Open MPI with containers, and indeed we set it in our own CI: https://github.com/GEOS-ESM/MAPL/blob/34eae4b436695ded67a9830ffe47a286de897bc9/.circleci/config.yml#L10 so I can't complain.

If I had a thought from what you've said, it might be to try a newer version of UCX, say one in the 1.8 series or the new 1.9.0. Though maybe that will just cause different errors...
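
A quick way to double-check which UCX your Open MPI is actually picking up is something like:

ucx_info -v              # prints the UCX version and build configuration
ompi_info | grep -i ucx  # confirms the pml/ucx component Open MPI was built with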

LiamBindle commented 4 years ago

I checked this morning and my sim is ~3 days in, so it looks like

WRITE_RESTART_BY_OSERVER: YES

worked! Thanks! What does this switch do?

lizziel commented 4 years ago

This is now solved for me as well. My first run kept srun but added WRITE_RESTART_BY_OSERVER. It crashed before even writing the first checkpoint, but I think that was due to cluster issues, given a couple of other runs inexplicably failed but are fine today. For my latest run I switched to mpirun, which adds some uncertainty about what exactly the fix was: dropping srun, or using the o-server for the restart write. I'll narrow it down.

mathomp4 commented 4 years ago

> I checked this morning and my sim is ~3 days in, so it looks like
>
> WRITE_RESTART_BY_OSERVER: YES
>
> worked! Thanks! What does this switch do?

I'll ping @weiyuan-jiang to the thread to be more specific, but when I was trying to run with Open MPI on Discover, I found that it was taking ages to write out restarts. I think I eventually tracked it down to Open MPI having some bad MPI_GatherV (or Gather? can't remember) timings. Like stupid bad. And guess what calls are used when writing checkpoints/restarts? 😄

So, I asked around and it turns out @weiyuan-jiang added a (somewhat hidden) ability for the IOSERVER to write the restarts instead of the "normal" path. The IOSERVER uses Send/Recv I think, so it bypasses the badly performing call.
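
To make that concrete, here is a minimal, purely illustrative Fortran sketch of the kind of root-gather the "normal" checkpoint path leans on. It is not the MAPL code, just the MPI pattern in question:

! Illustrative sketch only -- not MAPL source code.
program gatherv_sketch
   use mpi
   implicit none
   integer :: ierr, rank, nranks, i
   integer, parameter :: nlocal = 4                 ! points owned by each rank
   real,    allocatable :: sendbuf(:), recvbuf(:)
   integer, allocatable :: rcounts(:), displs(:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   allocate(sendbuf(nlocal), rcounts(nranks), displs(nranks))
   sendbuf = real(rank)
   rcounts = nlocal
   displs  = [(i*nlocal, i = 0, nranks-1)]
   allocate(recvbuf(merge(nlocal*nranks, 1, rank == 0)))

   ! At 1440 ranks this funnels 1440 small contributions into rank 0,
   ! which is roughly where the UCX endpoint timeouts appeared above.
   call MPI_Gatherv(sendbuf, nlocal, MPI_REAL, recvbuf, rcounts, displs, &
                    MPI_REAL, 0, MPI_COMM_WORLD, ierr)

   ! WRITE_RESTART_BY_OSERVER instead routes the data to the o-server via
   ! point-to-point Send/Recv, so this collective is never exercised.
   call MPI_Finalize(ierr)
end program gatherv_sketch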

Now, I will say that in our GEOSldas @weiyuan-jiang found some sort of oddity happening with the WRITE_RESTART_BY_OSERVER method. I can't remember what (binary restarts?) but I have never seen any issues in my testing with the GCM.

lizziel commented 4 years ago

This is great. I'll add the line to the default GCHP.rc file for the GCHP 13.0.0 release, pending a response from @weiyuan-jiang on what the observed oddity was of course.

mathomp4 commented 4 years ago

@lizziel Note that I only turn this on with Open MPI. I keep our "default" behavior with Intel MPI, etc. because, well, it works so don't rock the boat.

(Well, we do need I_MPI_ADJUST_GATHERV=3 for Intel MPI because the other GatherV algorithms seemed to have issues on our system, etc.)

weiyuan-jiang commented 4 years ago

I am checking with @bena-nasa. Eventually, we will eliminate the parameter WRITE_RESTART_BY_OSERVER. So far, without this parameter the program goes down a different branch, which may cause problems.

mathomp4 commented 4 years ago

If you do eliminate it, that would probably mean I have to stop using Open MPI on discover. It is the only way I can write checkpoints due to the crazy slow MPI_GatherV performance.

lizziel commented 4 years ago

Could you update checkpoint writing to be similar to History writing so it avoids the problem?

weiyuan-jiang commented 4 years ago

Even when we use WRITE_RESTART_BY_OSERVER, we still use mpi_gatherV. I am wondering if that is the problem.

tclune commented 4 years ago

@weiyuan-jiang But isn't it the case that when we use the OSERVER, the gatherv() is on a much smaller set of processes? For the main application there are many cores, and therefore many very small messages. On the server there are far fewer cores, and thus fewer, larger messages.

weiyuan-jiang commented 4 years ago

The oserver does not have an mpi_gatherV. This gatherV happens only on the client side, in 1D tile space: the client gathers all the data and then sends it through the oserver. For multi-dimensional fields, WRITE_RESTART_BY_OSERVER bypasses the gatherV. @tclune

weiyuan-jiang commented 4 years ago

@lizziel Do you have any problems after you set WRITE_RESTART_BY_OSERVER to yes?

bena-nasa commented 4 years ago

It looks like you hit an MPI problem in a gatherV. Like Weiyuan said, if you use the write-by-oserver option it bypasses the gatherV and takes a whole different code path to write the checkpoint. So you have sidestepped the problem by not exercising the code that was causing it in the first place.

lizziel commented 4 years ago

I have not noticed any run issues after setting WRITE_RESTART_BY_OSERVER to yes.

lizziel commented 4 years ago

I am going to close this issue. Please keep @LiamBindle and me informed if there is a new fix in a future MAPL release, or if this fix is retired without a replacement for the problem.

LiamBindle commented 3 years ago

Last week I tried GCHP at C360 with Intel MPI on Compute1 (WashU cluster) and saw that the checkpoint file was being written extremely slowly. I saw @mathomp4's comment above about I_MPI_ADJUST_GATHERV=3, so I tried it and it fixed my problem. Thanks @mathomp4!

mathomp4 commented 3 years ago

@LiamBindle The other one to watch out for is I_MPI_ADJUST_ALLREDUCE. For some reason (on discover) we had to set that to 12. I think it was a weird allreduce crash inside of ESMF that @bena-nasa and I took a while to track down. Since then, we've always run GEOS with both the GATHERV and ALLREDUCE settings with Intel MPI.
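
Concretely, in our Intel MPI job scripts that amounts to something like this (exact placement depends on your batch setup):

export I_MPI_ADJUST_GATHERV=3      # avoid the slow default gatherv algorithm
export I_MPI_ADJUST_ALLREDUCE=12   # works around the allreduce crash we saw inside ESMF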

LiamBindle commented 3 years ago

Thanks @mathomp4—I'll give that a try too.