I received a similar error running a C360 sim on 1200 cores. The error message I got was:
ExtData Run_: Calculating derived fields
ExtData Run_: End
Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
Using parallel NetCDF for file: gcchem_internal_checkpoint.20160701_0000z.nc4
[compute1-exec-78:49 :0:49] ud_iface.c:747 Fatal: transport error: Endpoint timeout
==== backtrace ====
0 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_message+0x60) [0x7fb40badbaa0]
1 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucs.so.0(ucs_fatal_error_format+0xde) [0x7fb40badbc0e]
2 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x4d355) [0x7fb4035d2355]
3 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(uct_ud_iface_dispatch_async_comps_do+0x10b) [0x7fb4035d246b]
4 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/ucx/libuct_ib.so.0(+0x5bb90) [0x7fb4035e0b90]
5 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/ucx-1.6.1-mcuwvv4bxhdyir2feixbksjpmymja2s7/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fb40bf3cdba]
6 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_progress+0x17) [0x7fb40da757d7]
7 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libopen-pal.so.40(opal_progress+0x2b) [0x7fb40a51a3ab]
8 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(mca_pml_ucx_send+0x275) [0x7fb40da77645]
9 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi.so.40(PMPI_Gatherv+0x190) [0x7fb40d95f830]
10 /opt/spack/opt/spack/linux-centos7-skylake_avx512/gcc-8/openmpi-3.1.5-cfthlxoibydafwjia7vserai7ta7ip56/lib/libmpi_mpifh.so.40(MPI_Gatherv_f08+0xab) [0x7fb40dfb1d9b]
11 /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x136e57e]
12 /scratch1/liam.bindle/C1AT/GCHPctm/build-gnu/bin/geos() [0x1380cec]
...
My libraries are:
bash-4.2$ spack find --loaded
==> 14 installed packages
-- linux-centos7-skylake_avx512 / gcc@8 -------------------------
esmf@8.0.0 hdf5@1.10.6 hwloc@1.11.11 libnl@3.3.0 libpciaccess@0.13.5 libxml2@2.9.9 lsf@10.1 netcdf-c@4.7.3 netcdf-fortran@4.5.2 numactl@2.0.12 openmpi@3.1.5 rdma-core@20 ucx@1.6.1 zlib@1.2.11
In my run I have NX=10, NY=120. I used 30 cores per node across 40 nodes with 300 GB of memory per node. Let me know if there's any more information that I can provide.
A couple of things. First, do you set any OMPI_ environment variables or pass any mca options to the mpirun command?
Second, as a test, can you see if adding:
WRITE_RESTART_BY_OSERVER: YES
to AGCM.rc (or your equivalent) does anything? I set it when I run GEOS with Open MPI, but that's actually for a performance reason, not a 'things go crash' reason. (Or, conversely, if you already run with that set to YES, can you try it with NO.)
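A minimal sketch of what the entry looks like, just so we are looking at the same thing (AGCM.rc shown; for GCHP the equivalent file is GCHP.rc):
# in AGCM.rc (or your equivalent resource file)
WRITE_RESTART_BY_OSERVER: YES    # set to NO to exercise the default gather path instead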
For me, regarding the first question about OMPI_ environment variables and mca options: my only OMPI MCA setting is:
bash-4.2$ env | grep OMPI_
OMPI_MCA_btl_vader_single_copy_mechanism=none
I must admit, I'm not familiar with these settings. Our sysadmin set this and I've used it blindly.
For the second point, I'll give WRITE_RESTART_BY_OSERVER a try and report back once the job has run!
I've got a rerun in the queue with the WRITE_RESTART_BY_OSERVER setting and should have results tomorrow.
I actually use srun rather than mpirun. I have nothing containing OMPI in my environment. Should I? My Open MPI build settings are summarized in this file in case it's useful:
ompi_info.txt
It looks pretty standard. You built with UCX instead of verbs, which I think is the current preferred method for InfiniBand. I will note I often have issues with srun on Discover, but maybe Cannon is different. I tend to stick with mpirun because I'm old. I suppose you could try that, but in the end I imagine Open MPI does the same thing.
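If you do want to compare launchers directly, a minimal sketch of the two forms (the core count is illustrative, and the srun MPI plugin flag depends on how SLURM is set up on Cannon):
$ srun -n 1440 --mpi=pmix ./geos     # SLURM's launcher
$ mpirun -np 1440 ./geos             # Open MPI's own launcher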
The OMPI_MCA_btl_vader_single_copy_mechanism setting is one I've seen before when using Open MPI with containers, and indeed:
https://github.com/GEOS-ESM/MAPL/blob/34eae4b436695ded67a9830ffe47a286de897bc9/.circleci/config.yml#L10
so I can't complain.
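For reference, that parameter can be set either through the environment, as you have it, or directly on the mpirun command line with Open MPI's --mca syntax; the two forms below should be equivalent (command line is a sketch only):
$ export OMPI_MCA_btl_vader_single_copy_mechanism=none
$ mpirun --mca btl_vader_single_copy_mechanism none -np 1440 ./geos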
If I had a thought from what you've said, it might be to try a newer version of UCX, say one in the 1.8 series or the new 1.9.0. Though maybe that'll just cause different errors...
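Since your stack is Spack-built, a rough sketch of what that could look like (the version and variant names here are assumptions; check what your site's Spack actually provides):
$ spack install ucx@1.9.0
$ spack install openmpi@3.1.5 fabrics=ucx ^ucx@1.9.0   # rebuild Open MPI against the newer UCX
$ spack load openmpi@3.1.5 ^ucx@1.9.0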
I checked this morning and my sim is ~3 days in, so it looks like WRITE_RESTART_BY_OSERVER: YES worked! Thanks! What does this switch do?
This is now solved for me as well. My first run kept srun but added WRITE_RESTART_BY_OSERVER. It crashed before even writing the first checkpoint, but I think this was due to cluster issues, given that a couple of other runs inexplicably failed but are fine today. For my latest run I switched to mpirun, which adds some uncertainty about what exactly the fix was: srun or not using the o-server for the restart write. I'll narrow it down.
On the question of what WRITE_RESTART_BY_OSERVER actually does:
I'll ping @weiyuan-jiang on this thread to be more specific, but when I was trying to run with Open MPI on Discover, I found that it was taking ages to write out restarts. I think I eventually tracked it down to Open MPI having some bad MPI_GatherV (or Gather? I can't remember) timings. Like stupid bad. And guess what calls are used when writing checkpoints/restarts? 😄
So I asked around, and it turns out @weiyuan-jiang added a (somewhat hidden) ability for the IOSERVER to write the restarts instead of the "normal" path. The IOSERVER uses Send/Recv I think, so it bypassed the badly performing call.
Now, I will say that in our GEOSldas system @weiyuan-jiang found some sort of oddity happening with the WRITE_RESTART_BY_OSERVER method. I can't remember what it was (binary restarts?), but I have never seen any issues in my testing with the GCM.
This is great. I'll add the line to the default GCHP.rc file for the GCHP 13.0.0 release, pending a response from @weiyuan-jiang on what the observed oddity was, of course.
@lizziel Note that I only turn this on with Open MPI. I keep our "default" behavior with Intel MPI, etc. because, well, it works so don't rock the boat.
(Well, we do need I_MPI_ADJUST_GATHERV=3 for Intel MPI because the other GatherV algorithms seemed to have issues on our system, etc.)
I am checking with @bena-nasa. Eventually, we will eliminate the parameter WRITE_RESTART_BY_OSERVER. For now, without this parameter the program goes down a different branch, which may cause problems.
If you do eliminate it, that would probably mean I have to stop using Open MPI on discover. It is the only way I can write checkpoints due to the crazy slow MPI_GatherV performance.
Could you update checkpoint writing to be similar to History writing so it avoids the problem?
Even when we use WRITE_RESTART_BY_OSERVER, we still use mpi_gatherV. I am wondering if that is the problem.
@weiyuan-jiang But isn't it the case that when we use the OSERVER, the gatherv() is on a much smaller set of processes? For the main application there are many cores, and therefore many very small messages. On the server there are far fewer cores, and thus fewer, larger messages.
The oserver does not have mpi_gatherV. This gatherV happens only on the client side, in 1-D tile space: the client gathers all the data and then sends it through the oserver. For multi-dimensional fields, WRITE_RESTART_BY_OSERVER bypasses the gatherV. @tclune
@lizziel Do you have any problems after you set WRITE_RESTART_BY_OSERVER to yes?
It looks like you hit an MPI problem in a gatherV. As Weiyuan said, if you use the write-by-oserver option it bypasses the gatherV and takes a whole different code path to write the checkpoint. So you have sidestepped the problem by not exercising the code that was causing the initial problem.
I have not noticed any run issues after setting WRITE_RESTART_BY_OSERVER to yes.
I am going to close this issue. Please keep @LiamBindle and myself informed if there is a new fix in a future MAPL release, or if this fix is to be retired without a replacement for the problem.
Last week I tried GCHP at C360 with Intel MPI on Compute1 (WashU cluster) and saw that the checkpoint file was being written extremely slowly. I saw @mathomp4's comment about I_MPI_ADJUST_GATHERV=3, so I tried it and it fixed my problem. Thanks @mathomp4!
@LiamBindle The other one to watch out for is I_MPI_ADJUST_ALLREDUCE. For some reason (on Discover) we had to set that to 12. I think it was a weird allreduce crash inside of ESMF that @bena-nasa and I took a while to track down. Since then, we've always run GEOS with both the GATHERV and ALLREDUCE settings with Intel MPI.
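For reference, a minimal sketch of how we set both in the job script before launching, in case it's useful on Compute1 (the values are the ones mentioned above; whether they are needed will depend on your Intel MPI version and fabric):
export I_MPI_ADJUST_GATHERV=3      # pick a GatherV algorithm that behaves on our system
export I_MPI_ADJUST_ALLREDUCE=12   # work around an Allreduce issue we hit inside ESMF
mpirun -np 1440 ./geos             # core count is illustrative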
Thanks @mathomp4—I'll give that a try too.
I've been running GCHPctm with MAPL 2.2.7 for various grid resolutions and core counts on the Harvard Cannon cluster. I am encountering an error while writing checkpoint files when running with high core counts, in my case 1440 cores. The error is in UCX, so not MAPL specifically, but it is specific to the MAPL checkpoint files:
My libraries are as follows (plus UCX 1.6.0):
My run is at c180, with NX=16 and NY=90. I am using 24 cores per node across 60 nodes, reserving the full 128 GB of memory on each node. Originally I encountered this error at the start of the run because I had periodic checkpoints configured (RECORD_* in GCHP.rc; see the sketch below), which causes a checkpoint to be written at the beginning of the run. I turned that off, and my run then got to the end and successfully wrote History files, but it hit the same issue again when writing the final checkpoint file.
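For reference, these are the periodic-checkpoint settings I mean (key names as I recall them from GCHP.rc; values are illustrative, and commenting the lines out is how I disabled the mid-run checkpoint):
# in GCHP.rc
#RECORD_FREQUENCY: 240000
#RECORD_REF_DATE:  20160701
#RECORD_REF_TIME:  000000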
@LiamBindle also encountered this problem on a separate compute cluster with c360 using 1200 cores.
Have you seen this before?