geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

[DISCUSSION] Enabling group-cyclic aggregator assignment helps stable successful C720 simulations on Lustre filesystem #271

Closed 1Dandan closed 1 year ago

1Dandan commented 1 year ago

Description

Hi, I am writing to share my experience running C720 simulations on NASA Pleiades. I am using GCHP v13.4.1 to run C720 simulations with and without the mass-flux feature. The restart file for C720 is very large (529 GB), and I saw only occasional success, with frequent failures at the stage of writing restart files. I eventually found that enabling group-cyclic aggregator assignment (export MPIO_LUSTRE_GCYC_MIN_ITER=1) resolved the I/O problem and allowed these large simulations to complete successfully.

After reading the NASA documentation (https://www.nas.nasa.gov/hecc/support/kb/porting-with-hpe-mpt_100.html), the reason appears to be: with a large number of compute nodes (processes), more aggregators compete for a limited number of write targets during collective buffering, which NAS calls lock contention. Enabling group-cyclic aggregator assignment alleviates this lock contention and therefore helps write large files in parallel.

My experience is that I get very stable, long-term successful runs at C720 when this is enabled, but only occasional success when it is not. Hopefully it can help other large simulations as well. Maybe we can add it (export MPIO_LUSTRE_GCYC_MIN_ITER=1) to the GCHP documentation (https://gchp.readthedocs.io/en/latest/)?

The complete script I used is:

#!/bin/bash
#PBS -S /bin/bash
#PBS -l select=60:ncpus=40:mpiprocs=40:model=sky_ele
#PBS -l walltime=28:00:00
#PBS -l site=needed=/home4+/nobackupp12+/nobackupp16
#PBS -l place=excl
#PBS -N C720
#PBS -m abe
#PBS -r y
#PBS -W group_list=s2387

# loading environment
source /u/dzhang8/gchp-intel.202209.env
cd $PBS_O_WORKDIR

module list # print loaded modules
set -e      # if a subsequent command fails, treat it as fatal (don't continue)
set -x      # for remainder of script, echo commands to the job's log file

#ulimit -c 0
#ulimit -l unlimited
#ulimit -u 50000
#ulimit -v unlimited
#ulimit -s unlimited

export MPI_LAUNCH_TIMEOUT=40
export PATH=$PATH:/u/scicon/tools/bin
export MPIO_LUSTRE_GCYC_MIN_ITER=1
#export MPI_VERBOSE=1
#export MPI_DISPLAY_SETTINGS=1
#export MPI_OMP_NUM_THREADS=1
export FOR_IGNORE_EXCEPTIONS=false
export MPI_COLL_REPRODUCIBLE
unset MPI_MEMMAP_OFF
unset MPI_NUM_MEMORY_REGIONS
export MPI_XPMEM_ENABLED=yes
unset SUPPRESS_XPMEM_TRIM_THRESH
unset PMI_RANK

# Name of the most recent checkpoint file
function last_checkpoint() {
    ls -1 gcchem_internal_checkpoint*.nc4 | tail -n 1
}

# Extract "YYYYMMDD HHMMSS" from the most recent checkpoint filename
function last_checkpoint_datetime() {
    last_checkpoint | sed 's/gcchem_internal_checkpoint.\(20[12][0-9][0-9][0-9][0123][0-9]\)_\([0-9][0-9]00\).*/\1 \200/'
}

RESTART_DATE=SIM_START

# Execute simulation: start fresh if no CONTINUE_SEM marker exists,
# otherwise restart from the most recent checkpoint
if [ ! -f CONTINUE_SEM ] ; then
    rm -f cap_restart #gcchem*
    ./runConfig.sh
    touch CONTINUE_SEM
else
    RESTART_FILE=$(last_checkpoint)
    RESTART_DATE=$(last_checkpoint_datetime)
    echo "$RESTART_DATE" > cap_restart
    sed -i "s/GCHPchem_INTERNAL_RESTART_FILE: .*/GCHPchem_INTERNAL_RESTART_FILE: $RESTART_FILE/g" GCHP.rc
fi

rm -f gcchem_internal_checkpoint
several_tries mpiexec -np 2400 mbind.x -cs -v ./gchp.new &> run-HTAPv3-$(date +"%Y%m%d_%H%M").log
pbs_release_nodes -j $PBS_JOBID -a
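
For reference, here is a small illustration of how the last_checkpoint_datetime helper turns a checkpoint filename into the timestamp written to cap_restart. The example filename is an assumption about the naming pattern, not taken from an actual run:

```bash
# Hypothetical checkpoint filename, assumed to follow a
# gcchem_internal_checkpoint.<YYYYMMDD>_<HHMM>... naming pattern
echo "gcchem_internal_checkpoint.20190701_0600z.nc4" | \
  sed 's/gcchem_internal_checkpoint.\(20[12][0-9][0-9][0-9][0123][0-9]\)_\([0-9][0-9]00\).*/\1 \200/'
# prints: 20190701 060000   (the "YYYYMMDD HHMMSS" form expected in cap_restart)
```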

After running source /u/dzhang8/gchp-intel.202209.env, the loaded modules are:

  1) git/2.30.2               4) other/mepo               7) comp-gcc/11.2.0-TOSS3   10) hdf4/4.2.12             13) netcdf/4.4.1.1_serial
  2) cmake/3.21.0-TOSS3       5) other/gh                 8) comp-intel/2020.4.304   11) szip/2.1.1            
  3) other/manage_externals   6) ImageMagick/7.0.8-53     9) mpi-hpe/mpt.2.25        12) hdf5/1.8.18_serial

ESMF v8.0.0 is built with the Intel compiler and the MPT library.

lizziel commented 1 year ago

Hi @1Dandan, this is very interesting and helpful. Thank you for sharing. We will definitely add this to the documentation on ReadTheDocs. I am also tagging some of the NASA folks so they know: @tclune, @mathomp4, @bena-nasa

Would you be okay with sharing your Pleiades run script and environment file in the GEOS-Chem repository for others to use? The files would be stored in run/GCHP/sampleRunScripts/operational_examples/nasa_pleiades. If you are okay with this, you could either paste your environment file here for us to add, or submit the two files yourself via a pull request to the geos-chem repo; one possible workflow for that is sketched below. The advantage of a pull request is that your name will be associated with the update.
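
A minimal sketch of that pull-request workflow, assuming a fork of geoschem/geos-chem under your own GitHub account; the fork URL, branch name, and local file paths are placeholders:

```bash
# Clone your fork of geos-chem (URL is a placeholder) and create a branch
git clone https://github.com/<your-username>/geos-chem.git
cd geos-chem
git checkout -b feature/pleiades-run-script

# Add the run script and environment file in the suggested location
mkdir -p run/GCHP/sampleRunScripts/operational_examples/nasa_pleiades
cp /path/to/your/run_script.pbs /path/to/your/gchp-intel.202209.env \
   run/GCHP/sampleRunScripts/operational_examples/nasa_pleiades/

# Commit, push, then open a pull request against geoschem/geos-chem on GitHub
git add run/GCHP/sampleRunScripts/operational_examples/nasa_pleiades
git commit -m "Add NASA Pleiades example run script and environment file"
git push origin feature/pleiades-run-script
```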

tclune commented 1 year ago

Very interesting. Thank you for pushing on this.

lizziel commented 1 year ago

I just realized your script is for 13.4. We have some run script changes that you will need to incorporate for it to work with 14.0, which would affect doing a PR with your files. Would you like to upgrade to 14.0 and then submit your files? If not, no worries; we can adapt what you have into an example script. @Jourdan-He, could you follow up on this since you are using Pleiades?

Jourdan-He commented 1 year ago

@lizziel Sure, I'll be happy to.

1Dandan commented 1 year ago

For sure. I actually do have a submission script for version 14.0.0: run-2400.pbs.txt, along with the GCHP environment file: gchp-intel.202209.env.txt. I have enabled public access for the module file included above, and I also attach it here as a supplement: 2022-09.Intel.txt

mathomp4 commented 1 year ago

Hmm. Never knew about that flag. I'll try it out myself with GEOS and see what's what. Might be useful.

Jourdan-He commented 1 year ago

Hi, I've created a PR for this issue. https://github.com/geoschem/geos-chem/pull/1563.

mathomp4 commented 1 year ago

One note: I asked NAS Support, and apparently if you load their MPT modules you get MPIO_LUSTRE_GCYC_MIN_ITER=1 by default.

1Dandan commented 1 year ago

Thanks @mathomp4 for double-checking. You are right: with a quick check I see that MPIO_LUSTRE_GCYC_MIN_ITER=1 is already set in the MPT module file. I am sorry for the confusion.
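
For anyone else wanting to verify this on their own system, a quick sketch of the kind of check involved; the module name comes from the module list above, and whether module show prints the setting depends on how NAS packages the modulefile:

```bash
# Inspect the MPT modulefile for the group-cyclic aggregator setting
module show mpi-hpe/mpt.2.25 2>&1 | grep -i GCYC

# After loading the module, confirm the variable is present in the environment
module load mpi-hpe/mpt.2.25
env | grep MPIO_LUSTRE
```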

Hmmm... at the same time that I set MPIO_LUSTRE_GCYC_MIN_ITER=1, I also set the stripe count to a static value of 50 (although the maximum stripe count on /nobackupp12, the filesystem my run directory is on, is actually 23, and the NASA Lustre system already implements dynamic striping). Before I made any changes, the simulation crashed roughly 60% of the time at the stage of writing the restart file. I will try to submit a test simulation again to confirm.
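
For reference, striping is adjusted with the standard Lustre lfs tool; this is a minimal sketch, and the directory path and stripe count are just the values mentioned above, not a recommendation:

```bash
# Check the current default stripe settings of the run directory
lfs getstripe -d /nobackupp12/<username>/<run_directory>

# Set a static stripe count of 50 on the run directory
# (files created afterward in this directory inherit the setting)
lfs setstripe -c 50 /nobackupp12/<username>/<run_directory>
```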

I think we may need more tests before merging the pull request, until we really find the source of the crashes in these big simulations. @Jourdan-He

Sorry again for any confusions.

Jourdan-He commented 1 year ago

@1Dandan Sure! I'll update the PR when sources have been identified.

1Dandan commented 1 year ago

Just to update the status: I am currently experiencing significant slowness in my C720 simulations on NASA Electra and have asked NASA support for help in understanding the cause.

1Dandan commented 1 year ago

Hi, I am writing to update on the progress. As pointed out by @mathomp4, the environment variable MPIO_LUSTRE_GCYC_MIN_ITER=1 is already set when the MPT module is loaded, so exporting it in the submission script, as I initially suggested, actually did nothing.

For the stripe-count setting, I have tested multiple 1-day walltime simulations with and without changing the default stripe count, and all of them ran well, although I tested them during the holidays when the I/O load on the filesystem was lighter.

Regarding the original issue, my C720 simulations on NASA Pleiades appear to be very sensitive to the state of the filesystem. The throughput for the same restarted simulation can vary from 2.1 to 3.8 simulated days per wall-clock day for 1-day walltime jobs, and in the worst cases a C720 simulation either slows to a throughput as low as 0.9 days per day or crashes, usually at the stage of writing restart files.

Thus, I think the previously observed crashes were probably caused by periods of high I/O load on the filesystem coinciding with the runs I had slightly modified, which led me to the wrong conclusion that the fix was "enabling" the group-cyclic aggregator (in fact already enabled when the MPT module is loaded). I should have exported the MPI verbose settings to verify the environment before changing anything, for example as sketched below. I am very sorry for the confusion caused by raising this discussion.
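
A minimal sketch of that verification step, using the two MPT variables that are already present (commented out) in the run script above; the grep pattern is only an assumption about what to look for in the job log:

```bash
# Ask MPT to print its runtime settings into the job log at launch,
# then confirm the group-cyclic aggregator setting is reported
export MPI_VERBOSE=1
export MPI_DISPLAY_SETTINGS=1
grep -i "GCYC" run-*.log
```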

In terms of running big simulations, besides improving the stability of MPI-IO when writing restart files, I am wondering whether there is any way to reduce the size of the restart files, for example by dropping some short-lived species whose concentrations are not very sensitive to the restart values. I guess that would require a lot of work. Nonetheless, I think this ticket is ready to close and the pull request is no longer needed, @Jourdan-He. I will open a new issue if I find anything else useful for running C720 simulations.
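
Purely as an illustration of that idea, and not something GCHP necessarily supports out of the box: selected species variables could be dropped from a NetCDF restart file with the NCO tools, assuming the model can re-initialize any species missing from the restart. The species and file names below are placeholders:

```bash
# Illustration only: exclude two hypothetical species variables from a restart
# file using NCO's ncks (-x -v excludes the listed variables; all others copied)
ncks -x -v SPC_HO2,SPC_OH gchp_restart_in.nc4 gchp_restart_out.nc4
```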

Jourdan-He commented 1 year ago

@1Dandan Thanks for the update!

Jourdan-He commented 1 year ago

I'll close this issue. Please feel free to open a new one if you find anything else useful.