NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Multi-cplr (CESM) ensemble hindcast development #34

Closed timhoar closed 3 years ago

timhoar commented 4 years ago

Ed note: This issue was originally reported 28 April 2016 and is being manually ported over to GitHub.

running 1/10th degree pop in multi-instance may be infeasible because:

1) communication bottlenecks in the coupler
2) sequential initialization time
3) number of nodes needed to run all instances at the same time

this means we will have to script running N separate instances of CESM, like the old days, and then moving the files into place so filter can run.

timhoar commented 4 years ago

Nancy Collins added a comment - 10/May/16 2:02 PM

george modica (AER) sent us his scripts for doing this with WACCM.

Kevin Raeder added a comment - 16/Aug/16 2:29 PM

Raffaele Montuoro (TAMU) has figured out a way to make a CESM job run multiple couplers, so the initializations can be done in parallel. He plans to use this in his regional coupled model (cplr7 + CLM + WRF + ROMS), but it should be useable in CESM assimilations.

The mechanism involves setting up a separate communicator for each coupler and running each coupler in a separate directory. It has a footprint in

cime/driver_cpl/cime_config/config_component.xml
cime/cime_config/cesm/machines/Makefile
cime/driver_cpl/driver/cesm_comp_mod.F90
cime/driver_cpl/driver/cesm_driver.F90

(and probably more; see items below). The .F90 files can be included in a SourceMods, but the other changes need to be made in the CESM code tree (or inserted by our build script?).
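To make the "separate communicator for each coupler" idea concrete, here is a minimal sketch of how global ranks could be grouped per coupler. It assumes each coupler gets a contiguous block of ranks of equal size, which is an assumption for illustration and not necessarily how Raffaele's code assigns them. The names `coupler_color` and `split_ranks` are hypothetical. In a real MPI code the returned index would be the `color` argument to `MPI_Comm_split`.

```python
def coupler_color(rank: int, tasks_per_coupler: int) -> int:
    """Return the coupler index ("color") for a global rank.

    Assumes contiguous, equal-sized blocks of ranks per coupler
    (an illustrative assumption, not taken from the CESM source).
    """
    if tasks_per_coupler <= 0:
        raise ValueError("tasks_per_coupler must be positive")
    return rank // tasks_per_coupler


def split_ranks(total_tasks: int, tasks_per_coupler: int) -> dict:
    """Group global ranks by coupler index, mimicking a communicator split."""
    groups = {}
    for rank in range(total_tasks):
        color = coupler_color(rank, tasks_per_coupler)
        groups.setdefault(color, []).append(rank)
    return groups


if __name__ == "__main__":
    # 6 tasks, 2 per coupler: three groups of two ranks each.
    print(split_ranks(6, 2))
```

Each group would then run its own coupler, with its own run directory, independently of the others.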

We're putting it in cesm1_5_beta03 and using Kevin's $trunk_cam/shell_scripts/CESM1_5_setup_multicplr to build the case. Issues that are currently preventing this from working:

1) The calculation of NTASKS (not user settable) has not been sorted out. NTASKS is used to request the total number of tasks for the job, while Raffaele's algorithm wants to set NTASKS_${COMP} = the number needed by a single coupler. NTASKS comes from NTASKS_${COMP}, which are derived from $num_instances, $ptile, $num_cplrs, and $nodes_per_instance.
2) Each coupler runs in a separate run directory. The directories must be created and populated with the correct restart files for the instance(s) handled by that coupler.
3) There's a new namelist, which must be recognized by the CESM build scripts and properly broadcast.
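The NTASKS mismatch in item 1 can be illustrated with some arithmetic. This is only a sketch of one plausible relationship between the variables named above; the actual formula in the setup script is not shown in this thread. It assumes $ptile means tasks per node and that instances divide evenly among couplers. The function name `task_counts` is hypothetical.

```python
def task_counts(num_instances: int, num_cplrs: int,
                nodes_per_instance: int, ptile: int) -> tuple:
    """Return (tasks needed by one coupler, total tasks for the job).

    Assumed relationship (not taken from the CESM or DART scripts):
      tasks per coupler = instances per coupler * nodes per instance * ptile
      total tasks       = tasks per coupler * number of couplers
    """
    if num_instances % num_cplrs != 0:
        raise ValueError("num_instances must be a multiple of num_cplrs")
    instances_per_cplr = num_instances // num_cplrs
    tasks_per_cplr = instances_per_cplr * nodes_per_instance * ptile
    return tasks_per_cplr, tasks_per_cplr * num_cplrs


if __name__ == "__main__":
    per_cplr, total = task_counts(num_instances=30, num_cplrs=30,
                                  nodes_per_instance=2, ptile=36)
    print(per_cplr, total)
```

Under these assumptions the per-coupler value is what NTASKS_${COMP} would hold, while the job request needs the second, larger number.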

Kevin Raeder added a comment - 18/Aug/16 9:52 AM - edited

Kevin is making set-up scripts to build and run a multi-coupler F compset CESM1_5. The modified files are temporarily in /glade/u/home/raeder/DART/Trunk/models/cam/shell_scripts/

NINST_CPL and NINST_CPL_PREFIX will be XML variables available to DART's build scripts. PREFIX is the character string used to build the subdirectory names where the couplers will run.
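As a rough sketch of how a script might turn those two variables into per-coupler run directories: the naming pattern below (prefix, underscore, four-digit index) is an assumption chosen for illustration, and the helper names `coupler_run_dirs` and `create_run_dirs` are hypothetical. The actual subdirectory format used by the CESM scripts is not stated here.

```python
from pathlib import Path


def coupler_run_dirs(run_root: str, prefix: str, ninst_cpl: int) -> list:
    """Build one run-directory path per coupler.

    The "<prefix>_NNNN" pattern is an illustrative assumption.
    """
    return [Path(run_root) / f"{prefix}_{i:04d}" for i in range(1, ninst_cpl + 1)]


def create_run_dirs(run_root: str, prefix: str, ninst_cpl: int) -> list:
    """Create the directories; restart files would be staged into each one afterwards."""
    dirs = coupler_run_dirs(run_root, prefix, ninst_cpl)
    for d in dirs:
        d.mkdir(parents=True, exist_ok=True)
    return dirs
```

Populating each directory with the right restart files for its instance(s) would be a separate step after the directories exist.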

Raffaele is working on implementing the multi_coupler capability in the CESM scripts, which is much more work than implementing it in the source code, which he has already done.

Nancy Collins added a comment - 18/Aug/16 10:33 AM

i was confused by this, sorry, so here's my attempt at restating kevin's comment: raffaele has already implemented the multi-coupler functionality in the cesm source code, and now he's struggling through supporting this multi-coupler capability in the cesm scripting (which is turning out to be more tedious/harder than it seems like it should be).

Kevin Raeder added a comment - 07/Sep/16 4:05 PM

Raffaele and Kevin succeeded in adapting the CESM build and run scripts to the multi-coupler capability. Testing showed that results reproduce the single coupler output, both from CESM and from DART. But the build (and submit) time for full sized ensembles became impossibly long. Alicia found a partial solution by moving the pre-run script subroutines out of the loop over cycles, but the build is still impossibly slow. She believes that it stems from preview_namelists being called NINST_CPL^2 times; preview_namelists is called in a NINST_CPL loop in case.run, and preview_namelists has its own NINST_CPL loop around the buildnml calls. She has removed enough of that so that she can build a 30 member, B compset case. There may also be a task which is done NINST$COMPONENTS^2 times.
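A toy model of the nested loops described above shows why the cost grows quadratically. It only counts calls and does not reproduce any CESM code. The `hoisted` flag represents a hypothetical fix in which preview_namelists is called once rather than once per coupler; it is not necessarily the change Alicia made.

```python
def count_buildnml_calls(ninst_cpl: int, hoisted: bool = False) -> int:
    """Count buildnml invocations for a given number of couplers.

    Outer loop: stands in for case.run calling preview_namelists per coupler.
    Inner loop: stands in for preview_namelists looping over couplers itself.
    """
    calls = 0
    outer_iterations = 1 if hoisted else ninst_cpl
    for _ in range(outer_iterations):
        for _ in range(ninst_cpl):
            calls += 1
    return calls


if __name__ == "__main__":
    for n in (10, 30, 80):
        print(n, count_buildnml_calls(n), count_buildnml_calls(n, hoisted=True))
```

For a 30 member case the nested version makes 900 calls, against 30 when the outer loop is removed.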

Nancy Collins added a comment - 01/Nov/16 9:21 AM

raffaele isn't working on this anymore - CSEG will be working on this eventually.

Kevin Raeder added a comment - 22/Dec/16 1:39 PM

Raffaele resumed work on this. He refactored it so that it does not use a separate run directory for each coupler. Alicia and I tested it, and found some speed up in the ensemble forecast, but some slow down in the scripting to set up the forecast. He's working with cesm1_5_beta03, so more recent cesms may have better scripting, which will reduce the slow down. A code review with Mariana is in the works.

Kevin Raeder added a comment - 22/Dec/16 1:55 PM - edited

Raffaele traced the 'scripting' slow down to the generation of timing file(s). There was only one in the single-coupler case, but there are num_ens in the multi-coupler case, which were being generated sequentially. Parallelizing them results in the timing file creation taking only a few seconds, compared to (a few seconds x num_ens).
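The general pattern of replacing a sequential per-member loop with a parallel one can be sketched as follows. This is not the CESM timing code: `write_timing_file` is a hypothetical stand-in that writes a placeholder file, and using a thread pool is an assumption about how the work might be parallelized.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_timing_file(out_dir: str, member: int) -> Path:
    """Hypothetical stand-in for generating one member's timing file."""
    path = Path(out_dir) / f"timing_{member:04d}.txt"
    path.write_text(f"placeholder timing data for member {member}\n")
    return path


def write_all_timing_files(out_dir: str, num_ens: int, max_workers: int = 8) -> list:
    """Generate all members' timing files concurrently instead of one by one."""
    members = range(1, num_ens + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves member order in the returned list.
        return list(pool.map(lambda m: write_timing_file(out_dir, m), members))
```

With enough workers, the wall-clock time approaches that of a single file rather than num_ens times that.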

Nancy Collins added a comment - 28/Mar/17 2:23 PM

there are 2 issues here - running N independent CESM jobs, and running with multiple couplers. also kevin is supposed to bug mariana weekly (weakly?) to get multicoupler into the main CESM 2.

Kevin Raeder added a comment - 20/Mar/18 2:51 PM

I believe that the multi_drv (was variations of "multi-coupler") option has been fully implemented in CESM2 and scales more or less as expected with increasing ensemble size. The necessary build and run changes are being incorporated into the cam-fv/shell_scripts/cesm2_0 setup and assimilate scripts (Not in Manhattan as of 2018-3-21). There were impacts on st_archive too, which are incorporated into cimes after cime5.4.0-alpha23, which is packaged with CESMs after cesm2_0_alpha08f.

This does run all members at the same time, so it does not solve the issue of needing to run one instance at a time due to a limited number of nodes, communication bottlenecks, or similar issues.

Nancy Collins added a comment - 21/Mar/19 2:10 PM

multi-driver is done. find (or create) 'run 1 instance at a time' request and then close this one.

kdraeder commented 3 years ago

There are additional accelerations of the CESM multi-instance scripting, which came out of the CAM6+DART Reanalysis. They involve the python generation of CESM namelist files. They are available in https://github.com/kdraeder/cime: branch cime_reanalysis_2019: src/drivers/mct/cime_config/buildnml. Also use --skip-preview-namelist option to case.submit, which prevents running preview_namelists for each of the DATA_ASSIMILATION_CYCLES.