NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Feature request: Testing CESM+DART infrastructure #463

Open kdraeder opened 1 year ago

kdraeder commented 1 year ago

Use case

It would be very helpful to people who use CESM+DART if the released versions of CESM have been tested to ensure that they provide the functionality required by DART, such as the ability to build and run large, multi-instance jobs and interact with DART as an External System Processing component.

Is your feature request related to a problem?

Recently developed CESM components can be used and tested with DART only if the CESM infrastructure continues to support it. In the past, CESM or CIME development has often neglected to test for this use, and the resulting release has been incompatible with DART (CESM issue #1807). People who are not experts in the CESM code then had to find the problems inefficiently and fix them suboptimally. This can take as long as it takes the CESM developers to release a new version, resulting in a seemingly endless cycle during which DART cannot be used with recent CESM versions.

Describe your preferred solution

(Further) Integrate testing of the functionality required by DART into the CESM testing suite. This is most important for major releases, which are most likely to be used with DART. But it would be useful whenever a new version of a component model is made available (if there's a DART interface to that component), so that DART can be used to evaluate it. We anticipate that the CESM testing would not involve running an assimilation, but just the features of CESM that enable it. Once the CESM tests pass, people who want to do the assimilation would do the full assimilation testing and model evaluation.

Describe any alternatives you have considered

See "Is your feature request related to a problem?"


The rest of the text is a discussion of strategies for implementing solutions.

Short list of issues

  1. Current status of testing (@alperaltuntas)
  2. Multi-instance + ensemble size
  3. SourceMods and other modified code (@braczka , sea ice person?)
  4. Compsets

Depth of testing

It's possible that a different level of testing could be done, based on which parts of CESM have been upgraded (e.g. multi-instance scripting vs a wave-model upgrade, which is currently irrelevant to DART) or what level of release is being made (e.g. CESM2_3_1 vs CESM3). Question: Does the testing currently vary, depending on these things?

Size of Ensemble

Several times in the past, tests using 2-3 members have passed, but tests with larger numbers have failed or shown unacceptable performance degradation. My (Kevin's) intuition is that at least 10 members are required to see the latter. For example, at one point the number of calls to the serial task, python build-namelist, grew as the square of the number of instances. That was not noticeable for 2 members, but for 80 it was 6400 calls.
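To put numbers on that scaling, here is a minimal sketch. It assumes the call count is exactly the square of the instance count, and the function name is hypothetical (it is not CIME code).

```python
def build_namelist_calls(n_instances: int) -> int:
    """Number of serial build-namelist calls, assuming quadratic growth."""
    return n_instances ** 2


# Compare small test ensembles with a production-sized one.
for n in (2, 3, 10, 80):
    print(f"{n:3d} instances -> {build_namelist_calls(n):5d} calls")
```

With 2 or 3 instances the count stays in single digits, which is why a small test ensemble would not expose the problem.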

Components

DART currently has interfaces to atm (CAM-FV, CAM-SE, WACCM(-X)), CLM5, POP(2?), and CICE. The potential next interfaces are to a river or land ice model. Work is underway to do assimilation with multiple components, but that will (probably) be hidden within the ESP component and not require testing by CESM. As of 2018-6, any compset that included a land ice component other than "stub" couldn't be used, because none of those other land ice models can use the Gregorian calendar. A compset defined (2018) specifically for CAM6 assimilations is FHIST_DARTC6 = HIST_CAM60_CLM50%SP_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV.
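A test could check the land ice restriction directly from the compset long name. The sketch below assumes the long name lists its parts in the order TIME_ATM_LND_ICE_OCN_ROF_GLC_WAV and that a stub land ice model is spelled SGLC, as in the FHIST_DARTC6 definition above. The function names are hypothetical, not part of CIME.

```python
# Assumed slot order of a CESM compset long name.
SLOTS = ("time", "atm", "lnd", "ice", "ocn", "rof", "glc", "wav")


def parse_compset(longname: str) -> dict:
    """Split a compset long name into its component slots."""
    parts = longname.split("_")
    if len(parts) < len(SLOTS):
        raise ValueError(f"expected at least {len(SLOTS)} parts, got {len(parts)}")
    # Any trailing parts beyond the eight assumed slots are ignored here.
    return dict(zip(SLOTS, parts))


def has_stub_land_ice(longname: str) -> bool:
    """True if the land ice slot is the stub model (assumed to be 'SGLC')."""
    return parse_compset(longname)["glc"] == "SGLC"


FHIST_DARTC6 = "HIST_CAM60_CLM50%SP_CICE%PRES_DOCN%DOM_SROF_SGLC_SWAV"
print(parse_compset(FHIST_DARTC6))
print(has_stub_land_ice(FHIST_DARTC6))
```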

SourceMods

Each DART interface may currently require SourceMods in order to build a model that works with DART effectively. These have been necessary partly because of the lack of testing within CESM. It may be easy, or at least appropriate, to incorporate some of them into CESM, while others may not be suitable. See attached files for examples. There may be changes to CIME, which we implemented for DART, which are not in the (cam.src) SourceMods. For example, /glade/work/raeder/Models/cesm2_1_relsd_m5.6/cime/src/drivers/mct/cime_config/buildnml has a time-saving upgrade that just changes the log file name in the modelio namelist files, instead of regenerating them in every assimilation cycle. There's no SourceMods mechanism for CIME code that I know of, so it needs to be substituted manually. Question: Do other components have non-SourceMods changes?

Size of the model(s)

We have not run into cases in which the resolution of the model was a factor in the testing or functionality of the code. Of course, it's always possible to exceed resources using a high resolution model, but that's not in the testing scope. So testing a "large" ensemble may not require the large number of nodes that can delay testing.

kdraeder commented 1 year ago

Here are the SourceMods I developed for the CAM6 Reanalysis: cesm2_1_relsd_m5.6z_DART+CAM_SourceMods.tgz

Here are the changes to CIME I made for the CAM6 Reanalysis, which are candidates for merging into main: DART_CIME_maint-5.6_mods.tgz. See the included file merge_list_2023-3-11 for details.

There were more changes needed (or useful), but they are too specific to DART and the Reanalysis to include. So we will still have mods that we need to install manually (unless there's a SourceMods mechanism for CIME).

kdraeder commented 1 year ago

It seems that we'll want to gather SourceMods (and other software variants) from the other components that have DART interfaces: CLM, POP, and CICE (MPAS?, ...?). I don't have reliable access to those.

kdraeder commented 1 year ago

CIME github issue #2455 shows that a multi-instance test for CAM ("dartcambigens") has been developed and is being used in pre-beta tests. This may work for other components, or it could be used as a template.

Jim Edwards would like DART to be able to run with no modifications to CIME, so I'll open an issue in the CIME github to handle importing our changes.

kdraeder commented 1 year ago

So far several CAM Reanalysis modifications to CESM2.1 have been resolved in CESM2.3 (CMEPS mode).

  1. The slow creation of modelio_nml files has been solved by handling them (in parallel) in Fortran code, instead of serially in Python.
  2. The inability of the driver to write "daily" auxiliary coupler history files ("forcing") at the end of a forecast that's < 24 hours has been fixed in CMEPS.
  3. The issue of incompatibilities between some (aux) history file names and their contents (related to averaging and the times in the files) appears to have been fixed by the uniformity and generality built into components/cmeps/cime_config/namelist_definition_drv.xml.

The next issue I'll try to resolve is DART's creation of several more file "types" that CESM's st_archive doesn't handle: means, spreads, obs_seq files, stages, etc.
The atm variation of this was controlled by

but there may be similar changes (hopefully the same) needed in the other components. My version is in /glade/u/home/raeder/cesm2_1_relsd_m5.6/SourceMods/src.cam/config_archive.xml. @braczka @amrhein @johnsonbk if you have any modified config_archive.xml for the ocn, lnd, etc., or strategies that you prefer for handling the new file types, please send them along. We may need to do this for CICE too, but without an expert on hand. I'd like to organize this before opening an issue in CESM.

hkershaw-brown commented 1 year ago

kdraeder commented 2 months ago

I'm working to include all DART output files in the st_archive process. I'd like to hear any thoughts about the following strategy.

The top level decision is that the assimilate.csh script for each component should rename the set of DART output files that we want to archive, using the CESM file naming convention. This minimizes the changes to CESM code and will make the DART+CESM interfaces more uniform. It should also handle coupled assimilation, which will (or at least may) create DART output files for multiple components, e.g. obs_seq.final files for both CAM and POP.

Then there were questions about files that are associated with both a component and DART, such as the ensemble of files for each stage. To me those seem like a kind of history file of the component (as opposed to the other 2 archive categories, restart and log), so I chose to archive those in the archive/$component/hist directories.
This is accomplished by adding a history file extension .e. to each component's config_archive.xml file and naming the files ${casename}.${comp}_${instance}.e.${stage}_${domain}.${date}.nc, e.g. St_arch_beta17_3inst.clm2_0001.e.forecast_d01.1850-01-01-21600.nc. This prevents the archive/esp/hist directory from becoming cluttered, and also results in shorter names.
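As a rough illustration of that naming pattern, the sketch below assembles a stage file name from its parts. It follows the example name given above and assumes the instance number is zero-padded to 4 digits. The function is hypothetical and is not taken from assimilate.csh.

```python
def stage_file_name(casename: str, comp: str, instance: int,
                    stage: str, domain: str, date: str) -> str:
    """Build a per-member stage file name using the proposed '.e.' extension."""
    # Instance is assumed to be zero-padded to 4 digits, as in 'clm2_0001'.
    return f"{casename}.{comp}_{instance:04d}.e.{stage}_{domain}.{date}.nc"


name = stage_file_name("St_arch_beta17_3inst", "clm2", 1,
                       "forecast", "d01", "1850-01-01-21600")
print(name)
```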

I chose to archive the mean and sd files of the ensemble and of the inflation in the archive/esp/hist directory, since they are more closely tied to DART than to the components, in my view. E.g. St_arch_beta17_3inst.dart.e.clm2_analysis_mean_d01.1850-01-01-21600.nc. If there's a good reason to put the _d01 somewhere else in the name, let me know.

The obs_seq.final files are also there, with a component in the name: St_arch_beta17_3inst.dart.e.cam_obs_seq_final.1850-01-01-21600

I also chose to rename the input.nml files as log files, e.g. da.cam.input.nml.log.5023008.desched1.240702-055618.gz. The log file from assimilate.csh is still da.log.5023008.desched1.240702-055618.gz. St_archive uses the 'log' in the name to archive them in the archive/logs directory.
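The proposed layout can be summarized as a small lookup from file name to archive directory. This is only a sketch of the strategy described in this comment, not how st_archive works internally (which is driven by config_archive.xml). The model-to-directory mapping (cam to atm, clm2 to lnd, pop to ocn, cice to ice) and the matching rules are assumptions made for illustration.

```python
import re

# Assumed mapping from the model name in a file name to its archive directory.
COMP_DIR = {"cam": "atm", "clm2": "lnd", "pop": "ocn", "cice": "ice"}


def archive_dir(filename: str):
    """Return the archive directory a renamed DART file would land in, or None."""
    if ".log." in filename:
        # Log files, including the renamed input.nml files.
        return "archive/logs"
    if ".dart.e." in filename:
        # Ensemble/inflation mean and sd files, and obs_seq.final files.
        return "archive/esp/hist"
    # Per-member stage files, e.g. '<case>.clm2_0001.e.forecast_d01.<date>.nc'.
    match = re.search(r"\.([a-z0-9]+)_\d{4}\.e\.", filename)
    if match and match.group(1) in COMP_DIR:
        return f"archive/{COMP_DIR[match.group(1)]}/hist"
    return None


for f in (
    "St_arch_beta17_3inst.clm2_0001.e.forecast_d01.1850-01-01-21600.nc",
    "St_arch_beta17_3inst.dart.e.clm2_analysis_mean_d01.1850-01-01-21600.nc",
    "St_arch_beta17_3inst.dart.e.cam_obs_seq_final.1850-01-01-21600",
    "da.cam.input.nml.log.5023008.desched1.240702-055618.gz",
):
    print(f"{archive_dir(f):18s} <- {f}")
```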

braczka commented 2 months ago

@kdraeder, Thank you and sorry for the delayed response. All of your choices seem reasonable to me. One alternative might be to create a separate 'archive/$component/DART' directory for all DART related files. However, we may be trying to stick to existing archived directories only?

kdraeder commented 2 months ago

Yeah, I'm trying to minimize the changes we request from CESM. But they may prefer your idea, so it should be part of the discussion. Kevin

