Reproducibility issues in HEMCO_CESM and investigation #31

ESCOMP / HEMCO_CESM

CESM/CAM interface to modular HEMCO chemistry emissions module


This issue thread documents the reproducibility issues in HEMCO within CESM2, which should eventually be fixed to resolve https://github.com/ESCOMP/CAM/issues/856

For debugging HEMCO_CESM, it is suggested to use CAM-chem compsets (e.g., FCnudged, FC2010climo, ...) because CAM-chem is known to be bit-for-bit (b4b) reproducible and GEOS-Chem compsets likely are not. The goal of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit across restarts, different MPI decompositions, and different OpenMP threading configurations.

Test/debug workflow

This setup will help debug the issues.

Check out ESCOMP/CESM (https://github.com/ESCOMP/CESM). The tag cesm2_3_alpha17c was used here, but any release that includes HEMCO (post-cam6_3_118) should work.
./manage_externals/checkout_externals
It may be useful to use the hplin/debug_parallel branch of jimmielin/HEMCO_CESM (https://github.com/jimmielin/HEMCO_CESM/tree/hplin/debug_parallel) for components/cam/src/hemco, as it adds debug printouts that appear in cesm.log.
Create a case: ./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project UHAR0022. The f10_f10_mg37 resolution is 10x15 degrees, which is coarse enough to run on 1 core. I suggest using FC2010climo or another compset that is not FCnudged, so that configuring nudging / met fields in user_nl_cam can be avoided.
cd to the case directory and run ./xmlchange NTASKS=1 for a single core, NTASKS=2 for two cores, etc. For the 10x15 case, use NTHRDS=1 (I have not successfully run with more than 1 thread on this grid).
Run ./case.setup --reset, then add the following to user_nl_cam:

```
hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',

cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,

fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,
```

The test config file /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc contains only CEDS emissions of NO, CO, and NH3, with a 1x1 gridded scale factor applied to NO. This makes it easier to debug and much quicker to run.

./case.build -v
To run on more than 1 core on Derecho: you will run into a cryptic numactl error (https://github.com/NCAR/mpibind/issues/5). As a workaround, edit env_batch.xml and change the command in <directive gpu_enabled="false"> so that it always requests 128 cores from the scheduler (replace {{ max_tasks_per_node }} with 128):
```
<directive> -l select={{ num_nodes }}:ncpus=128:mpiprocs={{ tasks_per_node }}:ompthreads={{ thread_count }}:mem=230GB</directive>
```
In env_run.xml, set RUN_STARTDATE=2016-01-01, STOP_OPTION=nhours, and STOP_N=3 (shorter runs may not work due to coupling intervals).
Submit the case.
Create separate case directories for 1 core, 2 cores, etc., because a clean recompile is needed to change the core configuration.
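Because each core count needs its own case directory, it can help to generate the command list per case rather than retype it. The following is only a sketch that prints the commands from the steps above without running them. The case naming scheme (`f10_<N>core`) and the `case_commands` helper are hypothetical and not part of CIME, and the manual edits (user_nl_cam, env_batch.xml) are left as comment placeholders.

```python
from pathlib import PurePosixPath

# Values copied from the steps above. The per-case directory naming is a
# hypothetical convention chosen for this sketch.
CASE_ROOT = PurePosixPath("~/2403_dev_hco_2.3")
COMPSET = "FC2010climo_HCO"
RES = "f10_f10_mg37"


def case_commands(ntasks: int, project: str = "UHAR0022") -> list[str]:
    """Return the shell commands for one case using `ntasks` MPI tasks."""
    case_dir = CASE_ROOT / f"2403_dev_hco_2.3-f10_{ntasks}core"
    commands = [
        f"./create_newcase --case {case_dir} --compset {COMPSET} "
        f"--res {RES} --run-unsupported --mach derecho --project {project}",
        f"cd {case_dir}",
        f"./xmlchange NTASKS={ntasks}",
        "./xmlchange NTHRDS=1",
        "./case.setup --reset",
        "# manual step: fill user_nl_cam with the HEMCO settings",
        "./case.build -v",
    ]
    if ntasks > 1:
        # The numactl workaround is only needed when running on >1 core.
        commands.append("# manual step: edit env_batch.xml to request ncpus=128")
    commands += [
        "./xmlchange RUN_STARTDATE=2016-01-01",
        "./xmlchange STOP_OPTION=nhours",
        "./xmlchange STOP_N=3",
        "./case.submit",
    ]
    return commands


if __name__ == "__main__":
    for n in (1, 2):
        print(f"### {n}-core case")
        print("\n".join(case_commands(n)))
```

Keeping the commands as plain strings makes it easy to paste them into a terminal one at a time while the manual edits are done in between.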

Debugging output is in cesm.log.* and organized per CPU (each line is prefixed with the task number, e.g., 0: or 1:).

The cprnc tool is very useful for checking whether two netCDF files match bit-for-bit. I use this alias in my .zshrc:

alias cprnc="/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"

Usage: cprnc <file1> <file2>
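When comparing many history files, the cprnc call can be wrapped in a small script. This is a minimal sketch, assuming that cprnc prints a summary line containing "files seem to be IDENTICAL" when every field matches. That marker string is an assumption that should be checked against the cprnc build in use, and the helper names are hypothetical.

```python
import subprocess

# Path taken from the alias above.
CPRNC = "/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"

# Assumption: cprnc's summary contains this phrase when the files match.
IDENTICAL_MARKER = "files seem to be IDENTICAL"


def report_is_identical(report: str) -> bool:
    """Return True if the cprnc report text claims the files are identical."""
    return IDENTICAL_MARKER in report


def compare(file1: str, file2: str, cprnc: str = CPRNC) -> bool:
    """Run `cprnc file1 file2` and interpret its standard output."""
    result = subprocess.run(
        [cprnc, file1, file2],
        capture_output=True,
        text=True,
        check=False,  # interpret the report text rather than the exit code
    )
    return report_is_identical(result.stdout)


# Example call (hypothetical paths to 1-core and 2-core history files):
# compare("run_1core/case.cam.h0.nc", "run_2core/case.cam.h0.nc")
```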

```
0: hcdebug: (edgar, i= 1 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0: -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 1.302642
0: 1.306328 1.306328 1.306328 1.306328 1.306328
0: 1.306328 1.306328 1.319497 1.210987 1.210987
0: 1.020795
0: hcdebug: (edgar, i= 2 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0: -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 1.361859
0: 1.361859 1.361859 1.361859 1.372795 1.372795
0: 1.372795 1.372795 1.391346 1.385491 1.385491
0: 1.020795
```

```
0: hcdebug: (edgar, i= 1 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0: -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 1.302642
0: 1.306328 1.306328
1: hcdebug: (edgar, i= 1 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
1: -9.9999998E+30 -9.9999998E+30 1.374258 1.210987 1.210987
1: 1.019067
0: hcdebug: (edgar, i= 2 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0: -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 1.361859
0: 1.361859 1.361859
1: hcdebug: (edgar, i= 2 ) -9.9999998E+30 1.405800 1.405800
1: 1.405800 1.405800 1.405477 1.385491 1.385491
1: 1.019067
```

```
0: hcdebug: writing out lvl-sfc at present dt
0: hcdebug: (i= 1 ) 0.000000000000000E+000 0.000000000000000E+000
0: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0: 0.000000000000000E+000 0.000000000000000E+000 2.544706480154566E-014
0: 2.170436195605955E-014 3.528874920258346E-017 0.000000000000000E+000
0: 1.880868963362995E-016 1.772452777726410E-015 0.000000000000000E+000
0: 3.712318333349080E-013 2.388847495132503E-014 3.953244476404797E-014
0: 0.000000000000000E+000 0.000000000000000E+000
```

```
0: hcdebug: (i= 1 ) 0.000000000000000E+000 0.000000000000000E+000
0: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
0: 0.000000000000000E+000 0.000000000000000E+000 2.544706480154566E-014
0: 2.170436195605955E-014 3.528874920258346E-017
1: hcdebug: (i= 1 ) 0.000000000000000E+000 1.439813645198005E-016
1: 1.356820567806388E-015 0.000000000000000E+000 2.841796369544945E-013
1: 2.487986842177408E-014 3.953244476404797E-014 0.000000000000000E+000
1: 0.000000000000000E+000
```
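Comparing hcdebug dumps by eye gets tedious once more than one task writes to the log. The sketch below reassembles the values for each hcdebug header from the per-task lines and reports the first index where two runs differ. It assumes that each task owns a contiguous chunk and that chunks join in ascending task order, which is consistent with the 10 + 9 split of the 19 values in the excerpts above but is not confirmed by the source. It also assumes one dump per header in the parsed text. The helper names are hypothetical, and since the printed values are rounded, a match here does not prove a bit-for-bit match.

```python
import re
from collections import defaultdict

LINE = re.compile(r"^\s*(\d+):\s*(.*)$")              # "<task>: <payload>"
HEADER = re.compile(r"hcdebug:\s*\((.*?)\)\s*(.*)$")  # "hcdebug: (<key>) <values>"


def parse_hcdebug(log_text):
    """Return {header: [values]} with per-task chunks joined in task order."""
    chunks = defaultdict(dict)  # header -> {task: [floats]}
    current = {}                # task -> list currently being filled
    for raw in log_text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue
        task, payload = int(match.group(1)), match.group(2)
        header = HEADER.search(payload)
        if header:
            key = " ".join(header.group(1).split())
            current[task] = chunks[key].setdefault(task, [])
            payload = header.group(2)
        elif payload.startswith("hcdebug:"):
            current.pop(task, None)  # some other hcdebug message, not data
            continue
        if task not in current:
            continue
        try:
            values = [float(token) for token in payload.split()]
        except ValueError:
            current.pop(task, None)  # unrelated log line from this task
            continue
        current[task].extend(values)
    return {
        key: [v for task in sorted(by_task) for v in by_task[task]]
        for key, by_task in chunks.items()
    }


def first_mismatch(a, b):
    """Return the first index where the sequences differ, or None."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    return None


# The "edgar, i= 1" dumps from the excerpts above.
SINGLE_CORE_LOG = """
0: hcdebug: (edgar, i= 1 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0: -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 1.302642
0: 1.306328 1.306328 1.306328 1.306328 1.306328
0: 1.306328 1.306328 1.319497 1.210987 1.210987
0: 1.020795
"""

TWO_CORE_LOG = """
0: hcdebug: (edgar, i= 1 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0: -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 1.302642
0: 1.306328 1.306328
1: hcdebug: (edgar, i= 1 ) -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
1: -9.9999998E+30 -9.9999998E+30 1.374258 1.210987 1.210987
1: 1.019067
"""

if __name__ == "__main__":
    one = parse_hcdebug(SINGLE_CORE_LOG)["edgar, i= 1"]
    two = parse_hcdebug(TWO_CORE_LOG)["edgar, i= 1"]
    print("lengths:", len(one), len(two))
    print("first differing index (0-based):", first_mismatch(one, two))
```

Tracking the current record per task means the parser still works when lines from different tasks are interleaved in cesm.log.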
