ESCOMP / HEMCO_CESM

CESM/CAM interface to modular HEMCO chemistry emissions module
1 stars 8 forks source link

Reproducibility issues in HEMCO_CESM and investigation #31

Open jimmielin opened 7 months ago

jimmielin commented 7 months ago

This issue thread serves to note the reproducibility issues in HEMCO within CESM2 which should eventually be fixed for: https://github.com/ESCOMP/CAM/issues/856

For the purposes of debugging HEMCO_CESM, it is suggested to use CAM-chem compsets (e.g., FCnudged, FCclimo2010, ...) beuse CAM-chem is known to be b4b reproducible and GEOS-Chem compsets are likely not. The responsibility of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit in restart, different MPI decomp, and different OpenMP threading scenarios.

Test/debug workflow

This setup will help debug the issues.

hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',

cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,

fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,

The /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc test config file only has CEDS with NO CO and NH3 with NO having a 1x1 gridded scale factor. This makes it easier to debug and much quicker to run.

Debugging output is in cesm.log.* and organized per CPU.

The cprnc tool is very useful to compare two netCDF files for bit-for-bit matches: I use this in my .zshrc

alias cprnc="/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"

Usage: cprnc <file1> <file2>

jimmielin commented 7 months ago

What I've found is that HCO_CO and HCO_NH3 match bit-for-bit (files are identical) between singlecore and mpi (2 cores), but not HCO_NO. This is because HCO_NO applies another field, EDGAR_TODNOX to it, and this field is somehow different in the two runs.

In singlecore's cesm.log for EDGAR_TODNOX field:

0:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.302642
0:    1.306328       1.306328       1.306328       1.306328       1.306328
0:    1.306328       1.306328       1.319497       1.210987       1.210987
0:    1.020795
0:  hcdebug: (edgar, i=           2 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.361859
0:    1.361859       1.361859       1.361859       1.372795       1.372795
0:    1.372795       1.372795       1.391346       1.385491       1.385491
0:    1.020795

In mpi, note how in both CPUs 0: and 1:, the first 5 grid boxes have the fill value in:

0:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.302642
0:    1.306328       1.306328
1:  hcdebug: (edgar, i=           1 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
1:  -9.9999998E+30 -9.9999998E+30   1.374258       1.210987       1.210987
1:    1.019067
0:  hcdebug: (edgar, i=           2 )  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30
0:  -9.9999998E+30 -9.9999998E+30 -9.9999998E+30 -9.9999998E+30   1.361859
0:    1.361859       1.361859
1:  hcdebug: (edgar, i=           2 )  -9.9999998E+30   1.405800       1.405800
1:    1.405800       1.405800       1.405477       1.385491       1.385491
1:    1.019067

For HCO_NO emissions at surface:

0:  hcdebug: writing out lvl-sfc at present dt
0:  hcdebug: (i=           1 )   0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  2.544706480154566E-014
0:   2.170436195605955E-014  3.528874920258346E-017  0.000000000000000E+000
0:   1.880868963362995E-016  1.772452777726410E-015  0.000000000000000E+000
0:   3.712318333349080E-013  2.388847495132503E-014  3.953244476404797E-014
0:   0.000000000000000E+000  0.000000000000000E+000

In the MPI configuration:

0:  hcdebug: (i=           1 )   0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
0:   0.000000000000000E+000  0.000000000000000E+000  2.544706480154566E-014
0:   2.170436195605955E-014  3.528874920258346E-017
1:  hcdebug: (i=           1 )   0.000000000000000E+000  1.439813645198005E-016
1:   1.356820567806388E-015  0.000000000000000E+000  2.841796369544945E-013
1:   2.487986842177408E-014  3.953244476404797E-014  0.000000000000000E+000
1:   0.000000000000000E+000

Note how the numbers in the first CPU match bit-for-bit but not in the second one. The values in i=1 should be contiguous starting from CPU 0 to CPU 1, as the values in the only CPU in the single core result.

The domain decomposition for 10x15 domain is: Global domain size: 24 longitudes, 19 latitudes (24x19)

For 2 CPUs, the domain is chopped horizontally in the middle where each CPU covers all longitudes but half of the latitudes.

The print outputs are directly from HCO_GetPtr or from %Emis%Val so they're from HEMCO upstream code and before it hits the regridder. So I feel there is a bug somewhere upstream in HEMCO or in Map_A2A but I have not looked into the rabbit hole of hco_readlist_mod and hcoio_read_std_mod and map_a2a yet.

I've tried shifting the longitude edges for the HEMCO grid in hco_esmf_grid.F90 as having -187.5 as a starting value (it's equal to -180 minus 15/2) seemed sketchy to me. But either using -180.0 as the leftmost edge or -187.5 only makes numerical differences (expected, since the grid is now different to HEMCO) but does not make the difference between 1 core and 2 cores disappear.