Updates and fixes to history zeroing and flushing

rgknox commented 6 months ago

Description:

Some multi-dimensioned history variables were not being reported correctly because they were not being zero'd. These variables were likely incremented over a patch-cohort loop, but being incremented on top of an uninitialized number. This set of changes seeks to make the flushing and zeroing more consistent across all of the different groups of history variables.

This also reduces confusion by moving flushing and zeroing to occur inside the routine that performs the updates. These flushing and zeroing calls still happen redundantly in the host models, should be removed, but are not a liability to generating incorrect results. These calls are just superfluous.

Fixes #1207

Collaborators:

@samsrabin

Expectation of Answer Changes:

Checklist

If this is your first time contributing, please read the CONTRIBUTING document.

All checklist items must be checked to enable merging this pull request:

Contributor

[ ] The in-code documentation has been updated with descriptive comments
[ ] The documentation has been assessed to determine if updates are necessary

Integrator

[ ] FATES PASS/FAIL regression tests were run
[ ] Evaluation of test results for answer changes was performed and results provided

Documentation

Technical Note update:
User's Guide update:

Test Results:

CTSM (or) E3SM (specify which) test hash-tag:

CTSM (or) E3SM (specify which) baseline hash-tag:

FATES baseline hash-tag:

Test Output:

samsrabin commented 5 months ago

Is there an addition or change we can/should make to the CTSM test list to avoid a bug like this in the future? E.g., I first noticed this with FATES_PATCHAREA_AP—is it not on any output file in the tests?

rgknox commented 5 months ago

There are other things to do with cleaning up the history files, but they are out of scope, so I think this PR is feature complete and ready for review. I ran a one-off test and it passsed ERS_Ld10.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdAllVars

rgknox commented 5 months ago

@samsrabin we periodically need to update the FatesColdAllVars test to make sure it has all of the variables. We have this script: tools/FindInactive.py that trawls the code and updates the list, seems like we need to update it.

glemieux commented 5 months ago

regression testing underway

glemieux commented 5 months ago

Regression testing on cheyenne against fates-sci.1.76.3_api.35.1.0-ctsm5.2.007 shows that most of the tests are returning B4B or the only diffs are in the fill values. That said, the following tests are return DIFFS:

FAIL ERP_Ld9.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdAllVars BASELINE fates-sci.1.76.3_api.35.1.0-ctsm5.2.007: DIFF
FAIL ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdPRT2 BASELINE fates-sci.1.76.3_api.35.1.0-ctsm5.2.007: DIFF
FAIL ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro BASELINE fates-sci.1.76.3_api.35.1.0-ctsm5.2.007: DIFF
FAIL ERS_Lm12.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesFireLightningPopDens BASELINE fates-sci.1.76.3_api.35.1.0-ctsm5.2.007: DIFF
FAIL SMS_Lm1.f45_f45_mg37.I2000Clm60FatesSpCruRsGs.derecho_intel.clm-FatesColdBasic BASELINE fates-sci.1.76.3_api.35.1.0-ctsm5.2.007: DIFF
FAIL SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro BASELINE fates-sci.1.76.3_api.35.1.0-ctsm5.2.007: DIFF

I need to take a look at these to see if they make sense, but there two that jump out at me at first glance. The ERP_Ld9.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdAllVars and ERS_Lm12.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesFireLightningPopDens tests returns huge DIFFS in FATES_MORTALITY_PF due to the new values being huge (i.e. 1e+38). Spitfire is set to mode 1 in the AllVars test and tree damage is on as well. Spitfire is set to mode 4.

rgknox commented 5 months ago

The dynamics time-scale complex variables were not being flushed and zero'd correctly, this was corrected, the MORTALITY_PF variable is an example of that

glemieux commented 5 months ago

Per our conversation, I went back and looked and realized I misread the cprnc output. I keep thinking the forth line below is the current commit, but it's actually the baseline:

FATES_MORTALITY_PF   (lon,lat,fates_levpft,time)  t_index =      2     2
       12996    39744  (    33,     2,     1,     1) (    23,    21,    10,     1) (    33,     2,     1,     1) (    33,     2,     1,     1)
                12996   1.225934268493151E-01          2.799785209598797E-03 1.1E+38  1.225934268493151E-01 1.0E+00  1.225934268493151E-01
                12996   1.050000000000002E+38          1.050000000000002E+38          1.050000000000002E+38          1.050000000000002E+38
                39744  (    33,     2,     1,     1) (    33,     2,     1,     1)
           avg abs field values:    6.526914797498996E-02    rms diff: 1.0E+38   avg rel diff(npos):  1.0E+00
                                    1.050000000000126E+38                        avg decimal digits(ndif):  0.0 worst:  0.0
RMS FATES_MORTALITY_PF               1.0500E+38            NORMALIZED  2.0000E+00

So this is indeed corrected as you explained. I think we can integrate.

For postertity, the test location on derecho: /glade/u/home/glemieux/scratch/ctsm-tests/tests_pr1208-fates

NGEET / fates