ai2cm / fv3net

explore the FV3 data for parameterization
MIT License

WIP: SHiELD-wrapper prognostic run #2350

Closed spencerkclark closed 11 months ago

spencerkclark commented 1 year ago

This is a work in progress PR adding a SHiELD-wrapper-based prognostic run. I will add more details as I clean it up / flesh it out. Prognostic run tests run locally in the prognostic_run_shield image, but I have not hooked them into CI just yet. Some debugging still needs to be done to get all the tests passing reproducibly.

spencerkclark commented 1 year ago

> Some debugging still needs to be done to get all the tests passing reproducibly.

The puzzling finding after today is that the mere act of loading an ML model into memory (here) leads to non-reproducible results in the predictor test case with the SHiELD wrapper (here we have made sure to comment out everything related to the model that happens downstream of loading it, i.e. making predictions and setting state, so those do not appear to be the issue). I'm assuming something similar is going on in the nudging case with respect to loading data from reference netCDF files, though I have not had a chance to verify that.

The notable thing is that a baseline simulation (i.e. one that does not use Python at all for loading in data) produces consistently reproducible results with the same base model configuration and prognostic run time loop.

Tomorrow I will try to work on putting together a minimal example that illustrates this with a simpler Python run file, which may make iteration a little easier and faster.

spencerkclark commented 1 year ago

I was able to put together a minimal example, but it in and of itself was not particularly more illuminating than the full test cases. Given that answers were sensitive to ancillary runtime actions, my hypothesis was that there were possibly uninitialized variables in SHiELD. A way to test this is to try compiling with compiler flags that force real, integer, and logical fields to be initialized with specific values.

We did this in two ways. One was initializing fields with values that we would expect to cause problems in the model, i.e. NaNs for reals, large values for integers, and true values for logicals:

```
FFLAGS += -finit-real=nan -finit-integer=9999999 -finit-logical=true
```

This indeed caused the model to crash, suggesting that uninitialized values existed and were influencing the simulation with the configuration I am using.

The other was initializing fields with the values we might implicitly assume uninitialized memory takes, i.e. zeros for reals and integers, and false values for logicals:

```
FFLAGS += -finit-real=zero -finit-integer=0 -finit-logical=false
```

With these compiler flags, all the prognostic run tests run in a reproducible fashion. The conclusion is therefore that uninitialized values in SHiELD are the culprit. The next step is tracking down where those uninitialized values are.
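The effect of the two flag settings can be sketched with a toy Python analog (the function, array size, and values here are purely illustrative, not SHiELD code): an array is "initialized" to the compiler-supplied fill value, a loop assigns every level except the last, and the unassigned element still participates in a downstream computation.

```python
import math

def remap_step(fill_value, km=63):
    """Toy analog of a loop that assigns every level except the lowermost.

    fill_value plays the role of gfortran's -finit-real setting: the value
    each element holds before any explicit assignment.
    """
    gam = [fill_value] * km       # "uninitialized" storage
    for k in range(km - 1):       # the loop stops one level short
        gam[k] = 0.5
    return sum(gam)               # ...but the lowermost value is still used

nan_result = remap_step(math.nan)   # -finit-real=nan analog: NaN propagates loudly
zero_result = remap_step(0.0)       # -finit-real=zero analog: benign and repeatable
```

With NaN initialization the missing assignment poisons the result (the analog of the model crashing); with zero initialization the unassigned element takes a fixed, benign value, so repeated runs agree, which is exactly the behavior observed in the prognostic run tests.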

This strategy was inspired by the debug mode flags in the configure files of the fv3gfs-fortran repo; there initializing with nonsense values does not crash the model, suggesting that uninitialized fields are not an issue in that code base. These flags are notably not used in debug mode in the SHiELD build system, so it is possible some uninitialized fields have entered unnoticed. It is possible this issue was also playing a role in https://github.com/NOAA-GFDL/SHiELD_build/issues/25.

spencerkclark commented 1 year ago

It appears that the uninitialized values crop up specifically when fv_core_nml.kord_wz = -9. If I switch this to fv_core_nml.kord_wz = 9 then simulations complete in the case that I initialize all real fields with NaNs. The sign of kord_wz controls whether the mode is set to -2 or -3 in the cs_profile subroutine when remapping the vertical velocity (see here); apparently the -3 mode case is the problematic one.

This provides a way forward for these tests, but it would still be good to get to the bottom of the issue. I'm not sure if this is just some flakiness introduced when running at C12 resolution and 63 vertical levels, or if there is possibly something else going on. The motivation for using fv_core_nml.kord_wz = -9 is that it was the value of the parameter used in the PIRE simulations, which I was hoping to codify in test and reference configurations here in fv3net.
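For reference, a minimal namelist fragment for the workaround, assuming the sign-to-mode mapping described above (the surrounding namelist contents are omitted):

```
&fv_core_nml
  kord_wz = 9   ! PIRE used -9; the negative sign selects the iv = -3 branch
/
```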

spencerkclark commented 1 year ago

I've confirmed locally that switching to fv_core_nml.kord_wz = 9 allows the prognostic run tests with the SHiELD wrapper to pass reproducibly with the PIRE-like (modulo this one parameter) configuration. I will proceed with cleaning up this PR next week, and dig into the fv_core_nml.kord_wz issue on the side, since it seems unrelated to any code in fv3net or SHiELD-wrapper.

spencerkclark commented 1 year ago

I was bored while charging my car this morning and decided to carefully work through the code in cs_profile. The issue is that when iv == -3, the array gam is not assigned a value for the lowermost model level (see here), prior to being used here. I've confirmed this is the source of the issue with some print debugging, and then by manually setting gam(:,km) to an arbitrary value and re-running a test case when compiling with -finit-real=nan; the test case completes when it would otherwise crash. I will post a more descriptive issue in the NOAA-GFDL/GFDL_atmos_cubed_sphere repository on Monday.
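The bug pattern and the manual workaround can be sketched in a toy Python stand-in (the names, array size, and fill values are illustrative, not the actual cs_profile code): gam is only assigned for levels 1..km-1 in the iv == -3 branch, so the lowermost level retains its pre-assignment value, and explicitly setting it removes the NaN when compiling with -finit-real=nan.

```python
import math

def cs_profile_sketch(apply_workaround, km=63):
    """Toy stand-in for the iv == -3 branch, where gam is only filled
    for levels 1..km-1 before being used at all levels."""
    gam = [math.nan] * km        # mimic compiling with -finit-real=nan
    for k in range(km - 1):      # the assignment loop skips the last level
        gam[k] = 0.5
    if apply_workaround:
        gam[km - 1] = 0.5        # manually assign the lowermost level
    return all(math.isfinite(v) for v in gam)

cs_profile_sketch(False)  # NaN survives at level km: the run would crash
cs_profile_sketch(True)   # all values finite: the run completes
```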

lharris4 commented 1 year ago

Great find @spencerkclark. This could also tie into some other mysterious crashes we have seen. Will check the revised posting ASAP.

spencerkclark commented 1 year ago

Thanks @lharris4!

For anyone else curious, further discussion of the non-reproducibility issue will take place here: NOAA-GFDL/GFDL_atmos_cubed_sphere#301.

spencerkclark commented 11 months ago

This PR has been split / cleaned up into #2365, #2376, and #2377. I think it is safe to close at this point.