Current `diag_table` writing procedure prevents reproducible restarts

spencerkclark commented 2 years ago

Introduction

After some debugging, I traced the restart reproducibility issue we were experiencing to how fv3config writes diagnostics tables. When writing a diag_table, FV3GFS requires a two-line header. The first line gives a name to the simulation -- it gets attached as metadata to some output files. The second line is a date. How we treat this date has an important influence on how the model behaves, because of the code here:

    call diag_manager_init (TIME_INIT=date)

!----- always override initial/base date with diag_manager value -----

    call get_base_date ( date_init(1), date_init(2), date_init(3), &
                         date_init(4), date_init(5), date_init(6)  )

!----- use current date if no base date ------

    if ( date_init(1) == 0 ) date_init = date

!----- set initial and current time types ------

    Time_init  = set_date (date_init(1), date_init(2), date_init(3), &
                           date_init(4), date_init(5), date_init(6))

    Time_atmos = set_date (date(1), date(2), date(3),  &
                           date(4), date(5), date(6))

Time_init and Time_atmos are key variables in initializing the model:

Time_init is intended to represent the time that a simulation was first initialized -- in the case of a segmented run, this is the time of the start of the entire simulation, not the start of the segment.
Time_atmos is the current time of the run -- in the case of a segmented run, this is the time at the start of the segment, which is read in from the third line of the coupler.res restart file.

Time_init is set based on the value of date_init. What the code above shows is that date_init -- while it can be set earlier in coupler_main.F90 based on its value in the coupler.res restart file -- is always overridden by the date in the diagnostics table (this is what the get_base_date subroutine does).
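To make the override concrete, here is a minimal Python sketch that mirrors the branch in the Fortran excerpt above. The function name `resolve_time_init` is hypothetical and only illustrates the logic; it is not part of FV3GFS or fv3config.

```python
from datetime import datetime


def resolve_time_init(diag_table_date, current_date):
    """Mirror the excerpt: the diag_table base date wins unless its year is 0.

    Both arguments are (year, month, day, hour, minute, second) tuples.
    """
    date_init = tuple(diag_table_date)
    if date_init[0] == 0:  # "use current date if no base date"
        date_init = tuple(current_date)
    return datetime(*date_init)


current = (2016, 8, 1, 3, 0, 0)

# diag_table carries the original start date: Time_init stays at the start.
fixed_init = resolve_time_init((2016, 8, 1, 0, 0, 0), current)

# diag_table carries all zeros: Time_init falls back to the current date.
fallback_init = resolve_time_init((0, 0, 0, 0, 0, 0), current)
```

Whatever value coupler.res held for the initialization date never enters this function, which is the point of the paragraph above.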

What `fv3config` currently does

When writing the diag_table in the case of a segmented run, fv3config will use the date from the third line of the coupler.res restart file (the current date rather than the initial date): https://github.com/ai2cm/fv3config/blob/544eaf1bc6f1c4617cd8ee6bd3298136ed180f4c/fv3config/config/derive.py#L74-L91
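For readers unfamiliar with the file, the following sketch shows how the two dates could be pulled out of coupler.res. The helper name `read_coupler_res_dates` and the sample file contents are assumptions made for illustration (the layout is based on the description in this issue: initialization date on the second line, current date on the third); this is not the code in fv3config.

```python
def read_coupler_res_dates(text):
    """Return (initialization_date, current_date) as six-integer tuples.

    Assumes line 2 holds the initialization date and line 3 the current
    date, each starting with year, month, day, hour, minute, second.
    """
    lines = text.splitlines()

    def first_six_ints(line):
        return tuple(int(token) for token in line.split()[:6])

    return first_six_ints(lines[1]), first_six_ints(lines[2])


# Illustrative file contents (an assumption, not copied from a real run).
EXAMPLE = """\
     2        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  2016     8     1     0     0     0        Model start time:   year, month, day, hour, minute, second
  2016     8     1     3     0     0        Current model time: year, month, day, hour, minute, second
"""

initialization_date, current_date = read_coupler_res_dates(EXAMPLE)
```

The current behavior corresponds to writing `current_date` into the diag_table header.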

Why does this lead to irreproducible restarts?

There may be other downstream impacts of this issue, but the first place it shows up is in the computation of the solar hour. The solar hour depends on Model%phour, the amount of time between the initialization time of the simulation and the current time. For segment lengths that evenly divide 24 hours, this may not be a problem; however, it is where short restart reproducibility tests -- which preferably have segment lengths of less than a day -- get tripped up first. Model%phour and the initialization date are also used in the surface cycling, and the initialization time is used to set the value of a random seed for a subgrid cloud scheme; both could be other sources of irreproducibility in segmented runs.

Basically the model expects the initialization time to stay constant, but we are changing it in every segment of segmented runs.
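A small worked example of the mismatch, treating Model%phour simply as the elapsed hours since the initialization time (an assumption based on the description above, not a transcription of the Fortran). The dates and the 3-hour segment length are made up for illustration.

```python
from datetime import datetime, timedelta


def hours_since_init(time_init, time_now):
    """Elapsed hours between the initialization time and the current time."""
    return (time_now - time_init).total_seconds() / 3600.0


start = datetime(2016, 8, 1, 0)
segment = timedelta(hours=3)
now = start + segment + timedelta(minutes=15)  # 15 minutes into segment 2

# Continuous run: the initialization time is the start of the simulation.
continuous = hours_since_init(start, now)

# Segmented run with the current behavior: the initialization time has
# been reset to the start of the second segment.
segmented = hours_since_init(start + segment, now)
```

Here `continuous` is 3.25 hours while `segmented` is 0.25 hours, so any quantity derived from the elapsed time differs between the two runs at the same model time.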

Proposed solution

The natural solution, which requires no changes to the Fortran model, is to address this in fv3config by using the date from the second line of coupler.res instead of the third line when writing the diag_table. This way the simulation's initialization date is always propagated through the coupler.res restart files until the end of the run. The initialization date would then be constant for the diag_manager, which I think is more in line with what it expects as well.
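As a rough sketch of the proposal, the two-line header would be built from the initialization date rather than the current date. The function `diag_table_header` is a hypothetical helper, and the space-separated date format is an assumption for illustration; it is not the fv3config implementation.

```python
def diag_table_header(name, initialization_date):
    """Build the two-line diag_table header: a name, then the base date.

    initialization_date is (year, month, day, hour, minute, second).
    """
    year, month, day, hour, minute, second = initialization_date
    return f"{name}\n{year} {month} {day} {hour} {minute} {second}\n"


# The same initialization date is reused in every segment, so the header
# is identical no matter which segment is being configured.
init = (2016, 8, 1, 0, 0, 0)
header_segment_1 = diag_table_header("example_run", init)
header_segment_2 = diag_table_header("example_run", init)
```

Because every segment reads the same second line of coupler.res, the base date seen by the diag_manager no longer changes between segments.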

For reference, GFDL runscripts manage this by having the initialization date defined as a constant, which is always used when writing the diag_table to the run directory in subsequent segments.

Implications

Run segment lengths must now be whole multiples of the radiation timestep. The radiative transfer code must run during the first timestep of a segment to initialize some variables. If it does not, a segmentation fault will occur in the physics. Previously, the radiative transfer code would always run on the first timestep of a segment, regardless of whether it meant that the radiation would be called sooner than it would have been in a continuous run (this was another side effect of always setting the initialization time to the start of the segment).
We likely will want to handle configuring to start from coarsened restart files in a specialized manner. This differs from a segmented run in that we do not necessarily want to set the initialization date to what it is in the coupler.res file (because this initialization time corresponds to what it was in the fine-resolution run). If we set it to what it was in the coupler.res file, we would be limited to starting the model only from certain timesteps due to the issue above. The solution in this case would be to do what we do now, i.e. set the date in diag_table to the time associated with the set of restart files -- but only when first initializing the run -- and then follow the reproducible restart pathway for the rest of the simulation.
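The first implication could be guarded by a simple pre-run check along these lines. The function name and the idea of validating in the configuration step are assumptions; the sketch only encodes the requirement that each segment boundary falls on a radiation call.

```python
def segment_aligns_with_radiation(segment_seconds, radiation_timestep_seconds):
    """True if the segment length is a whole multiple of the radiation timestep."""
    return segment_seconds % radiation_timestep_seconds == 0


# 3-hour segments with hourly radiation: every segment starts on a radiation step.
three_hour_ok = segment_aligns_with_radiation(3 * 3600, 3600)

# 90-minute segments with hourly radiation: the second segment would start
# halfway between radiation calls.
ninety_minute_ok = segment_aligns_with_radiation(90 * 60, 3600)
```

With these inputs `three_hour_ok` is `True` and `ninety_minute_ok` is `False`.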

spencerkclark commented 2 years ago

> We likely will want to handle configuring to start from coarsened restart files in a specialized manner.

It looks like we already do in fv3net (see here), so maybe this isn't too much of a concern. There we use the force_date_from_namelist flag in the Fortran model, which triggers a different code path in fv3config: the diag_table base date is set to the current date, which we provide through the namelist. If the current date cannot be found in the namelist, fv3config sets the date to all zeros, which equivalently signals to FV3GFS that this is the first segment of the run (see here).
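The decision described here, combined with the proposed fix, might be sketched as follows. The function name, the dictionary layout, and the namelist keys `coupler_nml` and `current_date` are assumptions for illustration; only the force_date_from_namelist flag and the all-zeros fallback come from the description above.

```python
def choose_diag_table_date(namelist, coupler_res_initialization_date):
    """Pick the base date to write into the diag_table header.

    namelist is a nested dict; the key names used here are assumed.
    """
    coupler_nml = namelist.get("coupler_nml", {})
    if coupler_nml.get("force_date_from_namelist", False):
        current_date = coupler_nml.get("current_date")
        if current_date is None:
            # All zeros signals that this is the first segment of the run.
            return (0, 0, 0, 0, 0, 0)
        return tuple(current_date)
    # Reproducible-restart pathway: keep the original initialization date.
    return tuple(coupler_res_initialization_date)


init = (2016, 8, 1, 0, 0, 0)
forced = choose_diag_table_date(
    {"coupler_nml": {"force_date_from_namelist": True,
                     "current_date": [2016, 8, 5, 0, 0, 0]}},
    init,
)
missing = choose_diag_table_date(
    {"coupler_nml": {"force_date_from_namelist": True}}, init
)
restart = choose_diag_table_date({"coupler_nml": {}}, init)
```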

brianhenn commented 2 years ago

Well it makes sense then that we would get a different outcome in recent C384 runs (crash vs. no crash) by switching the segment length from one day to 3 hours.

Just trying to understand the scope of the bug -- basically it is that at the end of the first segment when the restart files are written the restarts themselves are actually fine, but when the second segment is initialized the diag table date error is introduced, such that the radiation scheme (among other things) will run differently in the second segment such that your test of restarts written at the end of the second segment fails against those from a single run spanning both?

spencerkclark commented 2 years ago

> Just trying to understand the scope of the bug -- basically it is that at the end of the first segment when the restart files are written the restarts themselves are actually fine, but when the second segment is initialized the diag table date error is introduced, such that the radiation scheme (among other things) will run differently in the second segment such that your test of restarts written at the end of the second segment fails against those from a single run spanning both?

Yes, exactly.
