Closed spencerkclark closed 2 years ago
We likely will want to handle configuring to start from coarsened restart files in a specialized manner.
It looks like we already do in fv3net (see here), so maybe this isn't too much of a concern. There we use the force_date_from_namelist flag in the Fortran model, which forces a different code path in fv3config that sets the diag_table base date to the current date, which we provide through the namelist. If the current date cannot be found in the namelist, fv3config sets the date to all zeros, which equivalently signals to FV3GFS that this is the first segment of the run (see here).
Well it makes sense then that we would get a different outcome in recent C384 runs (crash vs. no crash) by switching the segment length from one day to 3 hours.
Just trying to understand the scope of the bug -- basically it is that at the end of the first segment when the restart files are written the restarts themselves are actually fine, but when the second segment is initialized the diag table date error is introduced, such that the radiation scheme (among other things) will run differently in the second segment such that your test of restarts written at the end of the second segment fails against those from a single run spanning both?
Yes, exactly.
Introduction
After some debugging, I traced the restart reproducibility issue we were experiencing to how fv3config writes diagnostics tables. When writing a diag_table, FV3GFS requires a two-line header. The first line gives a name to the simulation -- it gets attached as metadata to some output files. The second line is a date. How we treat this date has an important influence on how the model behaves, because of the code here:

Time_init and Time_atmos are key variables in initializing the model:
- Time_init is intended to represent the time that a simulation was first initialized -- in the case of a segmented run, this is the time of the start of the entire simulation, not the start of the segment.
- Time_atmos is the current time of the run -- in the case of a segmented run, this is the time at the start of the segment, which is read in from the third line of the coupler.res restart file.

Time_init is set based on the value of date_init. What the code above shows is that date_init -- while it can be set earlier in coupler_main.F90 based on its value in the coupler.res restart file -- is always overridden by the date in the diagnostics table (this is what the get_base_date subroutine does).
What fv3config currently does
When writing the diag_table in the case of a segmented run, fv3config will use the date from the third line of the coupler.res restart file (the current date rather than the initial date): https://github.com/ai2cm/fv3config/blob/544eaf1bc6f1c4617cd8ee6bd3298136ed180f4c/fv3config/config/derive.py#L74-L91
Why does this lead to irreproducible restarts?
There may be other downstream impacts of this issue, but the first place it shows up is in the computation of the solar hour. It depends on the amount of time between the initialization time of the simulation and the current time, Model%phour. For segment lengths that evenly divide 24 hours, this may not be a problem; however, it is where short restart reproducibility tests -- which preferably have segment lengths of less than a day -- get tripped up first.

Model%phour and the initialization date are also used in the surface cycling, and the initialization time is used to set the value of a random seed for a subgrid cloud scheme, which could be other sources of irreproducibility in segmented runs.

Basically the model expects the initialization time to stay constant, but we are changing it in every segment of segmented runs.
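To make the effect on the elapsed time concrete, here is a minimal Python sketch. It is not FV3GFS code: the function name, the run start date, and the 3-hour segment length are assumptions chosen for illustration. It only mimics the idea that a quantity like Model%phour is the number of hours between the initialization time and the current time.

```python
from datetime import datetime, timedelta


def elapsed_hours(init_time, current_time):
    """Hours between initialization and current time (an analogue of Model%phour)."""
    return (current_time - init_time).total_seconds() / 3600.0


# Hypothetical run: three consecutive 3-hour segments.
run_start = datetime(2016, 8, 1, 0)
segment_length = timedelta(hours=3)
segment_starts = [run_start + i * segment_length for i in range(3)]

# Initialization time held constant across segments (what the model expects).
constant_init = [elapsed_hours(run_start, start) for start in segment_starts]

# Initialization time reset to the current time in every segment
# (the effect of writing the current date into the diag_table header).
reset_init = [elapsed_hours(start, start) for start in segment_starts]

print(constant_init)
print(reset_init)
```

With a constant initialization time the elapsed hours grow from segment to segment, whereas resetting the initialization time makes every segment start at zero elapsed hours, so anything derived from that elapsed time can differ from a single continuous run.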
Proposed solution
The natural solution, which requires no changes to the Fortran model, is to address this in fv3config by using the date from the second line of coupler.res instead of the third line when writing the diag_table. This way the simulation's initialization date is always propagated through the coupler.res restart files until the end of the run. The initialization date would then be constant for the diag_manager, which I think is more in line with what it expects as well.

For reference, GFDL runscripts manage this by having the initialization date defined as a constant, which is always used when writing the diag_table to the run directory in subsequent segments.
Implications
coupler.res file (because this initialization time corresponds to what it was in the fine-resolution run). If we set it to what it was in the coupler.res file, we would be limited to starting the model only from certain timesteps due to the issue above; the solution in this case would be to do what we do now, i.e. set the date in diag_table to the time associated with the set of restart files -- but only when first initializing the run -- and then follow the reproducible restart pathway for the rest of the simulation.
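The date selection described above could be sketched roughly as follows. This is not fv3config's actual implementation: the function names, the use_current_date flag, and the sample coupler.res contents are hypothetical. The sketch assumes the second line of coupler.res holds the initialization date and the third line holds the current date, as described in this issue.

```python
def parse_coupler_res(text):
    """Return (initialization_date, current_date) as 6-tuples of ints.

    Assumes line 2 is the model start time and line 3 is the current model time.
    """
    lines = text.strip().splitlines()

    def parse_date(line):
        # The first six whitespace-separated fields are
        # year, month, day, hour, minute, second.
        return tuple(int(field) for field in line.split()[:6])

    return parse_date(lines[1]), parse_date(lines[2])


def diag_table_header(name, coupler_res_text, use_current_date=False):
    """Build the two-line diag_table header.

    use_current_date=True mimics the special case of first initializing a run
    from a set of restart files; otherwise the initialization date is used so
    that it stays constant across segments.
    """
    init_date, current_date = parse_coupler_res(coupler_res_text)
    date = current_date if use_current_date else init_date
    return name + "\n" + " ".join(str(value) for value in date)


# Hypothetical coupler.res contents for illustration.
example = """
     2        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  2016     8     1     0     0     0        Model start time:   year, month, day, hour, minute, second
  2016     8     1     3     0     0        Current model time: year, month, day, hour, minute, second
"""

later_segment_header = diag_table_header("example_run", example)
first_segment_header = diag_table_header("example_run", example, use_current_date=True)
print(later_segment_header)
print(first_segment_header)
```

Keeping the special case behind an explicit flag makes it clear that the current date is used only once, when the run is first initialized, and that every later segment follows the reproducible pathway.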