ai2cm / fv3config

Manipulate FV3GFS run directories

Current `diag_table` writing procedure prevents reproducible restarts #147

Closed spencerkclark closed 2 years ago

spencerkclark commented 2 years ago

Introduction

After some debugging, I traced the restart reproducibility issue we were experiencing to how fv3config writes diagnostics tables. When writing a diag_table, FV3GFS requires a two-line header. The first line gives a name to the simulation -- it gets attached as metadata to some output files. The second line is a date. How we treat this date has an important influence on how the model behaves, because of the code here:

    call diag_manager_init (TIME_INIT=date)

    !----- always override initial/base date with diag_manager value -----

    call get_base_date ( date_init(1), date_init(2), date_init(3), &
                         date_init(4), date_init(5), date_init(6)  )

    !----- use current date if no base date ------

    if ( date_init(1) == 0 ) date_init = date

    !----- set initial and current time types ------

    Time_init  = set_date (date_init(1), date_init(2), date_init(3), &
                           date_init(4), date_init(5), date_init(6))

    Time_atmos = set_date (date(1), date(2), date(3),  &
                           date(4), date(5), date(6))

Time_init and Time_atmos are key variables in initializing the model:

Time_init is set based on the value of date_init. What the code above shows is that date_init -- while it can be set earlier in coupler_main.F90 based on its value in the coupler.res restart file -- is always overridden by the date in the diagnostics table (this is what the get_base_date subroutine does).
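The override logic in the excerpt above can be paraphrased in a few lines of Python. This is only a sketch of the control flow shown in the Fortran, not code from FV3GFS or fv3config; the function name `resolve_init_date` and the tuple representation of dates are made up for illustration.

```python
from typing import Tuple

# (year, month, day, hour, minute, second), mirroring the six-element Fortran arrays
Date = Tuple[int, int, int, int, int, int]


def resolve_init_date(diag_table_date: Date, current_date: Date) -> Date:
    """Hypothetical paraphrase of the date_init logic in the Fortran excerpt.

    The base date from the diag_table always wins; only a zero year
    falls back to the current date.
    """
    if diag_table_date[0] == 0:
        return current_date
    return diag_table_date


current = (2016, 8, 1, 3, 0, 0)
# Whatever date sits in the diag_table becomes the initialization date ...
print(resolve_init_date((2016, 8, 1, 0, 0, 0), current))
# ... unless its year is zero, in which case the current date is used.
print(resolve_init_date((0, 0, 0, 0, 0, 0), current))
```

Note that nothing in this logic consults the initialization date stored in coupler.res, which is why the date written to the diag_table matters so much.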

What fv3config currently does

When writing the diag_table in the case of a segmented run, fv3config will use the date from the third line of the coupler.res restart file (the current date rather than the initial date): https://github.com/ai2cm/fv3config/blob/544eaf1bc6f1c4617cd8ee6bd3298136ed180f4c/fv3config/config/derive.py#L74-L91
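To make the distinction between the two dates concrete, here is a minimal sketch of reading both from a coupler.res file. It is not the fv3config implementation linked above; the function name `read_coupler_res_dates` and the example file contents are hypothetical, and the layout (a calendar line, then the initialization date, then the current date) is assumed from the description in this issue.

```python
import re
from typing import List, Tuple

# Hypothetical coupler.res contents; the layout is an assumption for illustration.
EXAMPLE_COUPLER_RES = """\
     2        (Calendar: no_calendar=0, thirty_day_months=1, julian=2, gregorian=3, noleap=4)
  2016     8     1     0     0     0        Model start time:   year, month, day, hour, minute, second
  2016     8     1     3     0     0        Current model time: year, month, day, hour, minute, second
"""


def read_coupler_res_dates(text: str) -> Tuple[List[int], List[int]]:
    """Return (initialization_date, current_date) as six-integer lists."""
    lines = text.splitlines()

    def six_ints(line: str) -> List[int]:
        # Take the first six integers on the line; the trailing label has no digits.
        return [int(value) for value in re.findall(r"\d+", line)[:6]]

    # Second line: initialization date. Third line: current date.
    return six_ints(lines[1]), six_ints(lines[2])


initialization_date, current_date = read_coupler_res_dates(EXAMPLE_COUPLER_RES)
print("initialization:", initialization_date)
print("current:", current_date)
```

In a segmented run the two dates coincide only in the first segment; in every later segment the third line has moved on while the second line has not.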

Why does this lead to irreproducible restarts?

There may be other downstream impacts of this issue, but the first place it shows up is in the computation of the solar hour. The solar hour depends on Model%phour, the amount of time between the initialization time of the simulation and the current time. For segment lengths that are whole multiples of 24 hours, this may not be a problem; however, it is where short restart reproducibility tests -- which preferably have segment lengths of less than a day -- get tripped up first. Model%phour and the initialization date are also used in the surface cycling, and the initialization time is used to set the value of a random seed for a subgrid cloud scheme; both could be other sources of irreproducibility in segmented runs.

Basically the model expects the initialization time to stay constant, but we are changing it in every segment of segmented runs.
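A small arithmetic sketch of the effect on the elapsed time: it compares the hours since initialization, taken modulo 24, when measured from the true initialization time versus from an initialization time that has been reset to the start of the second segment. This is an illustration only, under the simplifying assumption that the elapsed hours enter the solar hour modulo 24 and nothing else changes; it is not the FV3GFS calculation, and the dates and function names are made up.

```python
from datetime import datetime, timedelta

TRUE_INIT = datetime(2016, 8, 1, 0)  # made-up start of the whole run


def elapsed_hours(init: datetime, now: datetime) -> float:
    """Hours between an initialization time and the current time."""
    return (now - init).total_seconds() / 3600.0


def hour_offsets(segment_hours: int, hours_into_segment: int = 1):
    """Elapsed hours modulo 24 with the true vs. the reset initialization time."""
    segment_start = TRUE_INIT + timedelta(hours=segment_hours)
    now = segment_start + timedelta(hours=hours_into_segment)
    with_true_init = elapsed_hours(TRUE_INIT, now) % 24
    with_reset_init = elapsed_hours(segment_start, now) % 24
    return with_true_init, with_reset_init


print(hour_offsets(3))   # 3-hour segments: the two values differ
print(hour_offsets(24))  # 24-hour segments: the two values agree
```

Under that assumption, the mismatch is hidden only when the initialization time shifts by a whole number of days.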

Proposed solution

The natural solution, which requires no changes to the Fortran model, is to address this in fv3config by using the date from the second line of coupler.res instead of the third line when writing the diag_table. This way the simulation's initialization date is always propagated through the coupler.res restart files until the end of the run. The initialization date would then be constant for the diag_manager, which I think is more in line with what it expects as well.

For reference, GFDL runscripts manage this by having the initialization date defined as a constant, which is always used when writing the diag_table to the run directory in subsequent segments.
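The difference between the current and proposed behavior can be sketched as follows. The helper `diag_table_header` is hypothetical (it is not an fv3config function), the run name and dates are invented, and the header format (a name line followed by six space-separated integers) is an assumption based on the two-line header described in the introduction.

```python
from typing import Sequence


def diag_table_header(name: str, base_date: Sequence[int]) -> str:
    """Build a two-line diag_table header: run name, then the base date."""
    if len(base_date) != 6:
        raise ValueError("base_date must have six components")
    return name + "\n" + " ".join(str(value) for value in base_date) + "\n"


# Made-up dates for a run split into three 3-hour segments.
initialization_date = [2016, 8, 1, 0, 0, 0]
segment_current_dates = [
    [2016, 8, 1, 0, 0, 0],
    [2016, 8, 1, 3, 0, 0],
    [2016, 8, 1, 6, 0, 0],
]

# Proposed: every segment writes the initialization date (second line of coupler.res).
proposed = [
    diag_table_header("example_run", initialization_date)
    for _ in segment_current_dates
]
# Current: every segment writes its own current date (third line of coupler.res).
current_behavior = [
    diag_table_header("example_run", date) for date in segment_current_dates
]

print(len(set(proposed)), "distinct header(s) with the proposed approach")
print(len(set(current_behavior)), "distinct header(s) with the current approach")
```

With the proposed approach the header is identical in every segment, which matches what the GFDL runscripts achieve with a constant initialization date.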

Implications

spencerkclark commented 2 years ago

We likely will want to handle configuring to start from coarsened restart files in a specialized manner.

It looks like we already do in fv3net (see here), so maybe this isn't too much of a concern. There we use the force_date_from_namelist flag in the Fortran model, which forces a different code path in fv3config that sets the diag_table base date to the current date, which we provide through the namelist. If the current date cannot be found in the namelist, fv3config sets the date to all zeros, which equivalently signals to FV3GFS that this is the first segment of the run (see here).
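The namelist code path described above might look roughly like this. It is a sketch of the described behavior rather than the fv3config source: the function name is hypothetical, and the `coupler_nml` and `current_date` keys are assumptions about how the namelist is represented as a nested dictionary.

```python
from typing import Any, Dict, List

# A zero year tells the model to fall back to the current date.
ZERO_DATE = [0, 0, 0, 0, 0, 0]


def base_date_from_namelist(namelist: Dict[str, Any]) -> List[int]:
    """Return the namelist's current date, or all zeros if it is absent."""
    coupler_nml = namelist.get("coupler_nml", {})
    return list(coupler_nml.get("current_date", ZERO_DATE))


with_date = {
    "coupler_nml": {
        "force_date_from_namelist": True,
        "current_date": [2016, 8, 1, 0, 0, 0],
    }
}
without_date = {"coupler_nml": {"force_date_from_namelist": True}}

print(base_date_from_namelist(with_date))
print(base_date_from_namelist(without_date))
```

The all-zeros result ties back to the `date_init(1) == 0` check in the Fortran excerpt at the top of the issue.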

brianhenn commented 2 years ago

Well it makes sense then that we would get a different outcome in recent C384 runs (crash vs. no crash) by switching the segment length from one day to 3 hours.

Just trying to understand the scope of the bug -- basically it is that at the end of the first segment when the restart files are written the restarts themselves are actually fine, but when the second segment is initialized the diag table date error is introduced, such that the radiation scheme (among other things) will run differently in the second segment such that your test of restarts written at the end of the second segment fails against those from a single run spanning both?

spencerkclark commented 2 years ago

> Just trying to understand the scope of the bug -- basically it is that at the end of the first segment when the restart files are written the restarts themselves are actually fine, but when the second segment is initialized the diag table date error is introduced, such that the radiation scheme (among other things) will run differently in the second segment such that your test of restarts written at the end of the second segment fails against those from a single run spanning both?

Yes, exactly.