ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/
Other
308 stars 312 forks source link

onset_counter variable on (all?) restart files is not compatible with subsequent model runs using a 20 minute time step #1163

Closed olyson closed 4 years ago

olyson commented 4 years ago

Brief summary of bug

The onset_counter variable on restart files is not compatible with subsequent model runs using a 20 minute time step and eventually causes a negative leaf carbon error.

General bug information

CTSM version you are using: cesm2_2_beta05

Does this bug cause significantly incorrect results in the model's science? Yes (in the sense that the model crashes)

Configurations affected: CLM50 BGC configurations that use a 20-minute timestep and a restart file from a model run that had a 1/2 or hour time step.

Details of bug

A negative leaf carbon balance error (leaf carbon exceeds the threshold of -60) was reported by Joe McInerney for this case that uses a 20 minute time step:

/glade/u/home/joemci/cesm/cesm2_2_beta05_cases/b.cesm2_2_beta05.B1850.f09_g17_ts20spl0macmic2mg1.001

I replicated this error with my own case:

/glade/work/oleson/cesm2_2_beta05/cime/scripts/clm50_cesm22beta05_1deg_GSWP3V1_20mintstep_1850

The error occurs at year 1, month 2, day 5 and is associated with broadleaf deciduous shrub.

On the other hand the same fully-coupled run with a time step of 15 minutes ran for one year successfully.

I ran a land-only 1deg 1850 CLM5-BGC-Crop simulation with a 20 minute time step and it crashed with a similar leaf carbon balance error, interestingly nearly at the same time as the fully coupled simulation. A land-only simulation with a 15 minute time step ran for 40 years successfully. So I proceeded with debugging the land-only simulation with a 20 minute time step.

It appears that the source of the problem is that the onset_counter from the restart file is always a multiple of 1800 seconds but the model time step is 1200 seconds. There is code in CNPhenology that relies on the onset_counter being zero. The onset_counter is decremented by the model time step. As such, the onset_counter will never be zero and the onset period will never end and the transfer growth rates (in particular, leafc_xfer_to_leafc) will never be set to zero. Leaf carbon becomes increasingly negative until the negative carbon threshold is reached and the run aborts. There may be a similar problem due to code that depends on offset_counter being equal to dt, which doesn't seem as if it could occur in this case.

Important details of your setup / configuration so we can reproduce the bug

--compset I1850Clm50BgcCrop with ATM_NCPL set to 72.

billsacks commented 4 years ago

Nice work tracking this down, @olyson ! Looking quickly at the code, it looks like the fix may be to change the checks of equality to 0 to instead check something like if (onset_counter < dtime/2) - i.e., if we're within 1/2 time step of onset_counter reaching 0. Same for checks of offset_counter exactly equal to 0. Something similar could probably be done with the checks of onset_counter and offset_counter exactly equal to dtime... I'm not sure of this, but maybe this could similarly check something like if (abs(onset_counter - dtime) < dtime/2)??? @olyson is this similar to what you are thinking?

ekluzek commented 4 years ago

@olyson thanks for finding this problem. So is the solution to keep track of what the time-step was for the restart file, and then adjust to the new time-step on startup? I like @billsacks solution as well.

billsacks commented 4 years ago

In thinking about the right solution, I think it's also worth considering a requirement that may be coming in relatively soon (for WRF coupling): we may need to be able to accommodate dynamically varying time step lengths. Something like what I suggested feels like it should work for that, although it may need a little more thought. One thing I'm especially not sure about is the code that checks for equality with dtime: If the code in those conditionals absolutely must be done, and absolutely must be done only once, then we'd need to be very careful about how these conditionals look; if, on the other hand, these are more along the lines of "nice to have, but if you skip it or do it twice it's not a big deal" then we can be less rigorous.

olyson commented 4 years ago

Another possible fix I was thinking of is to reset the offset_counter to the closest multiple of the new model time step after the initial file is read in. Then no code in CNPhenology would need to be changed. But yes I was planning also in investigating changing code in CNPhenology as you have suggested. Need to make sure we don't change answers for cases where dtime=1800.

ekluzek commented 4 years ago

I also wonder if there are other things that are sensitive to the time step as well. Another thing that we could do is to not allow a time-step change to happen for a restart or branch -- but only for a startup. Any restart variables that have seconds in the units might be implicitly sensitive to the time-step. In other cases they might just be wrong for time-step 0, but adjust fine afterwards.

And I wonder if we should add a test for this problem as well? The waccmx_offline test does change the time-step, but it must be one that works out. So I really thing we should add a new test that changes the timestep here.

billsacks commented 4 years ago

@olyson your proposed fix also sounds like a reasonable solution. Part of what led me to my suggestion is that checking exact equality of real values always makes me nervous. I think it's actually okay in this case because time step is always an integer number of seconds (but stored as a double), and integers are exactly representable as doubles up to pretty large values. But it still raises my blood pressure to see exact equality comparisons on reals like this, so I was thinking we could change it for the sake of my long-term health :-)

Also @olyson and @ekluzek I should clarify my last comment: my understanding of the possible WRF requirement is that time step could actually change mid-run, not just at restart, so we'd need code that is robust to this without relying on any adjustments happening at restart.

@ekluzek Yes, I would bet a lot of money that we have other code that is sensitive to time step, but I don't think it has to do with having seconds in the units. Instead, I think it will mostly be cases like this one, where we calculate some length of time and then have some counter that is incremented / decremented, where the assumption is that time step stays fixed over that period. (The irrigation code that I wrote is a likely culprit, for example: it irrigates over 6 hours.) And then a test for changing time step mid-run makes sense at some point, but probably not until we think it will actually work.

ekluzek commented 4 years ago

@billsacks thanks for that clarification. If it changes mid-run, we'd want to make sure this code is put into a subroutine that could be called at anytime. And of course that would be good to do anyway.

And yes, my Numerical Analysis class from long ago, said to NEVER use an exact equality for floating point numbers -- always use an inequality. So that is a best practice, and yes we should eliminate exact equalities -- and not just for @billsacks blood pressure. But, that's not a bad thing either. :-) You are right that you can sometimes get away with it, but still shouldn't be something that we do.

It does seem like the example you give would be found for restart variables with time in the units. I'm just trying to think of a general way of looking at restart variables that might have a problem. Looking at variables with time in their units would be one way to do it. There's likely derivatives that are off as well (and they'd have time in their units), but likely don't matter if it's only off for a single time-step. But, we could take a closer look at any restart variables with time in their units and see if there's a problem with them.

olyson commented 4 years ago

I guess I'm now leaning toward changing code in CNPhenology for this particular problem. Putting in code to adjust the counter probably wouldn't end up being very visible, although it might be advantageous if it could be written generically to handle any other possible variables that need to be adjusted due to a change in the time step. I can easily see if changing CNPhenology actually fixes the problem and can easily test to make sure answers don't change.

billsacks commented 4 years ago

(This comment just has some general, long-term thoughts... it isn't very important for the immediate issue at hand.)

It does seem like the example you give would be found for restart variables with time in the units. I'm just trying to think of a general way of looking at restart variables that might have a problem. Looking at variables with time in their units would be one way to do it. There's likely derivatives that are off as well (and they'd have time in their units), but likely don't matter if it's only off for a single time-step. But, we could take a closer look at any restart variables with time in their units and see if there's a problem with them.

Okay, now I see what you're saying. At first I was thinking that your rule would also capture things like fluxes because they have time in their units (mass per unit time), but now I see that's not what you meant. And I like your idea of examining the restart file to try to identify possibly problematic variables. Even so: I think the problem of changing time steps is going to mainly arise with variables that are expressed as some number of time steps, or something similar to that. So these wouldn't have time in their units, but rather would be unitless.

I realized that a major culprit here is the accumulation routines (in accumulMod). I believe that these won't work right if you change the time step in the middle of an accumulation interval, but I haven't looked closely.

I actually think that the offending code in this issue is close to what we DO want in general: variables expressed in seconds, rather than as a number of time steps. It's just that we need to be careful about not assuming that counters get down to exactly 0 time remaining (or dtime remaining, etc.). The alternative is to store things as number of time steps, but then we need to make sure to adjust every variable like that whenever the time step changes; my feeling is that that will be more error-prone (forgetting to adjust some variable), but it could be worth exploring both possibilities for a couple of examples to see what feels most robust and least error-prone.

olyson commented 4 years ago

I've tried the suggested fixes by @billsacks, they work fine for a 40-year land-only simulation and a 1-year fully-coupled simulation, both using a 20 minute time step. The code mods are here: /glade/work/oleson/cesm2_2_beta05/cime/scripts/clm50_cesm22beta05_1deg_GSWP3V1_20mintstep_1850/SourceMods/src.clm/CNPhenologyMod.F90 A land-only simulation with and without the fix was bfb for a 30 minute time step.

billsacks commented 4 years ago

Great, thanks for the update @olyson . Do you plan to make similar changes for offset_counter?

olyson commented 4 years ago

Well, I completely forgot about the presence of offset_counter! Lost my train of thought over the weekend apparently. So, yes, I'll test similar changes for offset_counter.

olyson commented 4 years ago

Ok, I've made similar changes for offset_counter. I repeated the testing I reported on previously and everything still looks ok.

billsacks commented 4 years ago

Thanks @olyson ! Do you feel this is ready to come to master? If so, I'm happy to do a final review of a PR whenever you get a chance to open one.

olyson commented 4 years ago

I'm running with SourceMods on the cesm2_2_beta05 tag, so let me create a branch off of master and run a few tests from the test suite, then I'll issue a PR.

AnushaNTNU commented 1 year ago

May i know where and how the timestep changes have to be made?

olyson commented 1 year ago

I think you can find the changes here:

https://github.com/ESCOMP/CTSM/commit/08e5a90e715a2134a82be20c79c7c9278a743c15