ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/
Other
306 stars 308 forks source link

Put in sanity checks for longwave radiation and other atmospheric forcing fields #96

Closed ekluzek closed 5 years ago

ekluzek commented 6 years ago

Andy Fox < afox > - 2014-02-18 19:50:11 -0700 Bugzilla Id: 1927 Bugzilla CC: erik, oleson, rfisher, sacks, thoar,

I am trying to move the CLM-DART framework from CLM4.0 to CLM4.5. We rely on using an ensemble of DATMs produced by CAM, these are couple history files with 6hrly values and 2 degree resolution and this has been working well with ICN compsets with a large number of model tags over the last couple of years.

I’m now working with clm tag 4_5_62. Using an ICN compset still causes no issues, however, moving either CLM4.5CN or CLM4.5BGC compsets is causing long wave balance errors for some of the ensemble members. I don’t know what fraction of the different DATMs cause errors, and which don’t, but it is the case that multiple different ones cause the error. I have been testing now with a single instance.

The following case /glade/p/work/afox/cases/clm4_5_62_pmo_member_23

With cesm.log.140218-191607 in /glade/scratch/afox/clm4_5_62_pmo_member_23/run

has an error that reads as

1: Opened file ./clm4_5_62_pmo_member_23.clm2.h0.2000-03.nc to write 1 15: errsoi incorporated into sensible heat in SLakeTemperature: c, (W/m^2):66916 7.812500000000000E-003 15: BalanceCheck: longwave energy balance error nstep = 5139 point =156436 imbalance = 163.744375 W/m2 15: clm model is stopping 15: ENDRUN: called without a message string 15:Abort(1) on node 15 (rank 15 in comm -1006632951): application called MPI_Abort(comm=0xC4000009, 1) - process 15

There is an initial warning about a sensible heat balance error from the lake code (line 1003 of clm4_5_62/models/lnd/clm/src/clm4_5/biogeophys/SLakeTemperatureMod.F90), then an endrun call from BalanceCheck.F90 with longwave radiation. Its not immediately apparent if these are definitely actually related.

This is 100% replicable, but replicating the error is somewhat complex as you have to set the user_nl_datm correctly set the stream files to point to the correct forcing files.

We have a script that will do this for single instance case that will fail with this error

/glade/p/work/afox/DART/models/clm/shell_scripts/CESM1_1_1_setup_pmo_test

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2014-08-28 15:09:40 -0600

Andy, is this still an issue, or have you worked around it? If it's still an issue, we can return this to major importance.

ekluzek commented 6 years ago

Andy Fox < afox > - 2014-08-28 16:10:29 -0600

Bill, I created a surface file with no lakes (thanks to your help) and have used that successfully. However, this isn't really a sustainable way forward for other users of DART with CLM4.5 which I know the DART folks would like to support.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2014-08-28 16:53:47 -0600

Okay, thanks for the reminder, Andy. I'm returning this to major importance.

ekluzek commented 6 years ago

Erik Kluzek < erik > - 2014-11-20 10:56:45 -0700

Andy can you try either the latest code clm4_5_1_r096 or add in the bug fix for 1717 and see if it works for you now?

ekluzek commented 6 years ago

Tim Hoar < thoar > - 2015-11-03 14:47:22 -0700

(In reply to Erik Kluzek from comment #4)

Andy can you try either the latest code clm4_5_1_r096 or add in the bug fix for 1717 and see if it works for you now?

The bug fix for 1717 was put in LakeHydrologyMod.F90 in clm4_5_1_r096 and does not have the last two lines that bug 1717 indicates would be a good idea.

Things apparently got moved around a bit for CLM4_5 in cesm_1_2_1, in that LakeHydrologyMod.F90 seems to be SLakeHydrologyMod.F90 - which is substantially bigger.

I did my best to add the fix to /glade/u/home/thoar/cesm1_2_1/SourceMods/src.clm/src/clm4_5/biogeophys/SLakeHydrologyMod.F90

I did a clean_build and then a build, verified that the new SLakeHydrologyMod.F90 was used and I it appears to have had no effect.

I am still getting longwave balance errors, but my error message is slightly different than Andy's in that his pretty clearly indemnified LakeTemperature, mine does not:

63306 45:(shr_stream_set) size of filename = 1 63307 45:(shr_stream_set) filename = /glade/p/cesmdata/cseg/inputdata/lnd/clm2/ndepdata/fndep_clm_hist_simyr1849-2006_1.9x2.5_c100 428.nc ... 66589 45: BalanceCheck: longwave energy balance error nstep = 76 point =157575 imbalance = 4.556614 W/m2 66590 45: clm model is stopping 66591 45: ENDRUN:BalanceCheck: longwave energy balance error 66592 45:Abort(1) on node 45 (rank 15 in comm -1006632708): application called MPI_Abort(comm=0xC40000FC, 1) - process 15 66593 45:INFO: 0031-306 pm_atexit: pm_exit_value is 1. 66594 INFO: 0031-251 task 45 exited: rc=1 66595 ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 45

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2015-11-03 14:49:46 -0700

Adding Keith Oleson in case he has any insights.

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-03 15:56:57 -0700

The only thing I can say at the moment is that it doesn't look like Zack's original suggested bug fix for 1717 was put into the code. I.e.,

"I found one error that is not just a typo but also shows up in a rarely exercised portion of the code (most likely a sudden very intense snowstorm-- ~8 cm/hr -- preceded by warmer weather) in SLakeHydrology, where the top lake layer temperature is likely being set incorrectly. I haven't tested this, but after looking it over, I think line 794 should be changed from "t_lake(c,1) = t_lake(c,1) - heatrem/(cpliqdenh2odz_lake(c,1))", to "t_lake(c,1) = tfrz + heatrem/(cpliqdenh2odz_lake(c,1))". So you could make another low-priority bug report for that."

The next comment after that is from Erik regarding a longwave balance error and then suggested fixes for that from Sean.

I don't know if the above from Zack is related to Tim's error though. But I guess you could try it. I will check with Erik/Sean to see why Zack's suggestion wasn't put in. Maybe Sean's changes took care of Zack's suggestion somehow.

ekluzek commented 6 years ago

Tim Hoar < thoar > - 2015-11-03 16:08:15 -0700

(In reply to Keith Oleson from comment #7)

The only thing I can say at the moment is that it doesn't look like Zack's original suggested bug fix for 1717 was put into the code. I.e.,

"I found one error that is not just a typo but also shows up in a rarely exercised portion of the code (most likely a sudden very intense snowstorm-- ~8 cm/hr -- preceded by warmer weather) in SLakeHydrology, where the top lake layer temperature is likely being set incorrectly. I haven't tested this, but after looking it over, I think line 794 should be changed from "t_lake(c,1) = t_lake(c,1) - heatrem/(cpliqdenh2odz_lake(c,1))", to "t_lake(c,1) = tfrz + heatrem/(cpliqdenh2odz_lake(c,1))". So you could make another low-priority bug report for that."

The next comment after that is from Erik regarding a longwave balance error and then suggested fixes for that from Sean.

I don't know if the above from Zack is related to Tim's error though. But I guess you could try it. I will check with Erik/Sean to see why Zack's suggestion wasn't put in. Maybe Sean's changes took care of Zack's suggestion somehow.

I need some clarification, sorry ...

Are you saying I should back out the 20 or so lines from Sean that constitute the fix implemented in clm4_5_1_r096 in favor of Zack's one-liner: t_lake(c,1) = tfrz + heatrem/(cpliqdenh2odz_lake(c,1)) ?

I already tried Sean's solution - and it did not help.

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-03 16:14:25 -0700

No, I was suggesting you could just substitute in Zack's one-liner. Leave Sean's code alone.

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-03 20:46:47 -0700

Tim, If you have a case I can clone easily I can take a closer look at this. It will probably have to wait until next week though as we are trying to diagnose problems with CLM5 within the coupled model this week.

Keith

ekluzek commented 6 years ago

Tim Hoar < thoar > - 2015-11-03 21:50:32 -0700

(In reply to Keith Oleson from comment #10)

Tim, If you have a case I can clone easily I can take a closer look at this. It will probably have to wait until next week though as we are trying to diagnose problems with CLM5 within the coupled model this week.

Keith

my case directory is: /glade/p/work/thoar/cases/clm45_121_freerun2 the run directory is /glade/scratch/thoar/clm45_121_freerun2/run

I appreciate the help. I have several people who want to use the multi-instance capability with CLM45 - at least one is BGC, and right now I cannot get it to run when each instance has a separate stream file. My hunch is that (at least) one of the DATM input files (that work fine with CLM4.0) has a situation that causes CLM4.5BGC to blow up. I have not tracked down precisely which ensemble member has what I believe to be the culprit.

ekluzek commented 6 years ago

Andy Fox < afox > - 2015-11-04 10:05:05 -0700

I never managed to figure out which DATMS were causing the issues either - but from what I recall I decided that it was certainly more than a couple, but by no means all of them. I think I speculated it was maybe 10 out of 80 DATMS caused this issue?

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-09 10:07:14 -0700

I'd like to try to replicate the original balance error. Will this script that Andy pointed to work for me?

/glade/p/work/afox/DART/models/clm/shell_scripts/CESM1_1_1_setup_pmo_test

I assume I would check out clm4_5_62 and then modify the above script for my directories/environment.

ekluzek commented 6 years ago

Tim Hoar < thoar > - 2015-11-09 16:49:46 -0700

OK - I whittled it down to a single instance that (reliably) fails. After a newton bisection process to figure out which of the 80 instances was failing, it is this (drumroll please ...) number 2 ... sigh.

1) CASEROOT /glade/scratch/thoar/clm45bgc_121_pmo2/run 2) RUNDIR /glade/p/work/thoar/cases/clm45bgc_121_pmo2

please clone away. Thanks.

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-09 21:33:14 -0700

It looks like what is happening is that a negative sabg (solar absorbed) for a lake point is calculated by the albedo routine, because the incoming diffuse solar radiation is negative. The sun is slightly above the horizon (coszen > 0) so the model tries to use the negative incoming solar. This causes a large value for lake surface temperature which generates a large outgoing longwave radiation (e+16) and the model can't quite compensate.

I checked all of the incoming diffuse solar from all of the ensemble members and several of the ensemble members have negative values in at least one grid cell.

Not sure why these forcing files worked with CLM4.0. Maybe there is a check in the CLM4.0 source code that is not in CLM4.5 anymore. Or maybe the CLM4.0 lake model is more forgiving of a negative solar radiation at a single time step than the CLM4.5 lake model is. I assume that the same version of the datm is used to drive both CLM4.0 and CLM4.5.

I could see a few different fixes for this: 1) Simply zero out any negative values in the 6-hourly datm forcing files, 2) In the interpolation of 6-hourly values to 1/2 hourly in the datm code (or whatever time step CLM is using), zero out negative values, 2) Put a check into the SurfaceAlbedo code to zero out negative values before computing an albedo.

It seems to me like the best solution is 1). Although maybe there should also be a check in CLM for negative solar radiation. I don't know why those forcing files would have negative solar radiation. Are those files generated by CAM?

I'll see if I can figure out why CLM4.0 works and CLM4.5 doesn't...it might be a useful exercise...

ekluzek commented 6 years ago

Tim Hoar < thoar > - 2015-11-10 09:53:50 -0700

Those files have origins in CESM/CAM and the CESM coupler. We do have to do some processing to concatenate them into annual files, but we don't do any 'math' on them, so I'm not sure where the negative values are coming from, and the assimilation doesn't screw with the diffuse solar (to my knowledge).

There are 1121 files of this nature. I'm not looking forward to the prospect of hunting down all those values and replacing them with zeros. Solutions 2 or 3 might be a bit faster given the timeframe I'm working with.

I'd just as soon try to figure out where those are read and replace the negative values at that point.

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-10 11:26:23 -0700

Ok, I will see if I can figure out how to zero out negative values when the datm reads them in.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2015-11-10 11:28:59 -0700

This discussion makes me feel like we should have some sanity checks in CLM (e.g., in atm2lndMod) that check for reasonable values of all of the forcing fields, so that problems like this can be caught more easily....

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-10 11:44:27 -0700

Yes, I agree. One problem with forcing fields that appears every once in a while is negative incoming longwave radiation from the atmosphere. This manifests itself as a longwave balance error in the urban model because the urban model has an internal check on longwave balance during the time step. It's not obvious to users what the problem really is. Establishing lower bounds on some fields is probably not too hard, e.g., incoming solar and longwave, humidity, precip should not be negative. Establishing upper bounds would be more challenging. Might also be a challenge for other fields such as temperature where we could be simulating climates far in the past and far in the future that have values that we would consider to be extreme relative to present day.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2015-11-10 11:55:38 -0700

We could tackle the easiest part first: make sure that things that should be non-negative truly are non-negative.

ekluzek commented 6 years ago

Andy Fox < afox > - 2015-11-10 13:40:38 -0700

Its great you guys have figured this out. Thanks so much for doing this Keith.

I must admit back in the day when I looked in to this I didn't concentrate on errors in the DATMs as they had worked just fine with CLM4.0...

ekluzek commented 6 years ago

Keith Oleson < oleson > - 2015-11-10 14:56:26 -0700

Ok, I have a couple of suggested options. They are really just to get you going with your specific CLM45 simulations, I don't recommend either as permanent bug fixes because I think we need a software engineer who is familiar with the datm code to take a look at this. And as Bill suggested we may want to implement a more robust approach to bounding all of the forcing data under any situation, uncoupled or coupled to an active atmosphere model.

The first is to modify this file as I've done in my SourceMods for your particular case:

/glade/p/work/oleson/cesm1_2_1/scripts/clm45bgc_121_pmo2/SourceMods/src.share/shr_strdata_mod.F90

My changes are marked with "!KO":

!KO SDAT%avFLB(n)%rAttr(:,i) = max(SDAT%avFLB(n)%rAttr(:,i),0._r8) !KO

This ensures that the solar radiation fields are greater than zero before the cosine normalization step to interpolate the 6-hourly data to half-hourly is performed. The cosine normalization is only done for solar fields so we don't have to check the field names here.

The second is to modify this file:

/glade/p/work/oleson/cesm1_2_1/scripts/clm45bgc_121_pmo2/SourceMods/src.share/shr_dmodel_mod.F90

Changes are: !KO if (sfldName == "a2x6h_Faxa_swndr" .or. sfldName == "a2x6h_Faxa_swvdr" .or. & sfldName == "a2x6h_Faxa_swndf" .or. sfldName == "a2x6h_Faxa_swvdf") then av%rattr(k,:) = max(av%rattr(k,:),0._r8) end if !KO

The advantage of this is that the check for negative solar radiation is done right after the data is read in. The disadvantage is that I'm looking for specific atm fields, the four incoming solar fields on your forcing files. NOTE that I'm only doing this here for the case where "pio_iotype" is not "iotype_std_netcdf", which is true for your particular case. The same thing would have to be done for the other situation.

I've tested both of these separately and they both work and get the simulation past the longwave balance error. Ideally we would probably want to identify the negative solar radiation as it was read in, stop the model, and flag this to the user so that the user knows they have spurious incoming solar values.

If anyone can suggest a better temporary fix, let me know, thanks.

ekluzek commented 6 years ago

Tim Hoar < thoar > - 2015-11-13 16:35:42 -0700

Keith, Bill, Andy;

Thank you very much. I tried your first method (modifying shr_strdata_mod.F90) and it got past the part that stymied me before. When I get back from PTO, I will spin up an 80 member CLM4.5 BGC run for 12 years, maybe more. (I have 12 years of DATM forcing).

Again - Thanks very much and I'll keep you posted.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2016-06-29 16:01:52 -0600

Looking back at this quickly, we want to do three things:

(1) CAM should have some checks to not output negative values for some fluxes

(2) DATM should have similar checks, which convert negative values to 0

(3) CLM should check incoming values for sanity, and die with a meaningful message if values are outside the physically possible range.

(3) has come up a number of times recently. We should at least do this.

ekluzek commented 6 years ago

Bill Sacks < sacks > - 2016-11-07 14:44:04 -0700

Renaming this based on what we still want to do.

ekluzek commented 5 years ago

The error checks for unphysical values has been put into place. We check for negative values that shouldn't be (solar, spec-humidity, LW) as well as values that are NaN. So closing this.