ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/
Other
296 stars 301 forks source link

Allow nlevgrnd less than nlevurb #674

Closed ekluzek closed 3 years ago

ekluzek commented 5 years ago

Brief summary of bug

There are some bits of code that address the nlevurb level of arrays that are dimensioned by nlevgrnd. Thus an implicit assumption that nlevgrnd >= nlevurb.

@olyson @barlage

General bug information

CTSM version you are using: ctsm1.0.dev031

Does this bug cause significantly incorrect results in the model's science? No Configurations affected: Non-standard configurations where you set nlevgrnd to be less than nlevurb. We want to use this for the NWP configuration where we want to use 4 levels with nlevurb being 5.

Details of bug

This dependence is especially true of the older simple building temperature. And we likely won't fix that. We should make sure the code aborts for that case though.

For newer cases, I think we can remove this dependence.

billsacks commented 5 years ago

@ekluzek I just reworded the title to make it sound more like an enhancement.

billsacks commented 5 years ago

From a quick look through the code, I think it will take some care to ensure that we find and fix all relevant places. For example, consider this loop:

https://github.com/ESCOMP/ctsm/blob/e96cce8cbfc2e00175e224849ea174145313cf01/src/biogeophys/SoilFluxesMod.F90#L375-L389

This loop seems to implicitly assume that nlevurb <= nlevgrnd (if nlevurb > nlevgrnd, not all urban layers would be handled currently). We'll need to be vigilant to find and fix any similar looping structures.

billsacks commented 5 years ago

@olyson can you please take this on at some point?

olyson commented 5 years ago

Ok. Start by branching off of master?

ekluzek commented 5 years ago

Yes, we want this to go onto master. So starting with latest master would be good.

olyson commented 4 years ago

I've gone through the code and made appropriate changes to remove this assumption (I think). I'm still getting bfb for a test case (with nlevgrnd>nlevurb; ERP_Ld3.f09_g17.I1850Clm50BgcCropCru.cheyenne_intel.clm-ciso). I'd like to try running with nlevgrnd<nlevurb to see if I've found all of the assumption instances. Is that possible with the code base that I'm using (ctsm1.0.dev035)? Can I just set nlevgrnd to, e.g., 4, somewhere, or is there an NWP configuration that I need to use?

billsacks commented 4 years ago

Thank you @olyson !

It's possible that you could just set nlevgrnd to 4, but the easiest thing to do may be to start with the NWP configuration and tweak that. The NWP configuration is supported starting with ctsm1.0.dev038, so you'd need to merge your branch up to that (hopefully fairly easy since it's just a few tags ahead of your starting point). Then you could use a compset like I2000Ctsm50NwpSpGswpGs (with any grid). Note that this is an SP case: in that version of the code, I believe NWP and BGC are incompatible. Then search the code for 5SL_3m and change it to actually use nlevgrnd = 4 rather than 5 (changes are needed in clm_varpar.F90 and initVerticalMod.F90).

Actually, probably it's a good idea to first reproduce that you get an error with this configuration from this code version without your changes, before verifying that your code changes make this problem go away. (Because I'm not sure exactly what configuration generated the error before.)

If any of this is unclear, I'm happy to help.

olyson commented 4 years ago

Thanks, that all makes sense, I'll give it a try.

olyson commented 4 years ago

I tried compset I2000Ctsm50NwpSpGswpGs with a 1deg global grid with ctsm1.0.dev038. This ran for 5 days. I then tried changing nlevsoi and nlevgrnd to 4 in clm_varpar.F90, and removed dzsoi(5)=1.0_r8 in initVerticalMod.F90. This simulation crashed right away with a bunch of large snow balance errors (mostly associated with glacier columns but some with vegetated landunits). I tried this with my merged code and got the same types of errors. I think I will investigate these large snow balance errors first in ctsm1.0.dev038 to see what the cause is.

olyson commented 4 years ago

Actually, looking more closely, the snow balance errors are only warnings (because DAnstep <= skip_steps) and the model is crashing because of a NaN passed to the coupler.

olyson commented 4 years ago

I got more information from debug mode. There is still a bounds problem in initVerticalMod.F90 where, e.g., col%z for urban columns is assumed to have vertical dimension of nlevurb (5), but is assigned a dimension of nlevgrnd (4). Since col%z and other similar arrays are global, one way to handle this is to assign col%z to spval for soil columns when nlevgrnd < nlevurb (something similar is done for col%z for urban points when nlevgrnd > nlevurb), although this seems like it has the possibility of propagating unnecessarily through the code and increase memory requirements because there are a great deal more soil than urban points. Another option might be to use explicit arrays for urban similar to how lakes are handled (zlake, dzlake) I had been focused on the biogeophysics code itself to handle nlevgrnd < nlevurb and hadn't noticed this fundamental problem.

olyson commented 4 years ago

Turns out there are a lot of arrays that are shared between urban points and other landunit types. I've chased these down and essentially modified them with arrays dimensioned by the maximum of nlevgrnd and nlevurb. This runs for a single point urban case and a single point veg case, as well as a global NWP case, all with nlevgrnd=nlevsoi=4 and nlevurb=5, but only for a cold start.
Doesn't work using an interpolated initial file because the variables associated with levtot (T_SOISNO, H2OSOI_LIQ, H2OSOI_ICE) are dimensioned by levtot = 4+5 = 9 when we need it to be 10, and the variables associated with levgrnd (LEVGRND_CLASS, LEVGRND_CLASS_p, COL_Z, COL_Z_p, SMP, HK) are dimensioned by levgrnd = 4 when we need it to be 5.
Although I can make it work by making a few more modifications to the restart routines. And ignore a couple of checks in the interpolation code that normally exit on special values for the level_class. In the end, although this is working in my test cases, this approach is pretty invasive in the code. The other approach, creating explicit arrays for urban, would be pretty invasive also though.

billsacks commented 4 years ago

From discussion with @olyson here is a path forward:

(1) Keith will rework some of his changes to introduce a new variable in the code and on the restart file, named something like nlev_shared, nlevmax, nlevmaxurbgrnd, or something like that. He can use this in place of a lot of places in the code that currently reference max0(nlevgrnd, nlevurb).

(2) Keith will share with me the test case he has been running to confirm that he can get nlevgrnd < nlevurb to work, and which also demonstrated problems with interpolation.

(3) I will add a test to the test suite that mimics (2).

(4) I will make any changes needed to get init_interp working in the case where nlevgrnd < nlevurb. Once this is done, the new test (3) should pass.

(5) I will rework code to remove duplication. This is particularly relevant in SoilTemperature, where on Keith's branch there are now separate loops over nlevgrnd and nlevurb that have similar or identical bodies. I will either merge these back into a single loop, or extract the bodies into shared subroutines in order to avoid having large blocks of duplicated code. This rework should preserve answers for the new test as well as standard tests in our test suite.

@olyson can you think of any relevant points from our discussion that I'm missing here?

olyson commented 4 years ago

I guess we'll have to think through how to ensure backward compatibility with restart files that won't have this new variable on it. I think I will use nlevmaxurbgrnd since I feel that is the most descriptive. Other than that, thanks for the nice summation.

olyson commented 4 years ago

I've substituted in nlevmaxurbgrnd for max0(nlevgrnd,nlevurb) everywhere. I've also changed bunch of arrays from nlevmaxurbgrnd back to nlevgrnd, testing indicated that the re-dimensioning wasn't needed. The case that fails with an finidat is here:

/glade/work/oleson/ctsm_nlevurbgrnd/cime/scripts/I2000Ctsm50NwpSpGswpGs_nlevurbgrnd

The error message is here:

/glade/scratch/oleson/I2000Ctsm50NwpSpGswpGs_nlevurbgrnd/run/cesm.log.8715431.chadmin1.ib0.cheyenne.ucar.edu.191004-135256

spvals or NaNs found in coordinate array where level_class /= ispval; this is currently unhandled ERROR in initInterpMultilevelInterp.F90 at line 435

Otherwise, it runs and restarts using a cold start.

olyson commented 4 years ago

I've now addressed 5) by reworking SoilTemperatureMod to eliminate duplicate code. The code is still bfb with ctsm1.0.dev038 with the NWP 5SL_3m configuration (cold start).

billsacks commented 3 years ago

@olyson - My sincere apologies that this has been sitting in a mostly-done state for the last year. I am finally returning to this and can do the last pieces: a final review, maybe a bit of final cleanup and (the biggest piece) getting init_interp working.

But before I do this, I wanted to check in to confirm that this is still wanted, and that the branch you pointed me to a year ago (https://github.com/olyson/ctsm/tree/nlevurbgrnd ) is still what we want.

olyson commented 3 years ago

No worries...Yes, we still want this and the branch should still apply, thanks.