ESCOMP / CISM-wrapper

Community Ice Sheet Model wrapper for CESM
http://www.cesm.ucar.edu/models/cesm2.0/land-ice/

ERS Test with CISM No-Evolve runs for 3 years at restart instead of 1 #83

Closed (Katetc closed 2 months ago)

Katetc commented 7 months ago

In the testing for the cismwrap_2_1_97 tag, the test ERS_D_Ly3.f09_g17_gris4.T1850Gg.derecho_intel.cism-noevolve fails on the base-restart comparison.

This happens because instead of restarting at year 3 and running for 1 year, the test restarts at year 3 and runs for 3 more years. The test then attempts to compare a year 0004 history file, but the final output is a year 0006 history file, so the comparison fails.

I looked into this for a while, and I have no idea why the test runs for 3 years at the restart. The STOP_N is set to 1 year. I've never seen CESM ignore this before. Other ERS tests all pass (though, notably, all other ERS tests have active CISM). Is there something about a TG compset that, when running with NoEvolve, ignores changes to STOP_N? So strange. I've gone ahead with making the 2_1_97 tag as this seems to be a test issue and not a CISM issue, but I'll make this issue to document it.

billsacks commented 7 months ago

I'm looking into this... I have some ideas. I want to do a little more testing, then will share my thoughts / findings.

billsacks commented 6 months ago

@Katetc and I discussed this a couple of weeks ago and I was supposed to post our findings and thoughts here... but then got pulled away to other things and am just getting back to this today. So, Kate, here's my best recollection of what we discussed... nothing new, but just putting our discussion into writing:

On the surface, the reason this test is newly failing is that it had been disabled until the most recent CISM-wrapper tag. (A few years ago, I think there were issues with creating the NUOPC configuration files for this test - see https://github.com/ESCOMP/CISM-wrapper/issues/60#issuecomment-930445923. These issues have been fixed, so Kate tried enabling this test, but then ran into the issue documented here.)

Going one level deeper: The problem here is that, with NUOPC/CMEPS, the expectation has been that, in noevolve mode, CISM won't ever be called in the run phase. This had been implemented for most compsets, but not for T compsets. I just opened a CMEPS PR that fixes this for T compsets: https://github.com/ESCOMP/CMEPS/pull/425. Kate, as noted in that PR, it would be great if you can confirm that this both fixes things for you (which I have already tested, but it wouldn't hurt to get a second test of it) and that other T compset tests still run and are bit-for-bit (which I have not tested... I'd be surprised if anything broke based on a read of the code, but it would be good to confirm that).

I'm not positive, but I think the reason this caused problems is: In noevolve mode, CISM is set up to not read a restart file by being told that this is an initial run rather than a restart run. (This is done because it doesn't expect to have a restart file to read, since it never executes the run phase.) That's fine if CISM's run phase is never called, but before the above CMEPS fix, CISM's run phase was being called for T compsets in noevolve mode. I think what happened in this case was: The system started up at the beginning of year 3, but CISM started back at the beginning of year 1 (because its time comes from its namelist file, which is pinned to the start of the initial run, not the restart run; it expects to get the restart time from its restart file, but in this case it wasn't reading a restart file). So when CISM first executed the run phase, it hit the loop saying "run until your time matches the run-to time from the driver". CISM saw its time as the start of year 1 and the run-to time as the start of year 4, so it ran 3 additional years after restart instead of the 1 that it should have run.
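To make the arithmetic in that explanation concrete, here is a minimal Python sketch of the "run until your time matches the run-to time" loop. It is only an illustration under stated assumptions: `years_run` and the variable names are hypothetical (not CISM or CMEPS code), the model is assumed to advance in one-year steps, and the years are the ones quoted above.

```python
def years_run(model_year: int, run_to_year: int) -> int:
    """Count the one-year steps taken until the model clock reaches run_to_year.

    Hypothetical stand-in for the "run until your time matches the
    run-to time from the driver" loop; not actual CISM code.
    """
    steps = 0
    while model_year < run_to_year:
        model_year += 1
        steps += 1
    return steps


# Values taken from the discussion above.
restart_year = 3                      # driver restarts at the start of year 3
stop_n = 1                            # STOP_N for the restart segment (years)
run_to_year = restart_year + stop_n   # driver's run-to time: start of year 4

# Case 1: the model clock matches the driver's restart time (the intended behavior).
expected_years = years_run(restart_year, run_to_year)

# Case 2: the model is told this is an initial run, so its clock comes from
# the namelist and starts back at year 1 (the suspected failure mode).
namelist_start_year = 1
actual_years = years_run(namelist_start_year, run_to_year)

print(f"expected: {expected_years} year(s), actual: {actual_years} year(s)")
```

With the clock at year 3 the loop takes a single step, but with the clock reset to year 1 it takes three, matching the "3 additional years instead of 1" described above. If the run phase is never called in noevolve mode, this loop is never reached at all.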

The path forward for testing that we discussed is:

Whew! That was a long explanation for a one-line fix!

Katetc commented 2 months ago

The issue was fixed in CMEPS PR #425 and brought into CISM-wrapper with the cismwrap_2_1_100 tag.