E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

Restart output partial averaging incorrect in ne16 F1850, possibly ne30. #757

Closed abigailgaddis closed 8 years ago

abigailgaddis commented 8 years ago

I ran v0.4-298-ga47f412 with the F 1850 C5 compset at ne16_g37 resolution.

This confluence page gives more detail about the bug hunting process and the model settings used.

I set restarts every 273 days to get the maximum number of days per 2 hour submit time on Titan. We noticed the model climatology was extremely cold. With the help of Charlie Zender, we saw that this was due to a pattern of sudden drops in the temperature that didn't seem physical. I correlated the drops in temperature with the restart date. The later in the month the restart occurs, the closer to zero the temperature in the monthly h0 output is.

Suspicion: partial averaging is not being recorded correctly in this version of the model. It may be due to averaging values of zero with good values.

This issue also occurs in daily average output if the restart time is set to 36 hours. Comparing a non-restarted 72 hour run to a run with a restart in the middle, the average temperature is exactly the same except on the restart day. Therefore, the zero-averaged data is likely not being read into the model; this looks like an output issue. Please see the confluence page or email me for more details that may make your bug hunting easier.

From some grepping in the code related to ndays, a possible culprit is in /ACME/components/cice/src/source/ice_init.F90

susburrows commented 8 years ago

@douglasjacobsen , Abby identified a bug that is breaking restarts, which appears to be related to fields in the CICE model. Could we re-assign this bug to you or someone else on the CICE team to follow up on?

douglasjacobsen commented 8 years ago

@susburrows I'll talk with people here about looking into it, but I'll assign myself for now as well. Thanks for the heads up!

abigailgaddis commented 8 years ago

Thanks @susburrows! I wanted to clarify, too: it's possible that it may not be the CICE - just the restart partial averaging of output. It's definitely in temperature output from the atmosphere. Soooo it could be cice as well as atm. But either way it's involved in the restart code.

douglasjacobsen commented 8 years ago

Actually, this compset is using the older CICE model, which I know nothing about. I'm going to assign this to @eclare108213 to look at, with the understanding that it might not be a CICE model issue.

rljacob commented 8 years ago

The version of CICE we inherited way back with cesm1_3_beta10 (now in components/cice) is known to not restart correctly so any compset using that model also won't restart correctly. No one in ACME is going to fix cice so you should switch to using components/mpas-cice in all your compsets.

abigailgaddis commented 8 years ago

Would this restart bug cause the air temperature monthly output to be 9K, on average, for the globe in restart months (read: suuuuuper crazy cold)? Or would the cice bug just show up in the exact restart test? We have both effects going on. When we ran with restarts at the first of the month, we did not see these drastic drops in air temperature.

kevans32 commented 8 years ago

Hey- As Abby alluded, I think we are mixing CICE bugs in this thread. Although the CICE had a failure with the restart test (bug 1), the bigger issue is another bug in the averaging of variables that go into the atmosphere history files (bug 2). Bug 2 is the bug Abby is referring to, and absolutely needs to be fixed if we want to do any runs with pre-v1. FYI we are not running with full sea ice, just prescribed ice. We are not even sure if this is a cice issue, it's just that grepping for restart points to cice. But in fact it's more likely in the creation of the h0 files when restarts happen in the middle of the month, wherever that happens. We think we have a workaround, but the code needs to also ensure that no one uses ndays for restart if that is creating garbage. It's hard to explain here, apologies. The confluence page linked above gives all the info.

douglasjacobsen commented 8 years ago

@susburrows I just reassigned you, since I'm not sure who this should be assigned to, but based on the comments it sounds like it should at least not be assigned to me or @eclare108213.

susburrows commented 8 years ago

OK. But it seems like a critical bug, and I don't have time to work on it right now, so it would be great if someone else is available who could look at it. @rljacob , any suggestions?

susburrows commented 8 years ago

Also tagging @philrasch

susburrows commented 8 years ago

maybe @gold2718 could help?

abigailgaddis commented 8 years ago

I removed the sea ice tag. I believe that is a separate known bug, given our discussion above. Clarification after further discussions/thought: the issue is going to be found in the output writing code, rather than the actual restart code.

Here's why: A restarted and non-restarted test simulation are bit for bit after restart (see the 3 day test under "Investigation"). When restarting in the middle of an output period (e.g. in the middle of a month with monthly output), the output file is being averaged/accumulated with zeroes for the days/hours from the previous run.
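
To make the suspected mechanism concrete, here is a minimal sketch in Python. It is not the CAM history code; the function name, the constant 250 K field and the 30-day month are all made up for illustration. It only assumes what is described above: the running sum is lost at restart, but the average is still taken over the full output period.

```python
def monthly_mean(daily_values, restart_day=None, restore_buffer=True):
    """Average daily values over one output period.

    restart_day: index of the first day run after a restart (None = no restart).
    restore_buffer: if False, the running sum is reset to zero at restart
    while the divisor still covers the whole period (the suspected bug).
    """
    total = 0.0
    for day, value in enumerate(daily_values):
        if restart_day is not None and day == restart_day and not restore_buffer:
            total = 0.0  # sum accumulated by the previous run is lost
        total += value
    return total / len(daily_values)


days = [250.0] * 30  # hypothetical constant 250 K field, 30-day month

print(monthly_mean(days))                                         # no restart
print(monthly_mean(days, restart_day=15, restore_buffer=False))   # mid-month restart
print(monthly_mean(days, restart_day=27, restore_buffer=False))   # late-month restart
```

With these made-up numbers the mean drops from 250 to 125 for the mid-month restart and to 25 for the late-month one, which matches the pattern of the monthly value getting closer to zero the later in the month the restart happens.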

mt5555 commented 8 years ago

I confirmed with Jim that the ACME ERS test (exact restart test) doesn't look at component history files, only coupler output. The fact that all our ERS tests pass is thus extra confirmation that this is confined to the partial averaging being done in the history.

@rljacob : can't we update our comparison tests to also compare history files?

rljacob commented 8 years ago

Yes. Also the ERT test does that and just needs to be configured with an ne16/mpas240 version of WYCL.

mt5555 commented 8 years ago

or an F compset, if that would be quicker?

jonbob commented 8 years ago

@rljacob I have the mapping files and config_grid changes ready for ne16/oQU240. We are planning on including that in a commit early or mid next week.

rljacob commented 8 years ago

The "testmods" capability can tell the atmosphere to output history more frequently for one of the existing F-cases.

cameronsmith1 commented 8 years ago

After the discussion on the atm concall today, I went and checked on my most recent run on titan. I could not find evidence of the error.

I was running with various job lengths (typically 40 days) and outputting monthly. In all my other runs I would have noticed this error if it applied to all fields, although I might miss it if it just applies to temperature. I am using ne30.

My most recent run was with v1.0.0-alpha.2-6-g6c6dbf4, so perhaps it was broken and then fixed?

Another possibility is that I am mostly running with output regridding turned on, so perhaps that avoids the bug somehow? (I am trying to check this, but Rhea is very slow.)

cameronsmith1 commented 8 years ago

I just found a simulation that didn't use regridding. I do not see the problem in that simulation either.

kevans32 commented 8 years ago

Philip- I just highlighted text in orange in this location: https://acme-climate.atlassian.net/wiki/display/ATM/Ensemble+Simulations+performed+to+document+and+evaluate+the+V0.1-V03+model+configuration, which outlines that running with inline interpolation is not recommended in CAM-SE. This was another bug we encountered along the way; this was a pre request hub issue and we didn't think anyone ran that way with ACME. I am not sure how to flag this issue for users, but the SE team will. So I'll have @abigailgaddis add that to request hub as well. I'll also ask @abigailgaddis to run the latest tag your way to verify there is no issue in the latest tag. Can you send her your latest setup, since she is set up to look at this?

abigailgaddis commented 8 years ago

@cameronsmith1 I noticed the bug only in 3D Temperature (T) and Z3. I did not see the bug in TS (surface temp), U (3D wind) or U10. Did you check Z3 and T? My intuition is that it's only in certain prognostic variables that are involved with restarts.

mt5555 commented 8 years ago

Regarding the inline interpolation bug, I think that was fixed with @gold2718 's phys grid work:

https://github.com/ACME-Climate/ACME/issues/600

kevans32 commented 8 years ago

Thanks Mark! I did not track that down.

cameronsmith1 commented 8 years ago

I think I experienced bug #600 in December, and haven't had problems since it was fixed.

My script for running my configuration is:

/lustre/atlas/world-shared/cli112/pjcs/for_abigail/run_acme.FC5_complete01_regrid.csh.ABIGAIL

It should create my configuration with the latest version of master. It should run out of the box, including doing all of the setup and a 5-day simulation.

The script can be run from anywhere, but it is cleanest to put it in a new directory and run it from there.

mt5555 commented 8 years ago

Rob mentioned above that the ERT tests will check history restart.

ACME only has one ERT test (ne16_g37 B1850C5) that is somewhat similar to what Abby was running. That test is actually failing, but it reports errors in cice and clm history (not atmosphere). I think it is also restarting on 1 month boundaries, so it won't check the bug reported here.

Abby, can you run an ERT test? I think this should be sufficient to see if the bug is still present:

./create_test ERT_Ld10.ne30_ne30.FC5

mt5555 commented 8 years ago

Update: the 10 day test I recommended above won't work - it does test a partial month restart, but doesn't run long enough to output an h0 file, so it won't test that.

We need some help from @rljacob here.

rljacob commented 8 years ago

Reassigning to @singhbalwinder since he dealt with the CAM IO PR.

What you have to do is alter the namelist so that the history frequency is N days and then make the restart frequency N/2.

ERT will compare normally written history files but I don't know if it looks at history restarts.
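
The N and N/2 rule above can be sketched as a small Python helper. This is only an illustration: the helper name is made up, and it assumes the relevant knobs are CAM's `nhtfrq` namelist variable (where a negative value is an interval in hours) and the `REST_OPTION` / `REST_N` settings in env_run.xml. Check the names against the version of the scripts you are running.

```python
def restart_test_settings(history_days):
    """Settings that put a restart in the middle of a history averaging period.

    Hypothetical helper, not part of the ACME scripts.
    """
    if history_days < 2:
        raise ValueError("need at least a 2-day history period")
    return {
        # assumed CAM convention: negative nhtfrq means an interval in hours
        "nhtfrq": -24 * history_days,
        # assumed env_run.xml settings: write a restart every N/2 days
        "REST_OPTION": "ndays",
        "REST_N": history_days // 2,
    }


settings = restart_test_settings(10)
for name, value in settings.items():
    print(f"{name} = {value}")
```

For a 10-day history period this gives a history interval of 240 hours and a restart every 5 days, so the restart lands halfway through the averaging window.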

abigailgaddis commented 8 years ago

@mt5555 I'm not sure that this will detect the bug, depending on what the ERT checks. Does it check the output of the month/day after the restart, or does it just test that the data input to the model at restart is the same as the data coming out of the model at the first time step?

@cameronsmith1 I've got output from two simulations, one for 1.5 days, restart, 1.5 days and one for 3 days, with daily output. (Apologies for the long wait, I had some difficulties finding where things are with the new run script.)

It looks like the bug may be fixed as of v1.0.0-alpha.3-45-g2c82907? The restarted and non-restarted numbers are the same.

Global average 3D T at ~850 mb

day   not restarted   restarted
0     205.1885        205.1885
1     205.2108        205.1885
2     205.2736        205.2736
3     205.3263        205.3263

mt5555 commented 8 years ago

I just verified that this will be a good test:

./create_test ERT_Ld31.ne16_ne16.FC5

ERT_Ld31 will run for 31 days, writing a restart at day 16. It will then do a 15 day restart run. The h0 files from the original 1 month run, and the restart run are then compared:

COMMENT for ERT_Ld31.ne16_ne16.FC5.skybridge_intel : cam.h0.nc : test compare cam.h0 (.base and .rest files)
COMMENT for ERT_Ld31.ne16_ne16.FC5.skybridge_intel : cice.h.nc : test compare cice.h (.base and .rest files)
COMMENT for ERT_Ld31.ne16_ne16.FC5.skybridge_intel : clm2.h0.nc : test compare clm2.h0 (.base and .rest files)
COMMENT for ERT_Ld31.ne16_ne16.FC5.skybridge_intel : cpl.hi.nc : test compare cpl.hi (.base and .rest files)

Results for this test on skybridge: CAM h0 output is BFB. Both CICE and CLM fail.

So I agree with @abigailgaddis and this bug appears to be fixed in CAM.
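
For readers unfamiliar with the term, BFB (bit for bit) is stricter than "equal within a tolerance". A minimal Python sketch of such a check follows. The function names and sample values are illustrative only, and this is not how the ERT test itself compares files.

```python
import struct


def bits(x):
    """Return the raw 64-bit pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_bfb(base, rest):
    """True only if both sequences have the same length and identical bit patterns."""
    return len(base) == len(rest) and all(
        bits(a) == bits(b) for a, b in zip(base, rest)
    )


base = [205.25, 205.5, 205.75]            # illustrative values only
same = list(base)
nearly = [205.25, 205.5, 205.75 + 1e-12]  # differs far below printed precision

print(is_bfb(base, same))
print(is_bfb(base, nearly))
```

The second comparison fails even though the two fields would print identically to four decimal places, which is why a BFB pass is strong evidence that the restarted run reproduces the base run.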

mt5555 commented 8 years ago

for the record, here are the CICE and CLM failures.

/gscratch/mataylo/acme_scratch/skybridge/ERT_Ld31.ne16_ne16.FC5.skybridge_intel.20160324_160826/run/ERT_Ld31.ne16_ne16.FC5.skybridge_in$

RMS time_bounds 1.1314E+01 NORMALIZED 2.9424E-02
RMS hi 1.9093E-02 NORMALIZED 2.0927E-01
RMS hs 4.9096E-03 NORMALIZED 5.7696E-01
RMS fs 3.0721E-02 NORMALIZED 6.3013E-01
RMS Tsfc 7.7491E-01 NORMALIZED 2.4696E-01
RMS aice 1.2564E+00 NORMALIZED 2.3829E-01
RMS qi Infinity NORMALIZED Infinity
RMS qs 2.1230E+16 NORMALIZED 5.4238E-01
RMS fswdn 1.6649E+01 NORMALIZED 8.1195E-02
RMS fswup 1.6482E+01 NORMALIZED 8.0788E-02
RMS flwdn 6.2639E+00 NORMALIZED 1.7800E-02
RMS snow 2.1010E-02 NORMALIZED 1.0473E+00
RMS snow_ai 6.9869E-03 NORMALIZED 2.2414E+00
RMS rain 1.5400E-01 NORMALIZED 4.5848E-01
RMS rain_ai 6.8895E-04 NORMALIZED 1.1084E+01
RMS fswfac 1.2414E-01 NORMALIZED 1.2253E-01
RMS fswabs 2.8760E+00 NORMALIZED 1.2465E+00
RMS fswabs_ai 1.0568E+00 NORMALIZED 1.0284E+00
RMS alvdr 1.2465E+00 NORMALIZED 7.9060E-01
RMS alidr 9.4259E-01 NORMALIZED 8.0279E-01
RMS alvdf 1.2773E+00 NORMALIZED 8.2529E-01
RMS alidf 9.1300E-01 NORMALIZED 8.5667E-01
RMS albice 1.6721E+00 NORMALIZED 4.5241E+00
RMS albsno 1.4667E+00 NORMALIZED 5.5141E-01
RMS albpnd 4.2469E-01 NORMALIZED 6.0370E+00
RMS coszen 1.3731E-02 NORMALIZED 8.1844E-02
RMS flat 6.0227E-01 NORMALIZED 1.6784E+00
RMS flat_ai 2.6011E-01 NORMALIZED 1.6599E+00
RMS fsens 8.0363E-01 NORMALIZED 9.5795E-01
RMS fsens_ai 6.0605E-01 NORMALIZED 1.1270E+00
RMS flwup 2.8515E+00 NORMALIZED 9.4379E-03
RMS flwup_ai 3.8566E+00 NORMALIZED 3.2821E-01
RMS evap 1.8457E-03 NORMALIZED 1.6877E+00
RMS evap_ai 7.9676E-04 NORMALIZED 1.6737E+00
RMS Tair 8.2921E-01 NORMALIZED 4.7411E-02
RMS Tref 8.5352E-01 NORMALIZED 2.9656E-03
RMS Qref 4.4158E-01 NORMALIZED 4.3674E-02
RMS melts 4.0821E-02 NORMALIZED 1.1514E+01
RMS fswthru 5.5417E-01 NORMALIZED 2.4648E+00
RMS fswthru_ai 1.7931E-01 NORMALIZED 2.4821E+00
RMS ice_present 3.1887E-02 NORMALIZED 3.8190E-01
RMS fsurf_ai 1.1118E+00 NORMALIZED 1.0180E+00
RMS fcondtop_ai 8.7930E-01 NORMALIZED 7.9753E-01
/gscratch/mataylo/acme_scratch/skybridge/ERT_Ld31.ne16_ne16.FC5.skybridge_intel.20160324_160826/run/ERT_Ld31.ne16_ne16.FC5.skybridge_i$

RMS BUILDHEAT 1.9406E+01 NORMALIZED 1.3072E+00
RMS HEAT_FROM_AC 9.8878E-04 NORMALIZED 3.2957E+01
RMS WASTEHEAT 1.8663E+01 NORMALIZED 1.5064E+00
FILLDIFF BUILDHEAT
FILLDIFF HEAT_FROM_AC
FILLDIFF WASTEHEAT

mt5555 commented 8 years ago

any objections if we close this issue?

kevans32 commented 8 years ago

Nope. Is there a place folks should record known bugs in deprecated versions of ACME that we should require folks to check before running new sims with old models (I imagine that will be rare but will happen)?

From: Mark Taylor notifications@github.com
Reply-To: ACME-Climate/ACME reply@reply.github.com
Date: Thursday, March 24, 2016 at 2:28 PM
To: ACME-Climate/ACME ACME@noreply.github.com
Cc: "Evans, Katherine J." evanskj@ornl.gov
Subject: Re: [ACME] Restart output partial averaging incorrect in ne16 F1850, possibly ne30. (#757)


cameronsmith1 commented 8 years ago

I am happy if we close this. The best place I know of to document bugs for posterity is https://acme-climate.atlassian.net/wiki/display/ATM/Problems+running+v1-alpha+model