v1.0.0-alpha.6 is not bit-for-bit across restarts for WCYCL2000 case

golaz commented 8 years ago

My A_WCYCL2000 ne30_oEC alpha6 simulation crashed last week during year 23. I reran the year in small increments in order to get a restart file close to the crash point. As it turned out, the model was able to continue pass the original failure.

@PeterCaldwell suggested to check whether my rerun is bit-for-bit identical to my initial simulation. Unfortunately it is not. The global integrals in atm.log* are identical for the first time step of the restart, but then diverge.

Original simulation:

nstep, te   385441   0.33063761817603474E+10   0.33063812575473948E+10   0.28080246939116995E-03   0.98485669115724886E+05
nstep, te   385442   0.33063591788849158E+10   0.33063643582962089E+10   0.28653505266863723E-03   0.98485707853365748E+05
nstep, te   385443   0.33063493095645571E+10   0.33063530899253807E+10   0.20913677691368276E-03   0.98485753633711473E+05
...

Rerun:

nstep, te   385441   0.33063761817603474E+10   0.33063812575473948E+10   0.28080246939116995E-03   0.98485669115724886E+05
nstep, te   385442   0.33063591789527164E+10   0.33063643582696991E+10   0.28652983524590721E-03   0.98485707851780651E+05
nstep, te   385443   0.33063493116577883E+10   0.33063530915372243E+10   0.20911014563527259E-03   0.98485753632128806E+05
...

douglasjacobsen commented 8 years ago

@golaz Do you have coupler history files for the same time in the two runs?

golaz commented 8 years ago

No, I don't. They were overwritten.

singhbalwinder commented 8 years ago

Does this compset and grid combination pass the ERS test? Has anybody checked?

golaz commented 8 years ago

@singhbalwinder : can you check?

singhbalwinder commented 8 years ago

Sure. Would you please copy-paste your create_newcase line here so that I can test the exact same case you are running?

golaz commented 8 years ago

create_newcase -case ... -mach edison -compset A_WCYCL2000 -res ne30_oEC -project acme -pecount M

singhbalwinder commented 8 years ago

Thanks. I am running a test now.

golaz commented 8 years ago

@singhbalwinder : can check the outcome of your test?

singhbalwinder commented 8 years ago

I was trying to run the test for the shortest period of time so I ran a 5 time step ERS test. The test failed to execute as runoff model is not invoked during the first 5 time steps (got an error that rpointer.rof is missing). I have again submitted the test for 5 days and it is still in queue.

singhbalwinder commented 8 years ago

@golaz : My 5 day ERS (restart) test failed with differences in the following fields:

---summarizing more details of test failures if any: --- 

/lustre/atlas1/cli112/scratch/bsingh/tests_acme//ERS_Ld5.ne30_oEC.A_WCYCL2000.eos_intel.wcyc1/run/ERS_Ld5.ne30_oEC.A_WCYCL2000.eos_intel.wcyc1.cpl.hi.0001-01-06-00000.nc.rest.cprnc.out had the following fields that are NOT b4b  

   RMS fraca_ifrac                      1.8726E-04            NORMALIZED  3.5283E-03
    RMS fraca_ofrac                      1.8726E-04            NORMALIZED  2.8659E-04
    RMS x2a_Sf_ifrac                     1.8726E-04            NORMALIZED  3.5283E-03
    RMS x2a_Sf_ofrac                     1.8726E-04            NORMALIZED  2.8659E-04
    RMS x2a_Sx_avsdr                     3.1267E-03            NORMALIZED  5.3947E-03
    RMS x2a_Sx_anidr                     1.6318E-03            NORMALIZED  2.8696E-03
    RMS x2a_Sx_avsdf                     2.2931E-03            NORMALIZED  4.0867E-03
    RMS x2a_Sx_anidf                     1.3430E-03            NORMALIZED  2.4374E-03
    RMS x2a_Sx_tref                      2.7805E-01            NORMALIZED  9.7045E-04
    RMS x2a_Sx_qref                      1.5438E-04            NORMALIZED  1.5724E-02
    RMS x2a_So_t                         2.2110E-02            NORMALIZED  1.0202E-04
    RMS x2a_Sx_t                         3.9815E-01            NORMALIZED  1.3866E-03
    RMS x2a_Sl_fv                        6.9203E-02            NORMALIZED  6.8325E-01
    RMS x2a_Sl_ram1                      9.2521E+01            NORMALIZED  1.0474E+00
    RMS x2a_Sl_snowh                     6.0507E-05            NORMALIZED  1.6135E-03
    RMS x2a_Si_snowh                     8.6817E-05            NORMALIZED  6.6718E-03
    RMS x2a_So_ssq                       2.7858E-05            NORMALIZED  2.5812E-03
    RMS x2a_So_re                        3.2522E-04            NORMALIZED  1.1545E-02
    RMS x2a_Sx_u10                       5.4009E-01            NORMALIZED  8.5589E-02
    RMS x2a_So_ustar                     7.8863E-03            NORMALIZED  4.1135E-02
    RMS x2a_Faxx_taux                    7.1787E-02            NORMALIZED  7.4534E-01
    RMS x2a_Faxx_tauy                    5.4292E-02            NORMALIZED  7.1477E-01
    RMS x2a_Faxx_lat                     1.0923E+01            NORMALIZED  1.3072E-01
    RMS x2a_Faxx_sen                     7.6278E+00            NORMALIZED  3.5537E-01
    RMS x2a_Faxx_lwup                    2.0133E+00            NORMALIZED  5.1454E-03
    RMS x2a_Faxx_evap                    4.3668E-06            NORMALIZED  1.3084E-01
    RMS x2a_Fall_flxdst1                 6.1411E-10            NORMALIZED  7.5156E+00
    RMS x2a_Fall_flxdst2                 3.2964E-09            NORMALIZED  7.5156E+00
    RMS x2a_Fall_flxdst3                 7.7297E-09            NORMALIZED  7.5156E+00
    RMS x2a_Fall_flxdst4                 7.2811E-09            NORMALIZED  7.5156E+00

I just checked and found out that there is no ERS test for this compset in the testing suite. I think we should fix this and add a test to detect such failures.

jonbob commented 8 years ago

@singhbalwinder : I'll try it as well and see if I can figure out which component is having issues

jonbob commented 8 years ago

@singhbalwinder : I did replicate your failure for the ERS test using the A_WCYCL compset on edison. A similar "G" case -- meaning active ocn/cice -- passed, so I'll try a test with atm/lnd and see what happens.

singhbalwinder commented 8 years ago

Thanks @jonbob. I did run an ERS test with an F case (active atm/lnd) using a different grid and it passed. I will be interested in your results with active atm/lnd test.

rljacob commented 8 years ago

@singhbalwinder, did you do a F1850C5AV1C-L case?

singhbalwinder commented 8 years ago

No. I used FC5AV1C-02 for the test and it was a few weeks back.

EDIT: I tested FC5AV1C-01 actually and it passed. I didn't test FCAV1C-02 for ERS.

jonbob commented 8 years ago

@rljacob , I'm using FC5AV1C-03 -- is that OK? But stuck in the queue waiting for processors. If it's not ocn/ice or atm/lnd, what does that leave? Mosart?

rljacob commented 8 years ago

Wait I guess that should be F2000C5AV1C-L which I don't think exists as a compset.

rljacob commented 8 years ago

Yes FC5AV1C-03 is ok.

jonbob commented 8 years ago

@singhbalwinder: my F case failed the test. It ran fine, but just failed the compare part....

singhbalwinder commented 8 years ago

Interesting. I will take a look and report back.

jonbob commented 8 years ago

@singhbalwinder It also failed on one of our local IC machines using the gnu compiler. So not a NERSC or intel specific issue....

rljacob commented 8 years ago

ERS test with WCYCL2000 also fails on sandia machines for both alpha.6 and alpha.5. @golaz do you know that restarts were working in alpha.5 runs?

golaz commented 8 years ago

@rljacob : I don't know whether restarts were working with alpha.5.

singhbalwinder commented 8 years ago

I got the same failure for FC5AV1C-03 on EOS.

singhbalwinder commented 8 years ago

FC5AV1C-02 also fails the ERS test on EOS.

rljacob commented 8 years ago

According to this comment https://github.com/ACME-Climate/ACME/issues/802#issuecomment-214523508 by @jonbob an ERS_Ld7.ne30_oEC.A_WCYCL2000.corip1_intel test passed with adding -no-opt-dynamic-align option to FFLAGS. Not sure exactly which version but is was BEFORE the alpha.5 tag and after alpha.4. So everyone try an ERS test with alpha.4 on your favorite machine.

jonbob commented 8 years ago

@rljacob I tried on one of our IC machines -- alpha.4 passed and alpha.5 failed

singhbalwinder commented 8 years ago

I tested FC5AV1C-01 and it passes the test while FC5AV1C-02 fails the test. I am looking into it to learn more about the failure.

rljacob commented 8 years ago

That makes sense with what @jonbob found. alpha.4 had FC5AV1C-01 while alpha.5 had FC5AV1C-02.

jgfouca commented 8 years ago

@rljacob : I tried ERS.ne30_oEC.A_WCYCL2000 with v1.0.0-alpha.4 on skybridge and it worked. It looks like roughly 30 merges were done between v1.0.0-alpha.4 and v1.0.0-alpha.5. I could try acme_bisect if you want.

singhbalwinder commented 8 years ago

I tested and I think it is the commit dc4d79d5472, where FC5AV1C-02 was introduced,broke the ERS test. There are very minor changes made to the code in this commit. I am currently looking into it.

golaz commented 8 years ago

@singhbalwinder : did you have a chance to look more into this?

singhbalwinder commented 8 years ago

I did look at it and found out that the restart works when do_tms (turbulent mountain stress) is .true. When it is false, the restart is not BFB. I am still not able to track the exact reason behind it. I am working on it.

golaz commented 8 years ago

@singhbalwinder, thanks, this helps narrow down the problem. I don't know if it is related, but I do recall some discussion about a bug in the interaction between CLUBB and TMS. The atmosphere group was trying to turn off TMS (using do_tms), but something was still hard coded in CLUBB. You could check with @polunma, he probably knows more about this.

singhbalwinder commented 8 years ago

I remember fixing that issue in the code. I have already added that fix in the code I am working with.

I am going to explain here what I have found so far, somebody might have some insight into it.

When do_tms is true, compute_tms is invoked to get a value for "ksrftms" variable, otherwise (if do_tms is false) this variable is set to 0.0_r8. "ksrftms" is used as follows (clubb_intr.F90):

upwp_sfc = upwp_sfc-((ksrftms(i)*state1%u(i,pver))/rho_ds_zm(1))
vpwp_sfc = vpwp_sfc-((ksrftms(i)*state1%v(i,pver))/rho_ds_zm(1))

I noticed that if I assign "ksrftms" any other value except for 0.0_r8, the answers are BFB. If "ksrftms" is 0.0_r8, the asnwers are non-BFB (e.g. state%pdel is non-BFB at the top of tphysbc).

mt5555 commented 8 years ago

@ndkeen is going to try and help with this.
Based on @singhbalwinder 's comments: I think this bug will show up in the ERS test with the FC5AV1C-02 compset.

I think you can run this test via: ./create_test ERS.ne30_oEC.FC5AV1C-02

The defaults might run too long and not enough cores. To specify number of days and cores: (5 days, 432 cores), something like this works:

./create_test ERS_Ld5_P432.ne30_oEC.FC5AV1C-02 Edit: @singhbalwinder can reproduce with just 5 timsteps, which would be:

432 cores, 5 timesteps ERS test: ./create_test ERS_Ln5_P432.ne30_oEC.FC5AV1C-02

mt5555 commented 8 years ago

some info from @philrasch :

Briefly, we can now demonstrate that the model does not restart when we turn off turbulent mountain stress (TMS), and Balwinder has isolated the problem to a couple of lines of code that enable or prevent the problem. And it apparently is reproducible on 2 or 3 different compilers, so it is not a compiler problem. The problem is testable with a 5 day simulation, but it is not testable at lower horizontal or vertical resolution, unfortunately. Balwinder asked Wuyin for help late last week. We want TMS off. The problem is also present apparently when we just make TMS weaker.

singhbalwinder commented 8 years ago

Thanks @mt5555 and @ndkeen ! I am able to reproduce this behavior with a 5 time step run. I have described above which code lines are responsible for this behavior. I am working with @wlin7 now to find the reason behind this problem. As @philrasch described above, the problem is reproducible on several different machines and compilers (Intel, PGI)

ndkeen commented 8 years ago

Although we are out of time at NERSC, I built and submitted a job into scavenger Q on edison. However, it looks like scavenger jobs on Cori are running quickly -- possibly due to users slowly adjusting to the new software upgrade. One of my current tasks is to adjust the environment vars and test ACME. I did this today, making some changes to env_mach_specific. One of the acme_developer tests fail (a compare fail with ERP_Ln9.ne30_ne30.FC5.corip1_intel.cam-outfrq9s) so it may not be ready for next (you can check out for yourself in ndk/mach-files/cori-module-mods-after-upgrade).

I ran ./create_test ERS_Ld5_P432.ne30_oEC.FC5AV1C-02 and I do see compare failures. I'm re-running with Debug.

I also tried Ln5 as @mt5555 suggests, but I must be missing a file. This example shows with GNU, but also tried with Intel. ./create_test ERS_Ln5_P432.ne30_oEC.FC5AV1C-02 --compiler gnu

I got an error with GNU:

001:  Opened existing file ERS_Ln5_P432.ne30_oEC.FC5AV1C-02.corip1_gnu.g906m06gnu01.clm2.rh0.0001-01-01-05\
400.nc     1769472
000: At line 265 of file /global/cscratch1/sd/ndk/wacmy/master06/cime/..//components/rtm/src/riverroute/Rt\
mRestFile.F90 (unit = 93, file = './rpointer.rof')
000: Fortran runtime error: End of file

Note on Cori we now have Intel version 16.0.3.210 which is slight bump from the previous version 16.0.0.109 (I don't even see this previous version installed). The GNU compiler version is 5.3.0

worleyph commented 8 years ago

@ndkeen, I saw an update fly by on the CIME repository related to updating modules for Corip1. @jgfouca , any chance that you could forward this information to @ndkeen ?

jgfouca commented 8 years ago

@ndkeen : This was the PR @worleyph is talking about: https://github.com/ESMCI/cime/pull/236

rljacob commented 8 years ago

The corip1 env change discussion should be moved to issue #932

ndkeen commented 8 years ago

I ran the ./create_test ERS_Ld5_P432.ne30_oEC.FC5AV1C-02 with the GNU compiler on Cori and I also get a failure in compare:

FAIL ERS_Ld5_P432.ne30_oEC.FC5AV1C-02.corip1_gnu compare

I tried Intel with Debug but having trouble getting it to finish (running out of walltime).

Also, I'm using master which may not be the right code base.

singhbalwinder commented 8 years ago

Thanks @ndkeen for the update. I am not sure why do you think master might not be the right code base (the bug is present in the master code). I didn't run the ERS test in debug mode as it will take a long time to run due to high vertical and horizontal resolution. I was not able to reproduce it using a lower resolution.

rljacob commented 8 years ago

@singhbalwinder when you say "I remember fixing that issue in the code. I have already added that fix in the code I am working with" in https://github.com/ACME-Climate/ACME/issues/906#issuecomment-227601418 does that mean its on a private branch or is the fix in master?

singhbalwinder commented 8 years ago

The fix is in the master (PR #892)

worleyph commented 8 years ago

@singhbalwinder , I am confused by your comment

 "ksrftms" is used as follows (clubb_intr.F90):

 upwp_sfc = upwp_sfc-((ksrftms(i)*state1%u(i,pver))/rho_ds_zm(1))
 vpwp_sfc = vpwp_sfc-((ksrftms(i)*state1%v(i,pver))/rho_ds_zm(1))

as this code is within the if-test:

   if ( do_tms) then
      upwp_sfc = upwp_sfc-((ksrftms(i)*state1%u(i,pver))/rho_ds_zm(1))
      vpwp_sfc = vpwp_sfc-((ksrftms(i)*state1%v(i,pver))/rho_ds_zm(1))
   endif

so is not executed when do_tms is false. So the sensitivity to ksrftms must come from some other location?

ndkeen commented 8 years ago

I have to admit that I'm not sure what is meant by v1.0.0-alpha.6, but it likely means a version of the code. By using master, I might be adding confusion -- but if you say it's OK. :)

I also tried to verify that setting do_tms=.true. to see if the problem goes away, but I could not figure out how to do it. I see it in run/atm_in, but setting it there doesn't seem right.

worleyph commented 8 years ago

@ndkeen, you can probably just add this to user_nl_atm .

singhbalwinder commented 8 years ago

@worleyph : You are right. These lines are not executed when do_tms is false. Other way to get the same effect is to set ksrftms=0 when do_tms is false and remove the if (do_tms) line. I modified my code like that so that I can play with ksrftms to understand its sensitivity. Does that make sense? We have learnt that if we reduce the effect of ksrftms (by multiplying it with 1.e-3 and setting do_tms=.true.), the restarts are non-BFB.

@ndkeen : For ERS tests, you can modify the use case file for this compset (FC5AV1C-03) to set do_tms = .true. If you are manually running restart tests then you can do what @worleyph suggested above. The file to modify in that case would be user_nl_cam in the CASEROOT directory.

E3SM-Project / E3SM

v1.0.0-alpha.6 is not bit-for-bit across restarts for WCYCL2000 case #906