E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

EAM: atm_run_mct seems to do something wrong in the first timestep #5904

Open bartgol opened 1 year ago

bartgol commented 1 year ago

It appears that during the first call to atm_run, EAM performs the cam_run2-3-4-1 sequence twice instead of once. The odd behavior can be seen by running the test SMS_Ln5.ne4pg2_oQU480.F2010.mappy_gnu.eam-thetahy_pg2 (part of the e3sm_developer test suite).

If you look at the test input parameters, you will find that

This means that homme's prim_run_subcycle should be invoked 10 times (5 steps x 2 calls/step). However, looking at model_timing_stats, you will notice that the number of calls to the corresponding timer (a:prim_run_subcycle) is 12*NTASKS, i.e. 12 calls per task. After hacking the code and adding some print statements in atm_run and cam_runX, I confirmed the "extra" step:

 atm_run, clock in (ymd,tod):       10101        3600
   inside 'while(.not. dosend)' loop, get_curr_date (ymd,tod):       10101           0
     cam_run2, dtime=   3600.0000000000000     
     cam_run3, dtime=   3600.0000000000000    
     cam_run4
     cam_run1, dtime=   3600.0000000000000
   inside 'while(.not. dosend)' loop, get_curr_date (ymd,tod):       10101        3600
     cam_run2, dtime=   3600.0000000000000     
     cam_run3, dtime=   3600.0000000000000    
     cam_run4
     cam_run1, dtime=   3600.0000000000000
 atm_run, clock in (ymd,tod):       10101        7200
   inside 'while(.not. dosend)' loop, get_curr_date (ymd,tod):       10101        7200
     cam_run2, dtime=   3600.0000000000000     
     cam_run3, dtime=   3600.0000000000000    
     cam_run4
     cam_run1, dtime=   3600.0000000000000
 atm_run, clock in (ymd,tod):       10101       10800
 ...

There's an extra run2-3-4-1 sequence in the first timestep, and each of the two sequences uses a full timestep (it might be "ok", though still puzzling, if the two sequences each used dt/2). This appears to be caused by the EAM internal clock being at t=0 on entry to the first time step, while the CIME clock is already at t=dt. The exit condition for the while loop (dosend) checks whether the CIME and EAM clocks are in sync, and that check fails on the first pass.
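To make the mechanism concrete, here is a toy Python model of the two clocks. It is only a sketch of the behavior seen in the printed output, not EAM or CIME code: the function name `simulate`, the variable names, and the placement of the sync check at the top of the loop (with the loop body still running on the pass where the check succeeds) are my assumptions, inferred from the log above.

```python
# Toy model of the clock comparison; NOT actual EAM/CIME code.
# Assumptions (inferred from the printed output, not from the source):
#   - the driver clock is already at n*dt when atm_run is entered for step n
#   - the component clock starts at t=0
#   - "dosend" is evaluated at the top of the loop, and the run2-3-4-1
#     sequence still executes on the pass where the clocks are in sync

def simulate(n_driver_steps, dt=3600, dyn_calls_per_step=2):
    """Return (physics sequences per atm_run call, total dycore calls)."""
    model_tod = 0  # component clock, seconds
    sequences_per_call = []
    for n in range(1, n_driver_steps + 1):
        driver_tod = n * dt  # driver clock on entry to atm_run
        sequences = 0
        dosend = False
        while not dosend:
            dosend = (model_tod == driver_tod)  # clocks in sync?
            sequences += 1                      # one run2-3-4-1 sequence
            model_tod += dt                     # component clock advances
        sequences_per_call.append(sequences)
    total_dyn_calls = sum(sequences_per_call) * dyn_calls_per_step
    return sequences_per_call, total_dyn_calls


sequences, dyn_calls = simulate(5)
print(sequences, dyn_calls)  # [2, 1, 1, 1, 1] 12
```

With 5 driver steps and 2 dycore calls per step, this toy model gives 6 sequences and 12 dycore calls per task, which is consistent with the 12*NTASKS count reported for a:prim_run_subcycle.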

I think this is not the expected behavior, so I'm labeling it as a bug. But I don't understand EAM's internal timestep logic well enough to be sure of this.

bartgol commented 1 year ago

My guess is that there's a difference in how EAM and CIME assume their internal clocks to be:

These two assumptions conflict and ultimately generate the bug when the two clocks are compared inside the while loop to decide whether EAM has completed its time step, causing EAM to take 2 steps (instead of 1) during the first atm_run call to catch up with CIME's clock.