COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

Issues related to stop_option = nsteps in nuopc.runconfig #157

Closed minghangli-uni closed 1 month ago

minghangli-uni commented 5 months ago

It appears that there are issues related to stop_option = nsteps, contrary to what was proposed in the wiki.

{model_name}_cpl_dt are unused and the driver time-step equals the coupling time-step set in nuopc.runseq

The default CLOCK setup in nuopc.runconfig for the current 0.25deg configuration is listed below, except that restart_n and stop_n are changed to 10, and restart_option and stop_option are set to nsteps:

CLOCK_attributes::
     atm_cpl_dt = 3600
     calendar = NO_LEAP
     end_restart = .false.
     glc_avg_period = yearly
     glc_cpl_dt = 86400
     history_ymd = -999
     ice_cpl_dt = 3600
     lnd_cpl_dt = 3600
     ocn_cpl_dt = 3600
     restart_n = 10
     restart_option = nsteps
     restart_ymd = -999
     rof_cpl_dt = 3600
     start_tod = 0
     start_ymd = 19000101
     stop_n = 10
     stop_option = nsteps
     stop_tod = 0
     stop_ymd = -999
     tprof_n = -999
     tprof_option = never
     tprof_ymd = -999
     wav_cpl_dt = 3600

An error occurs with the above setup:

ERROR PET239 src/addon/NUOPC/src/NUOPC_Base.F90:956 Invalid argument - setClock timeStep=1350s is not a divisor of runDuration=36000s

This error suggests that the {model_name}_cpl_dt values are still in use: the run duration of 36000s is 3600*10, and 36000/1350 is not a whole number. The ESMF profiling results give further evidence. With stop_n changed to 12 the run succeeds, because 3600*12 = 43200 is divisible by 1350, giving 32 steps. This matches the count for [OCN] RunPhase1, which is 32 rather than 12.

Region                                                                 PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
  [ESMF]                                                               1644   1644   1        248.4059    211.3903    201     256.8084    937
    [ensemble] RunPhase1                                               1644   1644   1        109.6005    109.2641    13      109.7894    1599
      [ESM0001] RunPhase1                                              1644   1644   1        109.6003    109.2638    11      109.7892    1599
        [OCN] RunPhase1                                                1344   1344   32       82.4511     82.2578     1138    82.9324     240
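The arithmetic behind the error and the profiling count can be reproduced with a minimal Python sketch. It assumes, as the error message suggests, that the run duration is stop_n multiplied by a driver timestep of 3600s and that the 1350s coupling timestep must divide it exactly. The function name `coupling_steps` is hypothetical and is not part of NUOPC or CMEPS.

```python
def coupling_steps(stop_n: int, driver_dt: int, coupling_dt: int) -> int:
    """Return how many coupling steps fit in a run of stop_n driver steps.

    Raises ValueError when coupling_dt does not divide the run duration,
    imitating (not reproducing) the NUOPC setClock check quoted above.
    """
    run_duration = stop_n * driver_dt
    if run_duration % coupling_dt != 0:
        raise ValueError(
            f"timeStep={coupling_dt}s is not a divisor of runDuration={run_duration}s"
        )
    return run_duration // coupling_dt


# stop_n = 12: 43200s / 1350s gives 32 steps, matching the [OCN] RunPhase1 count.
print(coupling_steps(12, 3600, 1350))

# stop_n = 10: 36000s is not a multiple of 1350s, so the check fails.
try:
    coupling_steps(10, 3600, 1350)
except ValueError as err:
    print(err)
```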

NB:

Stop options in days and years work properly, so this issue should not affect production runs. However, it is worth noting for anyone running short tests.

anton-seaice commented 5 months ago

Thanks minghang for the explanation :) That behaviour seems correct? We could add a note to the wiki

We can do 1-timestep in 1-deg configs because the ocean timestep (DT_THERM) equals the coupling timestep ocn_cpl_dt

As you said, in 0.25 degree, the smallest stop_n that gives a whole number of DT_THERM is 12. (i.e. 12*3600/1350 is a whole number)

minghangli-uni commented 5 months ago

For the existing 0.25 deg config, DT_THERM equals ocn_cpl_dt too, which is 1350s.

As you said, in 0.25 degree, the smallest stop_n that gives a whole number of DT_THERM is 12. (i.e. 12*3600/1350 is a whole number)

I don't follow... The smallest stop_n that gives a whole number is 3, right?
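The smallest such stop_n can be checked with a short Python sketch. It assumes the only constraint is that stop_n * 3600 must be a multiple of 1350; the helper name `smallest_stop_n` is hypothetical.

```python
from math import gcd


def smallest_stop_n(driver_dt: int, coupling_dt: int) -> int:
    """Smallest n >= 1 such that n * driver_dt is a multiple of coupling_dt."""
    # n * driver_dt is divisible by coupling_dt exactly when n is a
    # multiple of coupling_dt / gcd(driver_dt, coupling_dt).
    return coupling_dt // gcd(driver_dt, coupling_dt)


n = smallest_stop_n(3600, 1350)
print(n, n * 3600 // 1350)  # 3 driver steps correspond to 8 steps of 1350s
```

Under this assumption, 12 is a valid stop_n (it is a multiple of 3) but not the smallest one.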

minghangli-uni commented 5 months ago

Are you suggesting that since the ocn_cpl_dt is 3600, achieving a whole number with 12 nsteps (12*3600/1350) would indicate correctness?

We might be discussing different things here. The wiki suggests that {model_name}_cpl_dt are unused and that the driver time-step equals the coupling time-step set in nuopc.runseq. But when restart_option is set to nsteps, {model_name}_cpl_dt comes into play and overrides the coupling timestep set in nuopc.runseq.

dougiesquire commented 5 months ago

Good catch @minghangli-uni. So it looks like we are wrong about the <component>_cpl_dt variables not being used. The question then is where/why/how are they being used. Are they just used for sanity checks like this one or are they doing more than that?

Note I did write a caveat in the wiki 😉:

However, I would feel more comfortable if I understood why {model_name}_cpl_dt are ever needed...

dougiesquire commented 5 months ago

@ezhilsabareesh8, I presume you didn't try to run either of these PRs with restart_option = nsteps?:

ezhilsabareesh8 commented 5 months ago

@dougiesquire, I just tried running the IAF config for 5 steps with restart_option = nsteps; it is working fine, and the ice output prints the correct dt.

  Calendar
 --------------------------------
 days_per_year    =            365  : number of days in a model year
 use_leap_years   =              T  : leap days are included
 dt               =        3600.00  : model time step

CLOCK_attributes::
     atm_cpl_dt = 99999     #not used
     calendar = GREGORIAN
     end_restart = .false.
     glc_avg_period = yearly
     glc_cpl_dt = 86400
     history_ymd = -999
     ice_cpl_dt = 99999    #not used
     lnd_cpl_dt = 99999    #not used
     ocn_cpl_dt = 99999    #not used
     restart_n = 5
     restart_option = nsteps
     restart_ymd = -999
     rof_cpl_dt = 99999    #not used
     start_tod = 0
     start_ymd = 19580101
     stop_n = 5
     stop_option = nsteps
     stop_tod = 0
     stop_ymd = -999
     tprof_n = -999
     tprof_option = never
     tprof_ymd = -999
     wav_cpl_dt = 99999    #not used
::

minghangli-uni commented 5 months ago

@ezhilsabareesh8 Can you please post the directory for this iaf run?

anton-seaice commented 5 months ago

It does look like the cpl_dt are important for setting at least the mediator timestep:

https://github.com/ESCOMP/CMEPS/blob/3b1e50baab57c739434a4e62937a96a7bea3faf3/cesm/driver/esm_time_mod.F90#L157

It's surprising it just works without it!

ezhilsabareesh8 commented 5 months ago

@ezhilsabareesh8 Can you please post the directory for this iaf run?

It's in my home directory, let me copy it to a different location. I am just running the MOM6-CICE6 IAF configuration with the nuopc settings mentioned above.

dougiesquire commented 5 months ago

It does look like the cpl_dt are important for setting at least the mediator timestep:

https://github.com/ESCOMP/CMEPS/blob/3b1e50baab57c739434a4e62937a96a7bea3faf3/cesm/driver/esm_time_mod.F90#L157

@anton-seaice, this is what I wrote about this in the wiki, but clearly deeper investigation is needed:

The nuopc.runseq file specifies the run sequence of the configuration. The run sequence for current ACCESS-OM3 configurations comprises a single loop, with the coupling time-step specified at the start of the loop (this is the “timeStep” of the loop in NUOPC-speak).

Note, that there are parameters {model_name}_cpl_dt set in the CLOCK_attributes section of nuopc.runconfig. The only place these are used in CMEPS is to set the driver time-step as the minimum of these values. However from the NUOPC documentation and CMEPS codebase:

"Each time loop has its own associated clock object. NUOPC manages these clock objects, i.e. their creation and destruction, as well as startTime, endTime, timeStep adjustments during the execution. The outer most time loop of the run sequence is a special case. It uses the driver clock itself. If a single outer most loop is defined in the run sequence provided by freeFormat, this loop becomes the driver loop level directly. Therefore, setting the timeStep or runDuration for the outer most time loop results modifying the driver clock itself. However, for cases with concatenated loops on the upper level of the run sequence in freeFormat, a single outer loop is added automatically during ingestion, and the driver clock is used for this loop instead."

So I think in our case, {model_name}_cpl_dt are unused and the driver time-step equals the coupling time-step set in nuopc.runseq. Certainly, changing these values seems to have no effect. However, I would feel more comfortable if I understood why {model_name}_cpl_dt are ever needed...

minghangli-uni commented 5 months ago

I just tried running the IAF config for 5 steps with restart_option = nsteps it is working fine

Hi @ezhilsabareesh8, it appears that you haven't modified glc_cpl_dt, which remains at 86400s instead of 99999s.

CLOCK_attributes::
     atm_cpl_dt = 99999     #not used
     calendar = GREGORIAN
     end_restart = .false.
     glc_avg_period = yearly
     glc_cpl_dt = 86400
     history_ymd = -999
     ice_cpl_dt = 99999    #not used
    ...

Hence the updated total runlength was calculated as 86400*5/1350=320, resulting in a whole number and allowing the run to proceed without any issues.
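This calculation can be sketched in Python, assuming (as described in the wiki text quoted earlier in this thread) that the driver time-step is the minimum of the {model_name}_cpl_dt values and that stop_option = nsteps multiplies it by stop_n. The dictionary keys and the function name `nsteps_run_duration` are illustrative only.

```python
from typing import Mapping


def nsteps_run_duration(cpl_dt: Mapping[str, int], stop_n: int) -> int:
    """Run duration in seconds for stop_option = nsteps.

    Assumes the driver timestep is the minimum of the *_cpl_dt values.
    """
    driver_dt = min(cpl_dt.values())
    return driver_dt * stop_n


# Values from the IAF CLOCK_attributes posted above: only glc_cpl_dt
# was left at 86400, so it becomes the minimum.
cpl_dt = {
    "atm": 99999, "glc": 86400, "ice": 99999, "lnd": 99999,
    "ocn": 99999, "rof": 99999, "wav": 99999,
}
duration = nsteps_run_duration(cpl_dt, stop_n=5)
print(duration, duration // 1350, duration % 1350)  # 432000 320 0
```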

the ice output prints the correct dt.

Despite the total runlength updating, the timestep for each component remains unchanged at 1350s. Hence, you can determine your dt_ice_thermo to be 1350s.

ezhilsabareesh8 commented 5 months ago

it appears that you haven't modified glc_cpl_dt, which remains at 86400s instead of 99999s.

When I set glc_cpl_dt to 99999, I am getting the following error

20240506 102538.059 ERROR PET11 (ice_comp_nuopc):(ModelAdvance) CICE clock not in sync with ESMF model clock

minghangli-uni commented 5 months ago

When I set glc_cpl_dt to 99999, I am getting the following error 20240506 102538.059 ERROR PET11 (ice_comp_nuopc):(ModelAdvance) CICE clock not in sync with ESMF model clock

I did the same run but didn't encounter the error you described. Instead, the error message I received was similar to the one I initially reported.

PET00 src/addon/NUOPC/src/NUOPC_Base.F90:956 Invalid argument  - setClock timeStep=1350s is not a divisor of runDuration=99999s

Have you previously reported this issue elsewhere?

minghangli-uni commented 3 months ago

<component>_cpl_dt is used to set up the minimum driver timestep, as explained by @dougiesquire in the wiki. Since our run sequence uses only a single outer loop, the timestep determined here will be overwritten by the coupling time-step.

The only caveat is that, when using stop_option = nsteps, the total run duration is computed from the driver timestep by:

   case (optNSteps,trim(optNSteps)//'s')
      call ESMF_ClockGet(clock, TimeStep=AlarmInterval, rc=rc)
      if (ChkErr(rc,__LINE__,u_FILE_u)) return
      AlarmInterval = AlarmInterval * opt_n

For the other options, the AlarmInterval is hardcoded: 1 second for stop_option = nseconds, 60 seconds for nminutes and 3600 seconds for nhours.

   case (optNMinutes,trim(optNMinutes)//'s')
      call ESMF_TimeIntervalSet(AlarmInterval, s=60, rc=rc)
      AlarmInterval = AlarmInterval * opt_n

The initial AlarmInterval can be regarded as the unit of the stop_option. It is then multiplied by opt_n (AlarmInterval = AlarmInterval * opt_n) to give the total run duration of each run.
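The unit-times-count logic described above can be modelled with a small Python sketch. It is a simplified illustration of the Fortran excerpts, not the CMEPS implementation: only the four options mentioned in this comment are covered, and the names `FIXED_UNIT_SECONDS` and `run_duration_seconds` are hypothetical.

```python
# Fixed units (in seconds) for the options whose interval is hardcoded.
FIXED_UNIT_SECONDS = {"nseconds": 1, "nminutes": 60, "nhours": 3600}


def run_duration_seconds(stop_option: str, stop_n: int, driver_dt: int) -> int:
    """Total run duration as unit * stop_n.

    For nsteps the unit is the driver timestep read from the clock;
    for the other options it is a fixed number of seconds.
    """
    if stop_option == "nsteps":
        unit = driver_dt
    elif stop_option in FIXED_UNIT_SECONDS:
        unit = FIXED_UNIT_SECONDS[stop_option]
    else:
        raise ValueError(f"option not covered by this sketch: {stop_option}")
    return unit * stop_n


# Only nsteps depends on the driver timestep (and hence on the *_cpl_dt values).
print(run_duration_seconds("nsteps", 10, driver_dt=3600))    # 36000
print(run_duration_seconds("nminutes", 10, driver_dt=3600))  # 600
```

This makes the dependence explicit: changing the driver timestep alters the run length only when stop_option = nsteps.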

dougiesquire commented 3 months ago

Thanks @minghangli-uni. So I think the safest/clearest way forward is to set (at least) ocn_cpl_dt to the coupling timestep (and possibly also ice_cpl_dt and even atm_cpl_dt if we think that adds clarity though it's obviously not necessary) and the rest to something large and obviously meaningless (e.g. 99999). The wiki should also be updated. Let's discuss this in the TWG meeting today.

access-hive-bot commented 3 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-meeting-minutes-2024/1734/11

aekiss commented 2 months ago

To avoid confusion, I think we should add a comment to ocn_cpl_dt in nuopc.runconfig, such as:

     ocn_cpl_dt = 1350 # ignored (coupling timestep set by nuopc.runseq) unless stop_option=nsteps

minghangli-uni commented 2 months ago

Thanks @aekiss. We can use the automatic cherry-pick tool to add this comment to all the branches.