NCAR / MOM6

NCAR/CESM fork of the Modular Ocean Model v.6 (MOM6)

crash: unused buoyancy fluxes #290

Open minghangli-uni opened 1 month ago

minghangli-uni commented 1 month ago

We encountered this error when the ratio of the tracer timestep to the coupling timestep is 3:1 (e.g., 10800s : 3600s, 5400s : 1800s, or 8100s : 2700s):

FATAL from PE 0: ocean_model_restart was called with unused buoyancy fluxes. For conservation, the ocean restart files can only be created after the buoyancy forcing is applied.

Below is one of the configurations that fails:

DIABATIC_FIRST = False
DT = 1800.0
DT_THERM = 10800.0      ! 6*DT
THERMO_SPANS_COUPLING = True
DTBT_RESET_PERIOD = 10800.0

This issue does not occur with other ratios tested, including 1:1, 2:1, 4:1, 5:1, and 6:1.
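
As a quick arithmetic check, the Python sketch below confirms that the three failing pairs all reduce to the same 3:1 ratio of tracer timestep to coupling interval, and that the settings above give DT_THERM = 6*DT. The function and constant names are illustrative and not part of MOM6 or the CESM configuration. DT_CPL = 3600 is taken from the first example pair (10800s : 3600s).

```python
from fractions import Fraction

def therm_to_cpl_ratio(dt_therm_s: int, dt_cpl_s: int) -> Fraction:
    """Exact ratio of the tracer (thermodynamic) timestep to the coupling interval."""
    return Fraction(dt_therm_s, dt_cpl_s)

# (DT_THERM, coupling interval) pairs in seconds that reportedly crash.
FAILING = [(10800, 3600), (5400, 1800), (8100, 2700)]

# Settings from the configuration above; DT_CPL is assumed from the first pair.
DT, DT_THERM, DT_CPL = 1800, 10800, 3600

if __name__ == "__main__":
    for dt_therm, dt_cpl in FAILING:
        print(f"{dt_therm}s : {dt_cpl}s -> {therm_to_cpl_ratio(dt_therm, dt_cpl)}:1")
    print("DT_THERM / DT =", Fraction(DT_THERM, DT))  # 6, matching the "! 6*DT" comment
    print("DT_CPL / DT   =", Fraction(DT_CPL, DT))    # 2
```

Using `Fraction` keeps the ratios exact, so a non-integer ratio such as 5:2 would show up as such rather than as a rounded float.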

Does anyone with extensive experience in this area have insights into the underlying causes of this problem?

aekiss commented 1 month ago

Just to add: for the settings in the code box above, the config uses ocn_cpl_dt = 3600, i.e. 2*DT.

aekiss commented 1 month ago

FATAL from PE 0: ocean_model_restart was called with unused buoyancy fluxes. For conservation, the ocean restart files can only be created after the buoyancy forcing is applied.

This message comes from the NUOPC cap when fluxes_used is false.

It looks like step_MOM_thermo is the only relevant place where fluxes_used is set to true.

Because we have DIABATIC_FIRST = False, step_MOM_thermo is only called here on (if I understand this correctly) the first dynamic timestep of a thermo timestep:

    if ((CS%t_dyn_rel_adv==0.0) .and. do_thermo .and. (.not.CS%diabatic_first)) then
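
The gating described above can be explored with a toy model. The Python sketch below is hypothetical: `ToyOcean`, `restart_ok_at`, and the assumption that each coupling interval delivers new fluxes and clears fluxes_used are illustrative stand-ins, not MOM6 code. It also ignores the startup lag and any THERMO_SPANS_COUPLING handling in the cap. It only shows which coupling boundaries line up with a thermodynamic step under these simplifications.

```python
from dataclasses import dataclass, field

@dataclass
class ToyOcean:
    """Hypothetical, heavily simplified model of MOM6 buoyancy-flux bookkeeping.

    Assumptions (not taken from the MOM6 source): each coupling interval brings
    new fluxes and clears fluxes_used; dynamics advance in steps of dt; with
    DIABATIC_FIRST = False, thermodynamics runs only when t_dyn_rel_adv wraps
    back to 0.0, and that is the only place fluxes_used becomes True.
    """
    dt: float
    dt_therm: float
    t_dyn_rel_adv: float = 0.0
    fluxes_used: bool = True
    thermo_times: list = field(default_factory=list)

    def step_coupling_interval(self, t_start: float, dt_cpl: float) -> None:
        self.fluxes_used = False  # assumed: fresh fluxes arrive from the coupler
        n_dyn = round(dt_cpl / self.dt)
        for n in range(1, n_dyn + 1):
            self.t_dyn_rel_adv += self.dt  # one dynamic step
            if self.t_dyn_rel_adv >= self.dt_therm:
                self.t_dyn_rel_adv = 0.0  # tracer timestep complete
            # Mirrors: (CS%t_dyn_rel_adv==0.0) .and. do_thermo .and. (.not.CS%diabatic_first)
            if self.t_dyn_rel_adv == 0.0:
                self.fluxes_used = True  # stands in for step_MOM_thermo
                self.thermo_times.append(t_start + n * self.dt)


def restart_ok_at(dt: float, dt_therm: float, dt_cpl: float, n_cpl: int) -> list:
    """For each coupling interval, whether a restart written at its end would pass the check."""
    ocean = ToyOcean(dt=dt, dt_therm=dt_therm)
    ok = []
    for k in range(n_cpl):
        ocean.step_coupling_interval(k * dt_cpl, dt_cpl)
        ok.append(ocean.fluxes_used)
    return ok


if __name__ == "__main__":
    print(restart_ok_at(1800, 10800, 3600, 6))  # [False, False, True, False, False, True]
```

In this toy, a 6-hour run with DT = 1800 s, DT_THERM = 10800 s, and 3600 s coupling ends with fluxes_used true, so whatever triggers the crash must involve details the toy leaves out.
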

gustavo-marques commented 1 month ago

We have not yet tested THERMO_SPANS_COUPLING = True in CESM. Does the error occur if you start from a restart file?

minghangli-uni commented 1 month ago

Thanks for the suggestion, @gustavo-marques.

I tried starting from a restart file, and there was no error.

However, it’s still unclear to me why this issue occurs only with this specific timestep ratio.

gustavo-marques commented 1 month ago

I suspect the issue is because of the lag applied in startup runs. For how long are you running the model? Does the issue go away if you add ocn_cpl_dt to the run length?

minghangli-uni commented 1 month ago

Thanks @gustavo-marques.

I suspect the issue is because of the lag applied in startup runs.

I’ve checked the ESMF log output, and the startup runs seem normal. I compared the logs of a crashed run and a successful one, and they are identical up to line 3889, where the crashed run stops.

The settings for this comparison are:

Test      dt (s)   dt_therm (s)    dt_cpl (s)     Runtime
Crashed   1800     10800 (6*dt)    3600 (2*dt)    6 hours
Worked    1800     7200 (4*dt)     3600 (2*dt)    6 hours

3856 20240725 113304.688 INFO             PET01 (MOM_cap:ModelAdvance)------>Advancing OCN from: 1900  1  1  4  0  0   0
3857 20240725 113304.688 INFO             PET01 --------------------------------> to: 1900  1  1  5  0  0   0
3858 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_lwdn  117.4051      307.0915      436860.3        2250
3859 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_rainc  0.000000      0.000000      0.000000        2250
3860 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_rainl  0.000000     0.5926946E-05 0.4093543E-04    2250
3861 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_snowc  0.000000      0.000000      0.000000        2250
3862 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_snowl  0.000000     0.1068611E-03 0.1294078E-01    2250
3863 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_swdn  64.92579      795.0610      878697.0        2250
3864 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_swndf  11.03738      135.1604      149378.5        2250
3865 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_swndr  20.12699      246.4689      272396.1        2250
3866 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_swnet  59.82733      732.6716      809604.7        2250
3867 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_swvdf  15.58219      190.8146      210887.3        2250
3868 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Faxa_swvdr  18.17922      222.6171      246035.2        2250
3869 20240725 113305.849 INFO             PET01 (datm_comp_run) :ES: Sa_dens  1.249043      1.357094      2947.415        2250
3870 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_pbot  98209.94      102343.4     0.2254037E+09    2250
3871 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_pslv  98209.94      102343.4     0.2254037E+09    2250
3872 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_ptem  262.2230      276.3961      598887.2        2250
3873 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_shum 0.7902362E-03 0.3459137E-02  5.017657        2250
3874 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_tbot  262.2230      276.3961      598887.2        2250
3875 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_u -11.65953      10.55894     -4872.498        2250
3876 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_v -6.302263      8.649424      3156.413        2250
3877 20240725 113305.850 INFO             PET01 (datm_comp_run) :ES: Sa_z  10.00000      10.00000      22500.00        2250
3878 20240725 113305.858 INFO             PET01 (med_phases_profile): done
3879 20240725 113305.878 INFO             PET01 (med_methods_FB_copy): called
3880 20240725 113305.891 INFO             PET01 (ice_comp_nuopc):(ModelAdvance)  called
3881 20240725 113305.891 INFO             PET01 (ice_comp_nuopc):(ModelAdvance) ------>Advancing ICE from: 1900  1  1  5  0  0   0
3882 20240725 113305.891 INFO             PET01 --------------------------------> to: 1900  1  1  6  0  0   0
3883 20240725 113305.892 INFO             PET01 ice_import tfrz_option = linear_salt, ktherm =        1
3884 20240725 113306.321 INFO             PET01 ice_export called
3885 20240725 113306.334 INFO             PET01 (drof_comp_run) :ES: Forr_rofi  0.000000      0.000000      0.000000        2250
3886 20240725 113306.334 INFO             PET01 (drof_comp_run) :ES: Forr_rofl  0.000000      0.000000      0.000000        2250
3887 20240725 113306.334 INFO             PET01 (MOM_cap:ModelAdvance)------>Advancing OCN from: 1900  1  1  5  0  0   0
3888 20240725 113306.334 INFO             PET01 --------------------------------> to: 1900  1  1  6  0  0   0
3889 20240725 113307.007 INFO             PET01 MOM_cap: Writing restart :  access-om3.mom6.r.1900-01-01-21600
3890 20240725 105500.943 INFO             PET01 (datm_comp_run) :ES: Faxa_lwdn  117.4488      310.7040      436511.8        2250
3891 20240725 105500.943 INFO             PET01 (datm_comp_run) :ES: Faxa_rainc  0.000000      0.000000      0.000000        2250
3892 20240725 105500.943 INFO             PET01 (datm_comp_run) :ES: Faxa_rainl  0.000000     0.1030642E-06 0.8495813E-05    2250
3893 20240725 105500.943 INFO             PET01 (datm_comp_run) :ES: Faxa_snowc  0.000000      0.000000      0.000000        2250
3894 20240725 105500.943 INFO             PET01 (datm_comp_run) :ES: Faxa_snowl  0.000000     0.9302876E-04 0.1242759E-01    2250
3895 20240725 105500.943 INFO             PET01 (datm_comp_run) :ES: Faxa_swdn  83.94517      772.3617      865575.6        2250
...
4159 20240725 105501.670 INFO             PET01 (med_phases_profile): done
4160 20240725 105504.995 INFO             PET01 (esm_finalize): called
4161 20240725 105504.995 INFO             PET01 (esm_finalize): done
4162 20240725 105505.003 INFO             PET01 ESMF_GridCompDestroy called
4163 20240725 105505.003 INFO             PET01 ESMF_GridCompDestroy finished
4164 20240725 105505.003 INFO             PET01 esmApp FINISHED
4165 20240725 105505.003 INFO             PET01 Finalizing ESMF
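
The two runs in the comparison table can be checked with a bit of arithmetic. The Python sketch below (the `StepConfig` class is illustrative, not part of any model code) counts dynamic steps, coupling intervals, and thermodynamic steps in each 6-hour run, and checks whether the run ends exactly on a thermodynamic step.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepConfig:
    dt: int        # dynamic timestep (s)
    dt_therm: int  # tracer/thermodynamic timestep (s)
    dt_cpl: int    # coupling interval (s)
    runtime: int   # run length (s)

    def counts(self) -> dict:
        """Step counts over the run and whether the run ends on a thermo step."""
        return {
            "dyn_steps": self.runtime // self.dt,
            "cpl_intervals": self.runtime // self.dt_cpl,
            "therm_steps": self.runtime // self.dt_therm,
            "cpl_per_therm": self.dt_therm / self.dt_cpl,
            "run_ends_on_therm_step": self.runtime % self.dt_therm == 0,
        }

SIX_HOURS = 6 * 3600  # 21600 s, the offset in the restart name ...-1900-01-01-21600
crashed = StepConfig(dt=1800, dt_therm=10800, dt_cpl=3600, runtime=SIX_HOURS)
worked = StepConfig(dt=1800, dt_therm=7200, dt_cpl=3600, runtime=SIX_HOURS)

if __name__ == "__main__":
    for name, cfg in (("Crashed", crashed), ("Worked", worked)):
        print(name, cfg.counts())
```

Both configurations have 12 dynamic steps and 6 coupling intervals, and both end exactly on a thermodynamic step (2 steps for the crashed run, 3 for the working run). The restart time alone therefore does not separate the two cases.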

For how long are you running the model?

I’ve tested various run lengths, from a couple of hours to days, months, and years. The run length does not appear to affect this issue.

Does the issue go away if you add ocn_cpl_dt to the run length?

Unfortunately, this doesn’t seem to help.