COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

Sudden crashes with excessive SSH #55

Open aekiss opened 1 year ago

aekiss commented 1 year ago

MOM6-CICE6 1° configs are crashing after running for several weeks or months. Excessively large SSH appears in less than 1 day, without unusual wind stress; see https://github.com/COSIMA/MOM6-CICE6/pull/5#issuecomment-1665023553 and https://github.com/COSIMA/MOM6-CICE6/pull/5#issuecomment-1676484356

aekiss commented 11 months ago

The latest commit on the 1deg_jra55do_ryf branch of MOM6-CICE6 crashes after model date = 0001-10-12T00:00:00 with

WARNING from PE    31: Extreme surface sfc_state detected: i= 329 j= 194 lon=  48.500 lat=  26.524 x=  48.500 y=  26.524 D= 1.1806E+01 SSH= 1.0551E+01 SST= 2.5600E+01 SSS= 4.5001E+01 U-= 0.0000E+00 U+=-1.0853E-02 V-= 0.0000E+00 V+= 7.3564E-03

This is at the head of the Persian Gulf. This crash seems nearly identical to the previous test (same location and date, nearly the same SSH): https://github.com/COSIMA/MOM6-CICE6/pull/5#issuecomment-1676484356. Run dir: /home/156/aek156/payu/MOM6-CICE6-1deg_jra55do_ryf
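For comparing these warnings across runs, it can help to pull the numeric fields out of the log line. The following is a minimal Python sketch, not part of MOM6 or payu: the function name `parse_sfc_warning` is hypothetical, and it assumes the warning sits on a single line with the `key= value` layout shown above.

```python
import re

# Matches "key= value" pairs such as "SSH= 1.0551E+01" or "U+=-1.0853E-02".
# Keys may end in "+" or "-" (e.g. "U-", "V+"), so those suffixes are allowed.
_PAIR = re.compile(r"([A-Za-z]+[+-]?)=\s*([-+]?\d+(?:\.\d+)?(?:[Ee][-+]?\d+)?)")


def parse_sfc_warning(line: str) -> dict:
    """Return the numeric fields of an 'Extreme surface sfc_state' warning."""
    if "Extreme surface sfc_state detected" not in line:
        raise ValueError("not an extreme-surface-state warning")
    fields = {}
    for key, value in _PAIR.findall(line):
        # i and j are grid indices; every other field is parsed as a float.
        fields[key] = int(value) if key in ("i", "j") else float(value)
    return fields


line = (
    "WARNING from PE    31: Extreme surface sfc_state detected: i= 329 j= 194 "
    "lon=  48.500 lat=  26.524 x=  48.500 y=  26.524 D= 1.1806E+01 "
    "SSH= 1.0551E+01 SST= 2.5600E+01 SSS= 4.5001E+01 U-= 0.0000E+00 "
    "U+=-1.0853E-02 V-= 0.0000E+00 V+= 7.3564E-03"
)
state = parse_sfc_warning(line)
print(state["i"], state["j"], state["D"], state["SSH"], state["SSS"])
```

Parsed this way, the reported SSH (about 10.55) can be compared directly with the depth D (about 11.81) at the same grid point, and with the values from other runs.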

aekiss commented 11 months ago

Changing DTBT from -0.95 to -0.5 (roughly halving the barotropic timestep) makes no difference:

WARNING from PE    31: Extreme surface sfc_state detected: i= 329 j= 194 lon=  48.500 lat=  26.524 x=  48.500 y=  26.524 D= 1.1806E+01 SSH= 1.0589E+01 SST= 2.5610E+01 SSS= 4.5001E+01 U-= 0.0000E+00 U+=-1.0604E-02 V-= 0.0000E+00 V+= 6.4269E-03

aekiss commented 11 months ago

Also crashes identically with the latest ACCESS-OM3 commit 377c1fc (unsurprising, as this just adds the GPTL timing library).

aekiss commented 10 months ago

The 1deg_jra55do_ryf and 1deg_jra55do_iaf configs of MOM6-CICE6 run happily with more lenient surface checks, using values from mom6-om4-025/MOM_input (RH column) instead of defaults (LH column):

Variable | archive/output008/MOM_parameter_doc.all | archive/output009/MOM_parameter_doc.all
-- | -- | --
bad_val_ssh_max | 20.0 | 50.0
bad_val_sss_max | 45.0 | 75.0
bad_val_sst_max | 45.0 | 55.0
bad_val_sst_min | -2.1 | -3.0
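
To see which limit a given surface state would trip, the bounds in the table can be applied to the values from the warning reported earlier in this thread. This is a simplified Python sketch that assumes each `bad_val_*` parameter acts as a plain bound on the corresponding field; the actual check in MOM6 may include further conditions, and the function name `exceeded_limits` is hypothetical.

```python
# Limits from the two MOM_parameter_doc.all files in the table above.
DEFAULT_LIMITS = {"ssh_max": 20.0, "sss_max": 45.0, "sst_max": 45.0, "sst_min": -2.1}
LENIENT_LIMITS = {"ssh_max": 50.0, "sss_max": 75.0, "sst_max": 55.0, "sst_min": -3.0}


def exceeded_limits(ssh, sst, sss, limits):
    """List the limits a surface state violates (assumed to be simple bounds)."""
    hits = []
    if abs(ssh) >= limits["ssh_max"]:
        hits.append("bad_val_ssh_max")
    if sss >= limits["sss_max"]:
        hits.append("bad_val_sss_max")
    if sst >= limits["sst_max"]:
        hits.append("bad_val_sst_max")
    if sst < limits["sst_min"]:
        hits.append("bad_val_sst_min")
    return hits


# SSH, SST and SSS copied from the warning at i=329, j=194.
state = {"ssh": 10.551, "sst": 25.600, "sss": 45.001}
print(exceeded_limits(**state, limits=DEFAULT_LIMITS))
print(exceeded_limits(**state, limits=LENIENT_LIMITS))
```

Under these assumptions the logged state exceeds only the default salinity bound (SSS = 45.001 against 45.0), while SSH = 10.551 is below both SSH limits. This is worth confirming against the MOM6 source before drawing conclusions.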

ezhilsabareesh8 commented 9 months ago

The MOM6-CICE6-WWIII configuration crashes at the same location as MOM6-CICE6, after reaching MOM date 1/10/08 00:00:00. The more lenient SSH and SST limits mentioned above are not implemented in this configuration yet.

WARNING from PE 31: Extreme surface sfc_state detected: i= 329 j= 194 lon= 48.500 lat= 26.524 x= 48.500 y= 26.524 D= 1.1806E+01 SSH= 1.0461E+01 SST= 2.5886E+01 SSS= 4.5002E+01 U-= 0.0000E+00 U+=-4.0502E-02 V-= 0.0000E+00 V+=-2.2684E-02

aekiss commented 6 months ago

Using the more lenient checks from mom6-om4-025 allows the MOM6-CICE6 1° run to proceed for at least 2 years with no issues.

access-hive-bot commented 6 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/namelist-configuration-discussion-meeting/1917/9

aekiss commented 4 months ago

Maybe fixing this will help? https://github.com/COSIMA/access-om3/issues/164

ezhilsabareesh8 commented 4 months ago

> Maybe fixing this will help? #164

Thanks @aekiss. I am currently experiencing crashes in the MOM6-CICE6 1 deg IAF and RYF configs (main branch) after 3-4 months of model runtime. Each failure appears to have a different cause (a few error logs are listed below).

Test experiment 1 - IAF 1deg

WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered

Image              PC                Routine            Line        Source
libpthread-2.28.s  00001540FACF5CF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000039D6EBC  diag_manager_mod_        3234  diag_manager.F90
access-om3-MOM6-C  00000000039BDF18  diag_manager_mod_        1466  diag_manager.F90
access-om3-MOM6-C  000000000350D06E  mom_diag_manager_         348  MOM_diag_manager_infra.F90
access-om3-MOM6-C  0000000003095BAE  mom_diag_mediator        1784  MOM_diag_mediator.F90
access-om3-MOM6-C  0000000003094202  mom_diag_mediator        1625  MOM_diag_mediator.F90
access-om3-MOM6-C  00000000035A1AAD  mom_dynamics_spli        1051  MOM_dynamics_split_RK2.F90
access-om3-MOM6-C  0000000002E49A33  mom_mp_step_mom_d        1173  MOM.F90
access-om3-MOM6-C  0000000002E4058B  mom_mp_step_mom_          853  MOM.F90
access-om3-MOM6-C  0000000002E1496D  mom_ocean_model_n         633  mom_ocean_model_nuopc.F90
access-om3-MOM6-C  0000000002D3505D  mom_cap_mod_mp_mo        1759  mom_cap.F90

Test experiment 2 - RYF 1 deg

WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered
WARNING from PE     0: diag_util_mod::opening_file: module/field_name (ocean_model_z/N2_int) NOT registered

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread-2.28.s  00001482944AFCF0  Unknown               Unknown  Unknown
access-om3-MOM6-C  00000000037C6D16  mom_vert_friction        1713  MOM_vert_friction.F90
access-om3-MOM6-C  0000000003597CC4  mom_dynamics_spli         581  MOM_dynamics_split_RK2.F90
access-om3-MOM6-C  0000000002E49A33  mom_mp_step_mom_d        1173  MOM.F90
access-om3-MOM6-C  0000000002E4058B  mom_mp_step_mom_          853  MOM.F90
access-om3-MOM6-C  0000000002E1496D  mom_ocean_model_n         633  mom_ocean_model_nuopc.F90
access-om3-MOM6-C  0000000002D3505D  mom_cap_mod_mp_mo        1759  mom_cap.F90
access-om3-MOM6-C  00000000020A73BF  _ZNK5ESMCI13Metho         377  ESMCI_MethodTable.C
access-om3-MOM6-C  00000000020A7338  _ZN5ESMCI11Method         563  ESMCI_MethodTable.C
access-om3-MOM6-C  00000000020A5DBB  c_esmc_methodtabl         317  ESMCI_MethodTable.C
access-om3-MOM6-C  0000000000DFD539  esmf_attachmethod        1287  ESMF_AttachMethods.F90
access-om3-MOM6-C  0000000004B83C92  nuopc_modelbase_m        2212  NUOPC_ModelBase.F90

aekiss commented 3 months ago

@ezhilsabareesh8 is this crashing even with more lenient checks?

ezhilsabareesh8 commented 3 months ago

> @ezhilsabareesh8 is this crashing even with more lenient checks?

Thanks @aekiss. With the recent changes setting Z_INIT_REMAP_GENERAL = True and MAX_DELTA_SRESTORE = 999.0, the 1-degree MOM6-CICE6 IAF configuration now runs for 3 years without crashing, even without the lenient checks.

ezhilsabareesh8 commented 3 months ago

> Test experiment 2 - RYF 1 deg
>
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine            Line        Source
> libpthread-2.28.s  00001482944AFCF0  Unknown               Unknown  Unknown
> access-om3-MOM6-C  00000000037C6D16  mom_vert_friction        1713  MOM_vert_friction.F90
> access-om3-MOM6-C  0000000003597CC4  mom_dynamics_spli         581  MOM_dynamics_split_RK2.F90

The RYF 1-degree MOM6-CICE6 configuration still crashes with the above error. However, the MOM_input files of the IAF and RYF configurations differ significantly, which may explain why RYF crashes but IAF does not. The IAF MOM_input is outdated and needs to be updated.


variable | MOM_input_one_deg_RYF | MOM_input_one_deg_IAF
-- | -- | --
adjust_net_srestore_to_zero |   | TRUE
ah_vel_scale |   | 0
bbl_use_eos |   | TRUE
bt_thick_scheme |   | FROM_BT_CONT
cfc_bc_file |   | cfc_atm_20230310.nc
coord_config |   | none
debug |   | FALSE
default_2018_answers |   | FALSE
depth_scaled_khth |   | FALSE
energysavedays |   | 1
eqn_of_state |   | WRIGHT
fatal_unused_params | TRUE |  
fix_ustar_gustless_bug |   | TRUE
gill_equatorial_ld |   | TRUE
grid_rotation_angle_bugs |   | FALSE
hmix_min |   | 2
int_tide_decay_scale |   | 300.3003003003003
interp_type2 |   | LMD94
interpolate_res_fn |   | FALSE
kappa_shear_all_layer_tke_bug |   | FALSE
kappa_shear_iter_bug |   | FALSE
kdml |   | 0
kh_vel_scale |   | 0
khth |   | 0
khth_max |   | 0
khtr_max |   | 0
mask_srestore_under_ice |   | FALSE
max_ent_it |   | 20
max_rino_it |   | 25
maxtrunc |   | 0
min_salinity |   | 0
nihalo |   | 4
njhalo |   | 4
prandtl_turb |   | 1
remap_uv_using_old_alg |   | FALSE
simple_tke_to_kd |   | TRUE
smag_bi_const |   | 0.06
tolerance_ent |   | 1e-05
topo_file |   | topog.nc
use_cfc_cap |   | FALSE
use_contemp_abssal |   | FALSE
use_gm_work_bug |   | FALSE
use_land_mask_for_hvisc |   | TRUE
use_psurf_in_eos |   | TRUE
visc_res_scale_coef |   | 0.4
z_init_file_salt_var |   | salt
z_init_remap_old_alg |   | FALSE
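
A comparison like the one above could be regenerated with a short script. The following Python sketch assumes the simple `NAME = value ! comment` line layout and ignores parameter blocks, `#override` lines and other directives; the function names `parse_params` and `diff_params` are hypothetical, and the `DT` entry in the sample input is illustrative rather than taken from either configuration.

```python
def parse_params(text: str) -> dict:
    """Parse 'NAME = value ! comment' lines into {lowercase_name: value}."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comments
        if "=" not in line or line.startswith("#"):
            continue  # skip blank lines, block markers and directives
        name, value = line.split("=", 1)
        params[name.strip().lower()] = value.strip().strip('"')
    return params


def diff_params(a: dict, b: dict) -> list:
    """Return sorted (name, value_in_a, value_in_b) rows where a and b differ.

    A parameter missing from one file is reported as None on that side.
    """
    rows = []
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            rows.append((name, a.get(name), b.get(name)))
    return rows


# Tiny stand-ins for the two MOM_input files (DT is an illustrative value).
ryf = parse_params("FATAL_UNUSED_PARAMS = True\nDT = 1800.0 ! [s]\n")
iaf = parse_params('DEBUG = False\nDT = 1800.0\nTOPO_FILE = "topog.nc"\n')

print("variable | RYF | IAF")
for name, ryf_value, iaf_value in diff_params(ryf, iaf):
    print(f"{name} | {ryf_value or ''} | {iaf_value or ''}")
```

Parameters set identically in both files (here `DT`) are omitted, so only the entries that differ or appear in one file alone are listed.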