NOAA-GFDL / GFDL_atmos_cubed_sphere

The GFDL atmos_cubed_sphere dynamical core code

sporadic floating point errors in a2b_edge.F90 for regional configurations #346

Closed SamuelTrahanNOAA closed 3 months ago

SamuelTrahanNOAA commented 4 months ago

Describe the bug

Regional configurations of UFS FV3 abort sporadically with a floating-point exception in subroutine a2b_ord2 in FV3/atmos_cubed_sphere/model/a2b_edge.F90 when compiled in debug mode. The crash is here:

    if (gridstruct%grid_type < 3) then

       if (gridstruct%bounded_domain) then

          do j=js-2,je+1+2   
             do i=is-2,ie+1+2
                qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j)) ! <------- crashes here
             enddo
          enddo

       else

Full stack trace

```
112:
112: WARNING from PE 112: atmos_modeldefine_blocks_packed: domain ( 33 19) is not an even divisor with definition ( 32) - blocks will not be uniform with a remainder of 19
112:
112: [h11c41:455655:0:455655] Caught signal 8 (Floating point exception: floating-point invalid operation)
112: ==== backtrace (tid: 455655) ====
112: 0 0x00000000000534e9 ucs_debug_print_backtrace() ???:0
112: 1 0x0000000000012cf0 __funlockfile() :0
112: 2 0x0000000004ba5714 a2b_edge_mod_mp_a2b_ord2_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/a2b_edge.F90:382
112: 3 0x0000000002bccce6 L_dyn_core_mod_mp_adv_pe__1630__par_loop0_2_108() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1665
112: 4 0x000000000013fbb3 __kmp_invoke_microtask() ???:0
112: 5 0x00000000000bbfac __kmp_fork_call() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:2111
112: 6 0x000000000007dcb5 __kmpc_fork_call() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_csupport.cpp:358
112: 7 0x0000000002bc674f dyn_core_mod_mp_adv_pe_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1630
112: 8 0x0000000002b689ea dyn_core_mod_mp_dyn_core_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/dyn_core.F90:1280
112: 9 0x0000000002ce48d4 fv_dynamics_mod_mp_fv_dynamics_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:683
112: 10 0x00000000028bd928 atmosphere_mod_mp_atmosphere_dynamics_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:683
112: 11 0x00000000020b079c atmos_model_mod_mp_update_atmos_model_dynamics_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/atmos_model.F90:880
112: 12 0x0000000001b4014c module_fcst_grid_comp_mp_fcst_run_phase_1_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/module_fcst_grid_comp.F90:1330
112: 13 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 14 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 15 0x000000000094dbea ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1247
112: 16 0x000000000121eeaf ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 17 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 18 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 19 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 20 0x0000000001b0b54e fv3atm_cap_mod_mp_modeladvance_phase1_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1077
112: 21 0x0000000001b0a615 fv3atm_cap_mod_mp_modeladvance_() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/FV3/fv3_cap.F90:1026
112: 22 0x00000000006aba58 ESMCI::MethodElement::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 23 0x00000000006ab9ba ESMCI::MethodTable::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 24 0x00000000006aa582 c_esmc_methodtableexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 25 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 26 0x0000000004e0e71d nuopc_modelbase_mp_routine_run_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2212
112: 27 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 28 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 29 0x000000000094d9da ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 30 0x000000000121eeaf ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 31 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 32 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 33 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 34 0x00000000008d1317 nuopc_driver_mp_routine_executegridcomp_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3694
112: 35 0x00000000008d0b6a nuopc_driver_mp_executerunsequence_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3940
112: 36 0x00000000006aba58 ESMCI::MethodElement::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
112: 37 0x00000000006ab9ba ESMCI::MethodTable::execute() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
112: 38 0x00000000006aa582 c_esmc_methodtableexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
112: 39 0x000000000047c492 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
112: 40 0x00000000008cdbb2 nuopc_driver_mp_routine_run_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3615
112: 41 0x0000000000aa2644 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
112: 42 0x0000000000aa61ef ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
112: 43 0x000000000094d9da ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
112: 44 0x000000000121eeaf ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
112: 45 0x0000000000aa3a8a c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
112: 46 0x0000000000970d50 esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
112: 47 0x0000000000ca5351 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-nnuwc5zlpvogeiuk3nec26eryjiwsopw/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1903
112: 48 0x000000000042fae6 MAIN__() /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-pr/update-upp-20240612/number-concentration/driver/UFS.F90:406
112: 49 0x000000000042bfa2 main() ???:0
112: 50 0x000000000003ad85 __libc_start_main() ???:0
112: 51 0x000000000042beae _start() ???:0
112: =================================
112: forrtl: error (75): floating point exception
112: Image              PC                Routine            Line     Source
112: fv3.exe            000000000C1EE34B  Unknown            Unknown  Unknown
112: libpthread-2.28.s  0000150AC4D0BCF0  Unknown            Unknown  Unknown
112: fv3.exe            0000000004BA5714  a2b_edge_mod_mp_a  382      a2b_edge.F90
112: fv3.exe            0000000002BCCCE6  dyn_core_mod_mp_a  1665     dyn_core.F90
112: libiomp5.so        0000150AC7D74BB3  __kmp_invoke_micr  Unknown  Unknown
112: libiomp5.so        0000150AC7CF0FAC  __kmp_fork_call    Unknown  Unknown
112: libiomp5.so        0000150AC7CB2CB5  __kmpc_fork_call   Unknown  Unknown
112: fv3.exe            0000000002BC674F  dyn_core_mod_mp_a  1630     dyn_core.F90
112: fv3.exe            0000000002B689EA  dyn_core_mod_mp_d  1280     dyn_core.F90
112: fv3.exe            0000000002CE48D4  fv_dynamics_mod_m  683      fv_dynamics.F90
112: fv3.exe            00000000028BD928  atmosphere_mod_mp  683      atmosphere.F90
112: fv3.exe            00000000020B079C  atmos_model_mod_m  880      atmos_model.F90
112: fv3.exe            0000000001B4014C  module_fcst_grid_  1330     module_fcst_grid_comp.F90
112: fv3.exe            0000000000AA2644  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown            Unknown  Unknown
112: fv3.exe            000000000094DBEA  Unknown            Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown            Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown            Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown            Unknown  Unknown
112: fv3.exe            0000000001B0B54E  fv3atm_cap_mod_mp  1077     fv3_cap.F90
112: fv3.exe            0000000001B0A615  fv3atm_cap_mod_mp  1026     fv3_cap.F90
112: fv3.exe            00000000006ABA58  Unknown            Unknown  Unknown
112: fv3.exe            00000000006AB9BA  Unknown            Unknown  Unknown
112: fv3.exe            00000000006AA582  Unknown            Unknown  Unknown
112: fv3.exe            000000000047C492  Unknown            Unknown  Unknown
112: fv3.exe            0000000004E0E71D  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA2644  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown            Unknown  Unknown
112: fv3.exe            000000000094D9DA  Unknown            Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown            Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown            Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown            Unknown  Unknown
112: fv3.exe            00000000008D1317  Unknown            Unknown  Unknown
112: fv3.exe            00000000008D0B6A  Unknown            Unknown  Unknown
112: fv3.exe            00000000006ABA58  Unknown            Unknown  Unknown
112: fv3.exe            00000000006AB9BA  Unknown            Unknown  Unknown
112: fv3.exe            00000000006AA582  Unknown            Unknown  Unknown
112: fv3.exe            000000000047C492  Unknown            Unknown  Unknown
112: fv3.exe            00000000008CDBB2  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA2644  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA61EF  Unknown            Unknown  Unknown
112: fv3.exe            000000000094D9DA  Unknown            Unknown  Unknown
112: fv3.exe            000000000121EEAF  Unknown            Unknown  Unknown
112: fv3.exe            0000000000AA3A8A  Unknown            Unknown  Unknown
112: fv3.exe            0000000000970D50  Unknown            Unknown  Unknown
112: fv3.exe            0000000000CA5351  Unknown            Unknown  Unknown
112: fv3.exe            000000000042FAE6  MAIN__             406      UFS.F90
112: fv3.exe            000000000042BFA2  Unknown            Unknown  Unknown
112: libc-2.28.so       0000150AC4756D85  __libc_start_main  Unknown  Unknown
112: fv3.exe            000000000042BEAE  Unknown            Unknown  Unknown
```

The crash is a floating-point exception. There are only additions and multiplications, so the exception is probably from a NaN. This could be due to uninitialized memory, or due to not filling boundary conditions (which are initialized with signalling NaN).
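The reasoning above can be illustrated outside the model. The following is a minimal Python sketch, not the model's Fortran: the array size, the one-cell halo, and every name in it are hypothetical. It applies the same four-point average to a field whose interior is filled but whose halo still holds NaN. Python's NaN is quiet, so nothing traps here; in a debug build with floating-point traps enabled, an operation on a signalling NaN would raise the invalid-operation exception instead of propagating silently.

```python
import math

def four_point_average(qin, i, j):
    """Same arithmetic as the crashing line: mean of the four cells around corner (i, j)."""
    return 0.25 * (qin[i - 1][j - 1] + qin[i][j - 1] + qin[i - 1][j] + qin[i][j])

# Hypothetical 6x6 field: a 4x4 interior (indices 1..4) surrounded by a
# one-cell halo that was never filled and therefore still holds NaN.
n = 6
qin = [[math.nan] * n for _ in range(n)]
for i in range(1, n - 1):
    for j in range(1, n - 1):
        qin[i][j] = float(i + j)

# Corner (2, 2) touches only interior cells, so the result is finite.
interior = four_point_average(qin, 2, 2)

# Corner (1, 1) touches unfilled halo cells, so NaN propagates.
edge = four_point_average(qin, 1, 1)

print(interior, math.isnan(edge))
```

With only additions and one multiplication, a single unfilled input is enough to poison the output, which is consistent with the guess that the inputs, not the arithmetic, are at fault.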

Crashes seem to start after #344 was merged. If so, that PR shouldn't have been merged; the regression test system should've detected this problem. Unfortunately, the ufs-weather-model regression test system is presently unable to detect the difference between a crash and a test's results changing. A fix for the regression test system bug is being tested now.

Unfortunately, we're stuck with broken authoritative branches until this bug is fixed.

From skimming the changes in #344, my best guess is that some parts of the omga array are uninitialized for regional cases due to removing the initialization loop. I haven't had a chance to test that hypothesis yet.

To Reproduce

  1. On Hera, set up the ufs-weather-model regression test system so that it does not retry jobs and does not delete logs or run directories.
  2. Run all ufs-weather-model regression tests that have both "conus13km" and "debug" in their name.
  3. Check for floating point exceptions in failed tests before the regression test system deletes the logs.
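Step 3 can be scripted. The sketch below is a standalone Python helper, not part of the regression test system. It walks a directory tree and lists the files that contain the text "floating point exception", which appears (ignoring case) in both the "Caught signal 8" line and the "forrtl: error (75)" line of the trace above. The directory layout and file names are assumptions; the demo builds its own temporary tree so that it runs anywhere.

```python
import tempfile
from pathlib import Path

PATTERN = "floating point exception"  # matched case-insensitively

def find_fpe_logs(root):
    """Return sorted paths (relative to root) of files mentioning a floating point exception."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Logs can contain stray bytes; ignore decode errors instead of failing.
        text = path.read_text(errors="ignore")
        if PATTERN in text.lower():
            hits.append(str(path.relative_to(root)))
    return sorted(hits)

# Demo with made-up run directories and log names.
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a"
    run_b = Path(tmp) / "run_b"
    run_a.mkdir()
    run_b.mkdir()
    (run_a / "err").write_text("112: forrtl: error (75): floating point exception\n")
    (run_b / "err").write_text("PASS\n")
    demo_hits = find_fpe_logs(tmp)

print(demo_hits)
```

On a real run one would call `find_fpe_logs` on the regression test run directory before the logs are cleaned up.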

The fix for the regression test system is in this PR:

That is being tested now. Once it's merged, model crashes will be detectable in regression tests once again.

Expected behavior

Model runs to completion when compiled in debug mode.

System Environment

UFS Weather Model regression test system with the Intel compiler on Hera. That's Intel 2021.5.0 with IMPI 2021.5.1 and FMS 2023.04 using Spack Stack 1.6.0.

Here's the uname -a output from a login node:

Linux hfe09 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional context

Can't think of anything.

jkbk2004 commented 4 months ago

@SamuelTrahanNOAA I am seeing the same bug behavior with https://github.com/ufs-community/ufs-weather-model/pull/2362. It points to #344.

lharris4 commented 4 months ago

Hi, all. This PR only makes changes to the diagnostic omega and not to any of the prognostic variables so there is probably some loop index or bounds error. Does debug mode include bounds checking? Also check the subroutine calls closely to make sure array bounds are set up correctly.

Thanks, Lucas

SamuelTrahanNOAA commented 4 months ago

Yes, debug mode includes bounds checking and tests for various floating-point errors.

In most cases, the dynamical core fails in debug mode in regional tests, where it aborts due to floating point exceptions. Many debug tests are already disabled because of existing unknown bugs. We really can't afford to disable the remaining tests due to new bugs.

lharris4 commented 4 months ago

I am confused now. Is this in the global-nested configuration (as the title suggests), or only in regional domains?

SamuelTrahanNOAA commented 4 months ago

> I am confused now. Is this in the global-nested configuration (as the title suggests), or only in regional domains?

Regional configurations. I've corrected the title; sorry about that.

SamuelTrahanNOAA commented 4 months ago

These are the last three tests that have failed for me:

The only commonalities I see are:

jkbk2004 commented 4 months ago

@SamuelTrahanNOAA's list is correct. I am still running on Derecho and Gaea to double-check. If https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/commit/577fd5e487bb01d20cc44c84741f5b1d24e9c4ab is going to be reverted, then I can turn off https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36

XiaqiongZhou-NOAA commented 4 months ago

I am a little confused about why https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/344 is the issue. To clarify: First of all: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 is not related to https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/commit/577fd5e487bb01d20cc44c84741f5b1d24e9c4ab. Second, https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/344 does not change the result. Is this commit causing the model crash?

I do not think reverting the dycore update is an option. We need this update for GFSv17/GEFSv13. It is better to identify what is really causing the problem.

jkbk2004 commented 4 months ago

@XiaqiongZhou-NOAA my mistake! https://github.com/ufs-community/ufs-weather-model/pull/2327 doesn't change any test case. @lharris4 @SamuelTrahanNOAA https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/commit/577fd5e487bb01d20cc44c84741f5b1d24e9c4ab can be reverted without any change at the UFS-WM level.

SamuelTrahanNOAA commented 4 months ago

> Second, https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/344 does not change the result.

It doesn't change the result when the job succeeds.

The problem is that the job no longer succeeds reliably now that #344 is merged.

SamuelTrahanNOAA commented 4 months ago

> First of all: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 is not related to https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/commit/577fd5e487bb01d20cc44c84741f5b1d24e9c4ab.

I don't know why @jkbk2004 mentioned that test, but it is not one of the ones that is failing for me.

jkbk2004 commented 4 months ago

> > First of all: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L35-L36 is not related to 577fd5e.
>
> I don't know why @jkbk2004 mentioned that test, but it is not one of the ones that is failing for me.

@SamuelTrahanNOAA I was confused. @lharris4 @laurenchilutti @bensonr Can we make a decision to revert https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/commit/577fd5e487bb01d20cc44c84741f5b1d24e9c4ab ?

SamuelTrahanNOAA commented 4 months ago

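@XiaqiongZhou-NOAA These are the only tests that fail for me:

  • conus13km_debug_intel
  • conus13km_debug_2threads_intel
  • hafs_regional_storm_following_1nest_atm_ocn_debug_intel

I explain in detail in my comment https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/issues/346#issuecomment-2223583744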

laurenchilutti commented 4 months ago

If you would like this reverted, we should do it via a PR so you can rerun the UFS tests. If Lucas and Rusty agree, I can put in a PR with this Merge being reverted for you to test.

XiaqiongZhou-NOAA commented 4 months ago

> If you would like this reverted, we should do it via a PR so you can rerun the UFS tests. If Lucas and Rusty agree, I can put in a PR with this Merge being reverted for you to test.

Lauren: Please hold this.

> @XiaqiongZhou-NOAA These are the only tests that fail for me:
>
>   • conus13km_debug_intel
>   • conus13km_debug_2threads_intel
>   • hafs_regional_storm_following_1nest_atm_ocn_debug_intel
>
> I explain in detail in my comment #346 (comment)

@SamuelTrahanNOAA These tests run OK for me on Hercules. How can I reproduce your failed cases? What else needs to be changed?

SamuelTrahanNOAA commented 4 months ago

> These tests run OK for me on Hercules. How can I reproduce your failed cases? What else needs to be changed?

Try running on HERA. I haven't tested this on Hercules, so I don't know if it'll fail there. Uninitialized memory and out-of-bounds accesses can be troublesome like that. Change one little thing, and the contents of that memory are different.

jkbk2004 commented 4 months ago

@SamuelTrahanNOAA I ran on Hera/Hercules/Gaea/Derecho. The behavior is random, but those three cases commonly crash at the same place: line 382 of atmos_cubed_sphere/model/a2b_edge.F90.

SamuelTrahanNOAA commented 4 months ago

I've tried two changes:

  1. Default pass_full_omega_to_physics_in_non_hydrostatic_mode to .true. With this change, hafs_regional_storm_following_1nest_atm_ocn_debug_intel failed the first try in a2b_edge.F90. The other two tests succeeded on the first try (but the results changed).
  2. Restore the initialization loop on line 826 which sets omga(i,j,k) = delp(i,j,k)/delz(i,j,k)*w(i,j,k). With this change, all three tests fail reliably in the usual way.

EDIT: Updated comment to reflect that in item 1, the results changed for the two jobs that ran to completion.

DusanJovic-NOAA commented 4 months ago

@SamuelTrahanNOAA Can you try this change in a2b_edge.F90

```diff
diff --git a/model/a2b_edge.F90 b/model/a2b_edge.F90
index c4530a1..0c5de7e 100644
--- a/model/a2b_edge.F90
+++ b/model/a2b_edge.F90
@@ -377,8 +377,8 @@ contains

        if (gridstruct%bounded_domain) then

-          do j=js-2,je+1+2
-             do i=is-2,ie+1+2
+          do j=js,je+1
+             do i=is,ie+1
                 qout(i,j) = 0.25*(qin(i-1,j-1)+qin(i,j-1)+qin(i-1,j)+qin(i,j))
              enddo
           enddo
diff --git a/model/dyn_core.F90 b/model/dyn_core.F90
index 15df82f..f469e30 100644
--- a/model/dyn_core.F90
+++ b/model/dyn_core.F90
@@ -166,6 +166,12 @@ public :: dyn_core, del2_cubed, init_ijk_mem
   integer :: kmax=1
   real, parameter    ::     rad2deg = 180./pi

+#ifdef OVERLOAD_R4
+  real, parameter:: real_snan=real(Z'FFBFFFFF')
+#else
+  real, parameter:: real_snan=real(Z'FFF7FFFFFFFFFFFF')
+#endif
+
 contains

 !-----------------------------------------------------------------------
@@ -1627,6 +1633,9 @@ integer :: is,  ie,  js,  je
       js  = bd%js
       je  = bd%je

+      pin = real_snan
+      pb = real_snan
+
 !$OMP parallel do default(none) shared(is,ie,js,je,npz,ua,va,gridstruct,pem,npx,npy,ng,om) &
 !$OMP                          private(n, pdx, pdy, pin, pb, up, vp, grad, v3)
 do k=1,npz
```
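The two hunks in this patch can be checked with plain index and bit arithmetic. The sketch below is Python rather than Fortran, and the helper names and the example values of `is_` and `ie` are hypothetical. The first part works out which `qin` indices the four-point stencil reads under the old and new loop bounds. The second part confirms that the two hex constants are signalling-NaN bit patterns. Whether the outer cells are filled at this call site is not established in this thread; the sketch only shows how far the loop reaches.

```python
def read_range(lo, hi):
    """Indices of qin read along one axis when the loop runs lo..hi inclusive.

    qout(i) uses qin(i-1) and qin(i), so the reads span lo-1 .. hi.
    """
    return lo - 1, hi

# Hypothetical compute-domain bounds along one axis (any values work).
is_, ie = 1, 32

old_reads = read_range(is_ - 2, ie + 1 + 2)   # bounds before the patch
new_reads = read_range(is_, ie + 1)           # bounds after the patch

# Cells needed outside the compute domain: (below is_, above ie).
old_halo = (is_ - old_reads[0], old_reads[1] - ie)
new_halo = (is_ - new_reads[0], new_reads[1] - ie)

def is_signalling_nan(bits, exp_bits, frac_bits):
    """True if an IEEE 754 binary bit pattern encodes a signalling NaN.

    A NaN has all exponent bits set and a nonzero fraction; in the usual
    encoding it is signalling when the top fraction bit is 0.
    """
    frac = bits & ((1 << frac_bits) - 1)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    quiet_bit = frac >> (frac_bits - 1)
    return exp == (1 << exp_bits) - 1 and frac != 0 and quiet_bit == 0

snan32 = is_signalling_nan(0xFFBFFFFF, exp_bits=8, frac_bits=23)
snan64 = is_signalling_nan(0xFFF7FFFFFFFFFFFF, exp_bits=11, frac_bits=52)

print(old_halo, new_halo, snan32, snan64)
```

Under the old bounds the stencil reads three cells beyond the compute domain on each side; under the new bounds it reads one.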

SamuelTrahanNOAA commented 4 months ago

Dusan's fix worked for me. All three jobs succeeded the first time. Can other people confirm it works for them?

jkbk2004 commented 4 months ago

> Dusan's fix worked for me. All three jobs succeeded the first time. Can other people confirm it works for them?

@SamuelTrahanNOAA let me test on gaea/hercules/hera.

jkbk2004 commented 4 months ago

All of those cases pass on Hera/Hercules/Gaea/Derecho.

PASS -- TEST 'conus13km_debug_intel' [17:58, 14:25](1242 MB)
PASS -- TEST 'conus13km_debug_qr_intel' [17:58, 14:48](919 MB)
PASS -- TEST 'conus13km_debug_2threads_intel' [10:53, 08:11](1165 MB)
PASS -- TEST 'hafs_regional_storm_following_1nest_atm_ocn_debug_intel' [19:03, 13:08](563 MB)

@DusanJovic-NOAA @SamuelTrahanNOAA Will you create a PR?

SamuelTrahanNOAA commented 4 months ago

I'd rather not do it since this is neither my fix nor my code, and I have too much going on already.

@DusanJovic-NOAA - Can you do the PR?

DusanJovic-NOAA commented 4 months ago

> I'd rather not do it since this is neither my fix nor my code, and I have too much going on already.
>
> @DusanJovic-NOAA - Can you do the PR?

Opened PR #349

bensonr commented 3 months ago

PR #349 merged into dev/emc