Are you running with any OpenMP threads? Are you calling monin_obukhov_solve_zeta from an OpenMP region? Is this crash repeatable (does it fail the same way every time)?
@JustinPerket The debug mode for FMS's CMake build was recently added; it's mainly for allowing the person building to set custom flags (it compiles with whatever is set in the CFLAGS and FCFLAGS environment variables). It doesn't add any flags on its own; it just takes them directly from CFLAGS and FCFLAGS.
With this build, since the build type is set to Debug, the automatically added flags aren't included, so it fails to compile: it's missing flags needed for the r4/r8 libraries, which results in an argument mismatch in those interface calls. I would compile with the other build type (Release) if compiling with r4/r8. Debug can be used, but the flags will need to be set manually, so you would have to do a separate compile for the r4/r8 libraries and also add in any debug flags via the environment variables.
We could potentially make this behave in a more standard way and just have it add in the expected debug flags.
I should probably back up and say that the top-level, main repo for this project is: https://github.com/JustinPerket/ufs-weather-model/tree/feature/LM4
LM4 is a submodule in it, with a NUOPC/ESMF driver.
And here is the temp branch of the LM4 NUOPC driver where I'm implementing some of the FMS surface boundary layer functionality into it: https://github.com/JustinPerket/lm4/tree/fix/boundary_layer
Here is an even more temporary branch that reproduces the error: https://github.com/JustinPerket/ufs-weather-model/tree/TMP/debug_modrag_crash
It's checked out on hera at /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4-foo
A model run generated with cd tests && ./rt.sh -k -l lm4_tests.conf produces the error.
A fuller trace of one of the threads is:
4: ==== backtrace (tid: 66251) ====
4: 0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
4: 1 0x00000000033bbc28 monin_obukhov_inter_mp_monin_obukhov_solve_zeta_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/fms-noaa-gfdl-2022.04/monin_obukhov/monin_obukhov_inter.F90:267
4: 2 0x00000000033ba56a monin_obukhov_inter_mp_monin_obukhov_drag_1d_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/fms-noaa-gfdl-2022.04/monin_obukhov/monin_obukhov_inter.F90:181
4: 3 0x0000000002f43a7a monin_obukhov_mod_mp_mo_drag_1d_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/fms-noaa-gfdl-2022.04/monin_obukhov/monin_obukhov.F90:214
4: 4 0x0000000000483d12 lm4_surface_flux_mod_mp_lm4_surface_flux_1d_() /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/nuopc_cap/lm4_surface_flux.F90:351
4: 5 0x0000000000449cc4 lm4_driver_mp_sfc_boundary_layer_() /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/nuopc_cap/lm4_driver.F90:562
4: 6 0x0000000000430d47 lm4_cap_mod_mp_modeladvance_() /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/nuopc_cap/lm4_cap.F90:452
4: 7 0x000000000167c82f ESMCI::MethodElement::execute() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
4: 8 0x000000000167c7b2 ESMCI::MethodTable::execute() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
4: 9 0x000000000167acb6 c_esmc_methodtableexecute_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
4: 10 0x000000000105af92 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:128
4: 11 0x000000000130ce50 nuopc_modelbase_mp_routine_run_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2220
4: 12 0x000000000111bfd4 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:2167
4: 13 0x000000000111ff56 ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:824
4: 14 0x00000000018a364f ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2308
4: 15 0x000000000183211a ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VM.C:1216
4: 16 0x000000000111d667 c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:981
4: 17 0x000000000102f90d esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_Comp.F90:1222
4: 18 0x00000000013368b6 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
4: 19 0x0000000000fc5397 nuopc_driver_mp_routine_executegridcomp_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
4: 20 0x0000000000fc4bec nuopc_driver_mp_executerunsequence_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
4: 21 0x000000000167c82f ESMCI::MethodElement::execute() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
4: 22 0x000000000167c7b2 ESMCI::MethodTable::execute() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
4: 23 0x000000000167acb6 c_esmc_methodtableexecute_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
4: 24 0x000000000105af92 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:128
4: 25 0x0000000000fc1542 nuopc_driver_mp_routine_run_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
4: 26 0x000000000111bfd4 ESMCI::FTable::callVFuncPtr() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:2167
4: 27 0x000000000111ff56 ESMCI_FTableCallEntryPointVMHop() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:824
4: 28 0x00000000018a364f ESMCI::VMK::enter() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2308
4: 29 0x000000000183211a ESMCI::VM::enter() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VM.C:1216
4: 30 0x000000000111d667 c_esmc_ftablecallentrypointvm_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:981
4: 31 0x000000000102f90d esmf_compmod_mp_esmf_compexecute_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_Comp.F90:1222
4: 32 0x00000000013368b6 esmf_gridcompmod_mp_esmf_gridcomprun_() /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
4: 33 0x000000000041bd70 MAIN__() /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/driver/UFS.F90:398
4: 34 0x0000000000418822 main() ???:0
4: 35 0x0000000000022555 __libc_start_main() ???:0
4: 36 0x0000000000418729 _start() ???:0
Within my top-level LM4 routine, a modified version of sfc_boundary_layer is called, which calls a very lightly modified version of surface_flux_1d. After that, it's using FMS modules, so when mo_drag is called, it's from monin_obukhov_mod.
Though the source path in the trace is no longer present, it looks like this: https://github.com/NOAA-GFDL/FMS/tree/2022.04/monin_obukhov
Are you running with any OpenMP threads? Are you calling monin_obukhov_solve_zeta from an OpenMP region? Is this crash repeatable (does it fail the same way every time)?
I'm not sure. OpenMP threading is enabled by default in UFS, but as far as I'm aware I'm not explicitly building with or using it. There is a UFS build option -DOPENMP=OFF, which I tried.
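One way to check this at runtime, as a hedged sketch rather than anything from the thread, is to query the OpenMP runtime near the call site (assuming the code is compiled with OpenMP enabled so that omp_lib is available; the routine name is made up for illustration):

! Hypothetical diagnostic: report whether the caller is inside an OpenMP
! parallel region and how many threads are in play.
subroutine report_omp_state(label)
  use omp_lib, only: omp_in_parallel, omp_get_max_threads, omp_get_thread_num
  implicit none
  character(*), intent(in) :: label
  print *, trim(label), ': in parallel region? ', omp_in_parallel(), &
           ' max threads = ', omp_get_max_threads(), &
           ' thread id = ', omp_get_thread_num()
end subroutine report_omp_state

Calling it just before the mo_drag call would show whether monin_obukhov_solve_zeta is being reached from inside a parallel region.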
@rem1776
@JustinPerket The debug mode for FMS's CMake build was recently added; it's mainly for allowing the person building to set custom flags (it compiles with whatever is set in the CFLAGS and FCFLAGS environment variables). It doesn't add any flags on its own; it just takes them directly from CFLAGS and FCFLAGS.
OK, that's what I thought it was doing.
With this build, since the build type is set to Debug, the automatically added flags aren't included, so it fails to compile: it's missing flags needed for the r4/r8 libraries, which results in an argument mismatch in those interface calls. I would compile with the other build type (Release) if compiling with r4/r8. Debug can be used, but the flags will need to be set manually, so you would have to do a separate compile for the r4/r8 libraries and also add in any debug flags via the environment variables.
OK, this may be a red herring then. I was hoping it would give some insight into why this crash only occurs when UFS is compiled with its debug flag.
@JustinPerket It could be that the debug flags are catching a divide by zero that's happening in both runs (Release and Debug). In standard Fortran you can divide real values by zero without an error; you just get an infinite value as a result. The debug build adds the -ftrapuv flag, which causes an error on floating-point divides by zero instead.
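As a minimal standalone sketch (not code from the thread, and using the standard IEEE modules rather than any particular compiler flag), the difference looks like this: by default a real divide by zero quietly yields Infinity, while enabling halting on the divide-by-zero flag, which is roughly what program-wide trapping flags such as -fpe0 do, turns the same division into a fatal error.

! Minimal sketch: silent Infinity vs. a fatal trap on divide by zero.
program div_by_zero_demo
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: zeta, rzeta

  zeta  = 0.0
  rzeta = 1.0/zeta        ! quietly produces +Infinity while halting is disabled
  print *, 'rzeta =', rzeta, ' finite?', ieee_is_finite(rzeta)

  ! Rough analogue of building the main program with a trapping flag like -fpe0:
  call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
  rzeta = 1.0/zeta        ! now aborts with a floating-point exception
  print *, 'never reached'
end program div_by_zero_demo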
@rem1776 Ahh, thanks! I didn't know that.
In that case, it's most likely that something is wrong with the arguments to mo_drag, though from a debugger and write statements they seem sensible. I'll dig into it more using my release build of FMS 2022.04.
So I'm unable to replicate the error produced by the FMS 2022.04 module on Hera or Gaea with my own build of FMS.
Like I said before, the crash using the FMS module that UFS uses seems to be at rzeta = 1.0/zeta in the subroutine monin_obukhov_solve_zeta:
where (mask_1)
rzeta = 1.0/zeta
zeta_0 = zeta/z_z0
zeta_t = zeta/z_zt
zeta_q = zeta/z_zq
end where
But checking values of rzeta and related variables with my build of FMS 2022.04, all seem fine: no NaNs, no Infs, and no values anywhere close to causing a divide-by-zero error.
Inputs to it continue to seem fine to me, and inputs and outputs of its parent subroutine monin_obukhov_drag_1d also seem fine after it's called.
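For reference, checks of this kind can be written with the intrinsic IEEE module. This is a hedged sketch of a hypothetical helper, not the actual diagnostics used here; the array name u_star and the tolerance are made up for illustration.

! Hypothetical check: flag NaN, Inf, or near-zero entries in an array
! before it is passed on (e.g. to mo_drag).
program check_inputs_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_is_finite
  implicit none
  real :: u_star(4)                        ! illustrative input array
  u_star = [0.3, 0.0, 1.0e-12, 0.25]
  call check_values('u_star', u_star, 1.0e-10)
contains
  subroutine check_values(name, x, tiny_tol)
    character(*), intent(in) :: name
    real,         intent(in) :: x(:), tiny_tol
    integer :: i
    do i = 1, size(x)
       if (ieee_is_nan(x(i)) .or. .not. ieee_is_finite(x(i))) then
          print *, trim(name), ': NaN/Inf at index', i, x(i)
       else if (abs(x(i)) < tiny_tol) then
          print *, trim(name), ': near-zero value at index', i, x(i)
       end if
    end do
  end subroutine check_values
end program check_inputs_demo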
In standard Fortran you can divide real values by zero without an error; you just get an infinite value as a result. The debug build adds the -ftrapuv flag, which causes an error on floating-point divides by zero instead.
I also tried adding -ftrapuv / -fpe0 to my Release build of FMS, which didn't seem to catch anything.
UFS's DEBUG mode adds the flags -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays to the UFS build.
I'm still unsure why that would cause the crash in pre-built FMS code when it's run out of the box, using the FMS module on Gaea or Hera.
For the benefit of anyone searching for a solution to a similar problem:
When FMS is built with -O2 optimization flags or higher, the calculations inside the where clause in monin_obukhov_solve_zeta are speculatively executed without regard for which indices satisfy the masking condition, and in particular, calculations are performed for indices where division by zero occurs. As long as floating-point exceptions are disabled, this is benign, because the resulting NaN or infinity values are discarded due to the masking condition. But the FMS code inherits the floating-point environment of the main program, and in particular, if the main program is built with the -fpe0 flag, then division by zero in the FMS code will trigger a fatal exception, regardless of whether FMS itself was built with -fpe0.
So to summarize, a debug-mode UFS build shouldn't be linked with an optimized FMS build, because -fpe0 doesn't play nice with optimized code.
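As an illustration of the pattern described above, here is a minimal standalone sketch (not FMS code): an optimizing compiler may evaluate the division for every element of the array, masked elements included, before applying the where mask. With exceptions disabled the spurious Infinity is simply discarded, but if the calling program enables trapping (e.g. -fpe0), the masked zero element can still trigger a fatal exception.

! Minimal sketch of the masked-division pattern.
program masked_division_demo
  implicit none
  real    :: zeta(4), rzeta(4)
  logical :: mask_1(4)

  zeta   = [2.0, 0.0, -1.0, 4.0]
  mask_1 = zeta /= 0.0            ! intent: never divide by the zero element

  where (mask_1)
     rzeta = 1.0/zeta             ! under -O2 this may be evaluated for zeta == 0 as well,
  elsewhere                       ! with the result discarded; harmless unless FP traps are enabled
     rzeta = 0.0
  end where

  print *, rzeta
end program masked_division_demo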
Thanks @J-Lentz! Glad this is finally put to rest. Also note the confusion with -DCMAKE_BUILD_TYPE=Debug was remedied by https://github.com/NOAA-GFDL/FMS/pull/1532
The problem:
I've been stuck on this issue in my LM4 NUOPC cap for UFS. As part of this project, I've brought parts of the surface boundary layer scheme into an LM4 driver, working on the land's unstructured grid.
There is a crash in mo_drag within a lightly modified version of surface_flux_1d, but only when UFS is in debug mode (CMake flags -DDEBUG=ON -DCMAKE_BUILD_TYPE=Debug). If I build with no debug flags, there is no crash in my surface_flux adaptation.
The stack trace is:
This is with FMS 2022.04, so it seems to point to the rzeta = 1.0/zeta line in monin_obukhov_solve_zeta. It seems that zeta is or becomes zero during the solver iteration loop?
Attempts to debug have been stymied:
Input arguments of surface_flux_1d and mo_drag appear to be well-behaved, with unremarkable, realistic values. It appears that the cause of the crash is sensitive to wind speeds and bottom atmosphere layer temperature. Because UFS is using a release module of FMS, I can't dive into what values of the arguments might be causing an issue. And again, the issue only seems to appear when UFS is in debug mode.
I also built my own checkout of FMS 2022.04, in both Release and Debug modes:
If I build and run with UFS's debug flags and the Release version of FMS, to copy the module setup, there's no crash or sign of anything wrong. FMS is built using CMake with the flags -D32BIT=ON -D64BIT=ON -DOPENMP=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$FMS_INSTALL_DIR. When UFS compiles, I then unload the FMS module and set FMS_INSTALL_DIR, and UFS's CMake happily picks this up with find_package / add_library.
However, if I build FMS's debug version with -DCMAKE_BUILD_TYPE=Debug instead of Release, UFS at compile time seems to find FMS and add the library, but then can't find a subroutine in grid/grid2_mod. (This is on Hera, to avoid any possible C4 issues.)