NOAA-GFDL / FMS

GFDL's Flexible Modeling System

Trouble diagnosing crash within mo_drag when called from LM4 UFS project #1256

Closed JustinPerket closed 2 months ago

JustinPerket commented 1 year ago

The problem:

I've been stuck on this issue in my LM4 NUOPC cap for UFS. As part of this project, I've brought parts of the surface boundary layer scheme into an LM4 driver that works on the land's unstructured grid.

There is a crash in mo_drag within a lightly modified version of surface_flux_1d, but only when UFS is in debug mode (cmake flags -DDEBUG=ON -DCMAKE_BUILD_TYPE=Debug).

If I build with no debug flags, there is no crash in my surface_flux adaptation.

The stack trace is:

  14: forrtl: error (73): floating divide by zero
  14: Image              PC                Routine            Line        Source
  14: fv3_0rerun.exe     000000000380661B  Unknown               Unknown  Unknown
  14: fv3_0rerun.exe     000000000358B640  Unknown               Unknown  Unknown
  14: fv3_0rerun.exe     000000000341B538  monin_obukhov_int         267  monin_obukhov_inter.F90
  14: fv3_0rerun.exe     0000000003419E7A  monin_obukhov_int         181  monin_obukhov_inter.F90
  14: fv3_0rerun.exe     0000000002FA3E4A  monin_obukhov_mod         214  monin_obukhov.F90
  14: fv3_0rerun.exe     0000000000464A9B  lm4_surface_flux_         273  lm4_surface_flux.F90
  14: fv3_0rerun.exe     000000000042D789  lm4_driver_mp_sfc         421  lm4_driver.F90
  14: fv3_0rerun.exe     000000000041AC37  lm4_cap_mod_mp_mo         452  lm4_cap.F90

This is with FMS 2022.04, so it seems to point to this spot in monin_obukhov_solve_zeta:

         where (mask_1)
            rzeta  = 1.0/zeta
            zeta_0 = zeta/z_z0
            zeta_t = zeta/z_zt
            zeta_q = zeta/z_zq
         end where

It seems that zeta is/becomes zero during the solver iteration loop?

Attempts to debug stymied:

The input arguments of surface_flux_1d and mo_drag appear to be well-behaved, with unremarkable, realistic values. Whether the crash occurs appears to be sensitive to wind speed and to the temperature of the bottom atmosphere layer.

Because UFS is using a release module of FMS, I can't dive into what values of the arguments might be causing an issue. And again, the issue only seems to appear when UFS is in debug mode.

I also built my own checkout of FMS 2022.04 in both Release and Debug modes:

(On Hera, to avoid any possible C4 issues)

      /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/land_data.F90(327): error #6285: There is no matching specific subroutine for this generic subroutine call.   [GET_GRID_CELL_CENTERS]
      call get_grid_cell_centers ('LND',lnd%sg_face,lnd%sg_lon, lnd%sg_lat,  domain=lnd%sg_domain)
      -------^
      /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/land_data.F90(468): error #6285: There is no matching specific subroutine for this generic subroutine call.   [GET_GRID_CELL_VERTICES]
      call get_grid_cell_vertices('LND',lnd%ug_face,lnd%coord_glonb,lnd%coord_glatb)

uramirez8707 commented 1 year ago

Are you running with any OpenMP threads? Are you calling monin_obukhov_solve_zeta from an OpenMP region? Is this crash repeatable (does it fail the same way every time)?

rem1776 commented 1 year ago

@JustinPerket The debug mode for FMS's CMake build was recently added; it's mainly for allowing the person building to set custom flags (it'll compile with what is set in the CFLAGS and FCFLAGS environment variables). It doesn't add any flags on its own; it just uses CFLAGS and FCFLAGS directly.

With this build, the automatically added flags aren't applied because the build type is set to debug, so compilation fails: it's missing the flags needed for the r4/r8 libraries, which results in an argument mismatch in those interface calls. I would compile with the other build type (Release) if compiling with r4/r8. The debug type can be used, but the flags will need to be set manually, so you would have to do a separate compile for the r4/r8 libraries and also add any debug flags via the environment variables.

We could potentially make this behave in a more standard way and just have it add the expected debug flags.

JustinPerket commented 1 year ago

I should probably back up and say that the top-level repo for this project is: https://github.com/JustinPerket/ufs-weather-model/tree/feature/LM4
LM4 is a submodule in it, with a NUOPC/ESMF driver.

And here is the temp branch of the LM4 NUOPC driver where I'm implementing some of the FMS surface boundary layer functionality into it: https://github.com/JustinPerket/lm4/tree/fix/boundary_layer

Here is an even more temporary branch that reproduces the error: https://github.com/JustinPerket/ufs-weather-model/tree/TMP/debug_modrag_crash

It's checked out on Hera at /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4-foo
A model run generated with cd tests && ./rt.sh -k -l lm4_tests.conf produces the error.

A fuller trace from one of the threads is:

   4: ==== backtrace (tid:  66251) ====
   4:  0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
   4:  1 0x00000000033bbc28 monin_obukhov_inter_mp_monin_obukhov_solve_zeta_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/fms-noaa-gfdl-2022.04/monin_obukhov/monin_obukhov_inter.F90:267
   4:  2 0x00000000033ba56a monin_obukhov_inter_mp_monin_obukhov_drag_1d_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/fms-noaa-gfdl-2022.04/monin_obukhov/monin_obukhov_inter.F90:181
   4:  3 0x0000000002f43a7a monin_obukhov_mod_mp_mo_drag_1d_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/fms-noaa-gfdl-2022.04/monin_obukhov/monin_obukhov.F90:214
   4:  4 0x0000000000483d12 lm4_surface_flux_mod_mp_lm4_surface_flux_1d_()  /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/nuopc_cap/lm4_surface_flux.F90:351
   4:  5 0x0000000000449cc4 lm4_driver_mp_sfc_boundary_layer_()  /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/nuopc_cap/lm4_driver.F90:562
   4:  6 0x0000000000430d47 lm4_cap_mod_mp_modeladvance_()  /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/LM4-interface/LM4/nuopc_cap/lm4_cap.F90:452
   4:  7 0x000000000167c82f ESMCI::MethodElement::execute()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
   4:  8 0x000000000167c7b2 ESMCI::MethodTable::execute()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
   4:  9 0x000000000167acb6 c_esmc_methodtableexecute_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
   4: 10 0x000000000105af92 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:128
   4: 11 0x000000000130ce50 nuopc_modelbase_mp_routine_run_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2220
   4: 12 0x000000000111bfd4 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:2167
   4: 13 0x000000000111ff56 ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:824
   4: 14 0x00000000018a364f ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2308
   4: 15 0x000000000183211a ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VM.C:1216
   4: 16 0x000000000111d667 c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:981
   4: 17 0x000000000102f90d esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_Comp.F90:1222
   4: 18 0x00000000013368b6 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
   4: 19 0x0000000000fc5397 nuopc_driver_mp_routine_executegridcomp_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
   4: 20 0x0000000000fc4bec nuopc_driver_mp_executerunsequence_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
   4: 21 0x000000000167c82f ESMCI::MethodElement::execute()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
   4: 22 0x000000000167c7b2 ESMCI::MethodTable::execute()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
   4: 23 0x000000000167acb6 c_esmc_methodtableexecute_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
   4: 24 0x000000000105af92 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:128
   4: 25 0x0000000000fc1542 nuopc_driver_mp_routine_run_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
   4: 26 0x000000000111bfd4 ESMCI::FTable::callVFuncPtr()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:2167
   4: 27 0x000000000111ff56 ESMCI_FTableCallEntryPointVMHop()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:824
   4: 28 0x00000000018a364f ESMCI::VMK::enter()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2308
   4: 29 0x000000000183211a ESMCI::VM::enter()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Infrastructure/VM/src/ESMCI_VM.C:1216
   4: 30 0x000000000111d667 c_esmc_ftablecallentrypointvm_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMCI_FTable.C:981
   4: 31 0x000000000102f90d esmf_compmod_mp_esmf_compexecute_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_Comp.F90:1222
   4: 32 0x00000000013368b6 esmf_gridcompmod_mp_esmf_gridcomprun_()  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/src/intel-2022.1.2/pkg/v8.3.0b09/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
   4: 33 0x000000000041bd70 MAIN__()  /scratch2/GFDL/gfdlscr/Justin.Perket/UFSmodels/ufs-LM4/driver/UFS.F90:398
   4: 34 0x0000000000418822 main()  ???:0
   4: 35 0x0000000000022555 __libc_start_main()  ???:0
   4: 36 0x0000000000418729 _start()  ???:0

Within my top-level LM4 routine, a modified version of sfc_boundary_layer is called, which in turn calls a very lightly modified version of surface_flux_1d.

After that, it's using FMS modules, so when mo_drag is called, it's from monin_obukhov_mod. Though the source path in the trace is no longer present, the source should look like: https://github.com/NOAA-GFDL/FMS/tree/2022.04/monin_obukhov

> Are you running with any OpenMP threads? Are you calling monin_obukhov_solve_zeta from an OpenMP region? Is this crash repeatable (does it fail the same way every time)?

I'm not sure. OpenMP threading is enabled in UFS by default, but as far as I'm aware, I'm not explicitly building with or using it. There is a UFS build option, -DOPENMP=OFF, which I tried.

JustinPerket commented 1 year ago

@rem1776

> @JustinPerket The debug mode for FMS's CMake build was recently added; it's mainly for allowing the person building to set custom flags (it'll compile with what is set in the CFLAGS and FCFLAGS environment variables). It doesn't add any flags on its own; it just uses CFLAGS and FCFLAGS directly.

Ok, that's what I thought it looked like it was doing.

> With this build, the automatically added flags aren't applied because the build type is set to debug, so compilation fails: it's missing the flags needed for the r4/r8 libraries, which results in an argument mismatch in those interface calls. I would compile with the other build type (Release) if compiling with r4/r8. The debug type can be used, but the flags will need to be set manually, so you would have to do a separate compile for the r4/r8 libraries and also add any debug flags via the environment variables.

OK, this may be a red herring then. I was hoping it would give some insight into why this crash only occurs with UFS compiled with its debug flag.

rem1776 commented 1 year ago

@JustinPerket It could be that the debug flags are catching a divide by zero that's happening in both (release & debug) runs.

In standard Fortran you can divide real values by zero without an error; you would just get an infinite value as a result. The debug build is adding the -ftrapuv flag, which causes an error on a floating divide by zero instead.
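To make the difference between an untrapped and a trapped divide by zero concrete, here is a minimal sketch. It is only an analogy in Python, not Fortran and not FMS code: the standard `decimal` module exposes a switchable DivisionByZero trap, which stands in for the floating-point exception trapping that debug flags enable. The function name `reciprocal` is hypothetical.

```python
from decimal import Decimal, DivisionByZero, localcontext


def reciprocal(x):
    """Return 1/x using whatever decimal context is currently active."""
    return Decimal(1) / Decimal(x)


# Trap disabled: the division quietly yields Infinity, much like an
# untrapped IEEE divide by zero.
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = False
    quiet_result = reciprocal(0)

# Trap enabled: the very same division raises instead of returning.
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = True
    try:
        reciprocal(0)
        trapped = False
    except DivisionByZero:
        trapped = True
```

The arithmetic is identical in both cases; only the trap setting decides whether the zero divisor is silent or fatal.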

JustinPerket commented 1 year ago

@rem1776 Ahh, thanks! I didn't know that

JustinPerket commented 1 year ago

In that case, it's most likely that something is wrong with the arguments to mo_drag, though from a debugger and write statements they seem sensible. I'll dig into it more using my release build of FMS 2022.04.

JustinPerket commented 1 year ago

So I'm unable to replicate the error produced by the FMS 2022.04 module on Hera or Gaea with my own build of FMS.

Like I said before, the crash using the FMS module that UFS uses seems to be at rzeta = 1.0/zeta in the subroutine monin_obukhov_solve_zeta:

         where (mask_1)
            rzeta  = 1.0/zeta
            zeta_0 = zeta/z_z0
            zeta_t = zeta/z_zt
            zeta_q = zeta/z_zq
         end where

But the values of rzeta and related variables with my build of FMS 2022.04 all seem fine: no NaNs, no Infs, and no values anywhere close to causing a divide-by-zero error. Its inputs continue to seem fine to me, and the inputs and outputs of its parent subroutine monin_obukhov_drag_1d also seem fine after it's called.

> In standard Fortran you can divide real values by zero without an error; you would just get an infinite value as a result. The debug build is adding the -ftrapuv flag, which causes an error on a floating divide by zero instead.

I also tried adding -ftrapuv/-fpe0 to my Release build of FMS, which didn't seem to catch anything.

UFS's DEBUG mode adds the flags -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays to the UFS build. I'm still unsure why that would cause the crash in pre-built FMS code when it is running out of the box, using the FMS module on Gaea or Hera.

J-Lentz commented 2 months ago

For the benefit of anyone searching for a solution to a similar problem:

When FMS is built with -O2 optimization flags or higher, the calculations inside the where clause in monin_obukhov_solve_zeta are speculatively executed without regard for which indices satisfy the masking condition, and in particular, calculations are performed for indices where division by zero occurs. As long as floating point exceptions are disabled, this is benign because the resulting NaN or infinity values are discarded due to the masking condition. But the FMS code inherits the floating point environment of the main program, and in particular, if the main program is built with the -fpe0 flag, then division by zero in the FMS code will trigger a fatal exception, regardless of whether FMS itself was built with -fpe0.

So to summarize, a debug-mode UFS build shouldn't be linked with an optimized FMS build because -fpe0 doesn't play nice with optimized code.
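The mechanism described above can be mimicked in a few lines. This is a hedged sketch, not the FMS implementation: it uses Python's standard `decimal` module, whose context traps play the role of the floating-point environment, and a hypothetical function `masked_reciprocal` that stands in for an optimized library loop which evaluates every index before applying the mask.

```python
from decimal import Decimal, DivisionByZero, localcontext


def masked_reciprocal(zeta, mask):
    """Hypothetical stand-in for an optimized, masked library loop.

    1/zeta is computed for every index first (the speculative part);
    the mask is applied only afterwards, so results at masked-out
    indices are discarded.
    """
    speculative = [Decimal(1) / z for z in zeta]
    return [r if keep else Decimal(0) for r, keep in zip(speculative, mask)]


zeta = [Decimal("0.5"), Decimal(0), Decimal(2)]
mask = [True, False, True]  # index 1 (zeta == 0) is masked out

# Caller with the trap disabled: the discarded Infinity is harmless.
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = False
    release_like = masked_reciprocal(zeta, mask)

# Caller with the trap enabled: the library code is unchanged, but it
# inherits the caller's setting and the masked-out division now raises.
with localcontext() as ctx:
    ctx.traps[DivisionByZero] = True
    try:
        masked_reciprocal(zeta, mask)
        debug_like_crashed = False
    except DivisionByZero:
        debug_like_crashed = True
```

The library routine never sets the trap itself; the outcome is decided entirely by the environment the caller establishes, which is the analogue of a -fpe0 main program linked against an optimized FMS build.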

JustinPerket commented 2 months ago

Thanks @J-Lentz! Glad this is finally put to rest. Also note that the confusion with -DCMAKE_BUILD_TYPE=Debug was remedied by https://github.com/NOAA-GFDL/FMS/pull/1532