JCSDA / spack-stack

Crash in FMS when UFS with LM4 is built with Debug flag #1288

Open JustinPerket opened 2 weeks ago

JustinPerket commented 2 weeks ago

Describe the bug

There's an upcoming PR to introduce the GFDL Land Model into UFS (https://github.com/ufs-community/ufs-weather-model/pull/2146). It cannot run with the -DDEBUG=ON flag when using the FMS module provided by spack-stack. There is no issue when using an unoptimized build of FMS 2023.04 compiled with debug flags, but such a build is not available to UFS through the modules.

To Reproduce

# recreate failed test
git clone -b feature/LM4  --recursive git@github.com:JustinPerket/ufs-weather-model.git ufs-LM4
cd ufs-LM4/tests
# change regression test to debug 
# currently is : COMPILE | datm_cdeps_lm4 | intel | -DAPP=LND-LM4 | + hera orion gaea | fv3 |
# want: COMPILE | datm_cdeps_lm4 | intel | -DAPP=LND-LM4 -DDEBUG=ON | + hera orion gaea | fv3 |
sed -i 's|-DAPP=LND-LM4|-DAPP=LND-LM4 -DDEBUG=ON|g' lm4_tests.conf
# run LM4 regression tests, resulting in crash
./rt.sh -k -l lm4_tests.conf
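
One caveat with the sed line above: running it twice appends -DDEBUG=ON twice. A minimal sketch of a rerun-safe variant follows, assuming GNU sed (as the original command does). The temporary file and the `add_debug` helper are illustrative names only and are not part of the UFS test scripts.

```shell
# Work on a throwaway copy so the real lm4_tests.conf is untouched.
conf=$(mktemp)
printf '%s\n' 'COMPILE | datm_cdeps_lm4 | intel | -DAPP=LND-LM4 | + hera orion gaea | fv3 |' > "$conf"

# Hypothetical helper: only substitute on lines that do not already carry
# -DDEBUG=ON, so repeated runs do not duplicate the flag.
add_debug() {
  sed -i '/-DDEBUG=ON/!s|-DAPP=LND-LM4|-DAPP=LND-LM4 -DDEBUG=ON|' "$1"
}

add_debug "$conf"
add_debug "$conf"   # second run leaves the line unchanged

result=$(cat "$conf")
echo "$result"
rm -f "$conf"
```

To apply it for real, call the same sed expression on lm4_tests.conf inside ufs-LM4/tests.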

The crash occurs in this where statement within the FMS monin_obukhov interface: https://github.com/NOAA-GFDL/FMS/blob/7f585284/monin_obukhov/include/monin_obukhov_inter.inc#L227

Expected behavior

I'll mostly quote @J-Lentz's explanation from an email:

because the [release build FMS module] is being used, the calculations inside the where clause in monin_obukhov_solve_zeta are speculatively executed without regard for which indices satisfy the masking condition, and in particular, calculations are performed for indices where division by zero occurs. As long as floating point exceptions are disabled, this is benign because the resulting NaN or infinity values are discarded due to the masking condition. But the FMS code inherits the floating point environment of the main program [UFS], and in particular, if [UFS] is built with the -fpe0 flag, then division by zero in the FMS code will trigger a fatal exception, regardless of whether FMS itself was built with -fpe0.

To avoid this issue, when UFS is built with the CMake flag -DDEBUG=ON, a debug build of FMS would need to be available from the spack-stack environment. It would be great to see this for the newer FMS version for spack-stack 1.6.0 on Hera and Gaea (https://github.com/JCSDA/spack-stack/issues/1215).

System: Reproduced on Hera and Gaea

Additional context

I tested using my own debug build of FMS, matching the spack-stack lua file options: -DGFS_PHYS=ON -DOPENMP=ON -DENABLE_QUAD_PRECISION=ON -DWITH_YAML=OFF -DCONSTANTS=GFS -D32BIT=ON -D64BIT=ON -DFPIC=ON -DUSE_DEPRECATED_IO=ON

but then added the debug flags -g -O0 -check -check noarg_temp_created -check nopointer -warn -warn noerrors -fpe0 -ftrapuv. Then I unloaded the FMS module, set FMS_ROOT to this build, and the debug UFS-LM4 regression tests ran without issue.

Note that because of https://github.com/NOAA-GFDL/FMS/pull/1532, CMAKE_Fortran_FLAGS_DEBUG behaves in a more standard way starting with FMS 2024.02. With that version, the FMS CMake build options I used are simply:

CMAKE_FLAGS_FROM_SPACK_LUA="-DGFS_PHYS=ON -DOPENMP=ON -DENABLE_QUAD_PRECISION=ON -DWITH_YAML=OFF -DCONSTANTS=GFS -D32BIT=ON -D64BIT=ON -DFPIC=ON -DUSE_DEPRECATED_IO=ON"
CMAKE_FLAGS="-DCMAKE_BUILD_TYPE=Debug $CMAKE_FLAGS_FROM_SPACK_LUA"
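
Putting the steps together, the workflow could be sketched roughly as below. This is a dry-run outline under assumptions, not a tested recipe: the paths (FMS_SRC, FMS_BUILD, FMS_INSTALL), the `run` helper, the DRY_RUN switch, and the module name `fms` are placeholders I made up for illustration and would need adjusting per system. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical locations; adjust for your system.
FMS_SRC="${FMS_SRC:-$HOME/FMS}"
FMS_BUILD="${FMS_BUILD:-$FMS_SRC/build-debug}"
FMS_INSTALL="${FMS_INSTALL:-$FMS_SRC/install-debug}"
DRY_RUN="${DRY_RUN:-1}"

# Options copied from the spack-stack lua file, plus the Debug build type
# (sufficient for FMS 2024.02 and later, per the note above).
CMAKE_FLAGS_FROM_SPACK_LUA="-DGFS_PHYS=ON -DOPENMP=ON -DENABLE_QUAD_PRECISION=ON -DWITH_YAML=OFF -DCONSTANTS=GFS -D32BIT=ON -D64BIT=ON -DFPIC=ON -DUSE_DEPRECATED_IO=ON"
CMAKE_FLAGS="-DCMAKE_BUILD_TYPE=Debug $CMAKE_FLAGS_FROM_SPACK_LUA"

# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# $CMAKE_FLAGS is left unquoted on purpose so it splits into separate -D options.
run cmake -S "$FMS_SRC" -B "$FMS_BUILD" -DCMAKE_INSTALL_PREFIX="$FMS_INSTALL" $CMAKE_FLAGS
run cmake --build "$FMS_BUILD"
run cmake --install "$FMS_BUILD"

# Drop the release FMS module (module name assumed) and point UFS at the debug build.
run module unload fms
export FMS_ROOT="$FMS_INSTALL"
```

After this, the UFS build in the same shell should pick up the debug FMS through FMS_ROOT, which is what made the debug UFS-LM4 regression tests pass for me.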

climbfuji commented 2 weeks ago

I don't think we want a blanket debug FMS for all applications in the unified environment. That can have real consequences on runtime. I think the correct course of action is to work with the FMS developers to address this problem by coding it differently and/or providing the correct flags and directives (for FMS and/or the UFS) to prevent this from happening when FMS is compiled in release mode. In the meantime, we can absolutely provide a debug FMS version in addition to the default release FMS version on dedicated systems in spack-stack 1.8.0 (which will have fms@2024.02).

JustinPerket commented 2 weeks ago

Thanks Dom. Unless something else starts using FMS's Monin-Obukhov interface, this issue seems to be limited to the GFDL LM4. Perhaps in my upcoming PR, I could tweak UFS's CMakeLists.txt to use debug-built FMS libraries only if UFS is also built in debug mode and there's an LM4 app?

climbfuji commented 2 weeks ago

That's a good idea, yes.