GEOS-ESM / GEOSldas

Repository for the GEOS Land Data Assimilation Fixture
Apache License 2.0

floating point exception in LDAS_DEBUGCONUS/model test when using ESMA_env v4.26.0 #726

Open gmao-rreichle opened 8 months ago

gmao-rreichle commented 8 months ago

The LDAS_DEBUGCONUS/model test crashes with a floating point exception when using ESMA_env v4.26.0. The test runs ok with ESMA_env v4.23.0. All other tests (incl. GNUDEBUGCONUS) are ok.

Note that ESMA_env v4.26.0 uses a new version of HDF5.

The GEOSldas "err" and "log" files from the run that crashed are: GEOSldas_err_txt.txt GEOSldas_log_txt.txt

The log file suggests that the floating point exception occurs when opening a GEOS nc file with nf90_open() (Line 5376 of LDAS_Forcing.F90).

I overlooked this problem when testing for #713, where I probably only ran the standard tests and not the debug tests.

I suspect the problem is not within ESMA_env v4.26.0 but rather poor coding in LDAS that is exposed by the DEBUG build.

cc: @mathomp4 @weiyuan-jiang @biljanaorescanin

mathomp4 commented 8 months ago

Hmm. There definitely was no change in netcdf-fortran in the newer Baselibs. And while HDF5 did update, if that caused it, every netcdf open would fall apart.

mathomp4 commented 8 months ago

Such an odd traceback:

 3  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T__init_native_float_types+0xfc8) [0x2adbc3395c18]
 4  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T_init+0x98) [0x2adbc32fb3f8]
 5  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5VL_init_phase2+0x78) [0x2adbc33b8618]
 6  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5_init_library+0x26b) [0x2adbc30f6aeb]
 7  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5Eset_auto2+0x205) [0x2adbc3172ec5]
 8  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(+0xcb9f33) [0x2adbc2d72f33]
 9  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc4_hdf5_initialize+0x1d) [0x2adbc2d72f58]
10  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(NC_HDF5_initialize+0x2b) [0x2adbc2d705a3]
11  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_initialize+0xdd) [0x2adbc2dd15ef]
12  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(NC_open+0x8e) [0x2adbc2d0feff]
13  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_open+0x5d) [0x2adbc2d0eeae]
14  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(nf_open_+0xa2) [0x2adbbdd4c372]
15  /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(netcdf_mp_nf90_open_+0xf9) [0x2adbbdcfa3a9]

It's almost like the file is odd. I think we need to find out which file the run was trying to open and take a look at it. I wonder if it's something that HDF5 1.10 let through but that HDF5 1.14 is more sensitive or exacting about?
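
One quick first look at a suspect file is to check its leading magic bytes, which distinguish an HDF5-based netCDF-4 file from the classic netCDF formats. The following is a minimal sketch, assuming the file is readable on local disk; `sniff_format` is a hypothetical helper name and not part of netCDF or HDF5. It only inspects offset 0, so an HDF5 file with a user block would be reported as unknown.

```python
import sys

# 8-byte signature that starts an HDF5 file (and therefore a netCDF-4 file).
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"

# 4-byte signatures of the classic (non-HDF5) netCDF formats.
CLASSIC_MAGIC = {
    b"CDF\x01": "netCDF classic",
    b"CDF\x02": "netCDF 64-bit offset",
    b"CDF\x05": "netCDF 64-bit data (CDF-5)",
}


def sniff_format(path):
    """Return a rough label for the on-disk format of the file at `path`."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == HDF5_MAGIC:
        return "HDF5 (netCDF-4)"
    # Anything that matches neither signature is reported as unknown.
    return CLASSIC_MAGIC.get(head[:4], "unknown")


if __name__ == "__main__":
    # Usage: python sniff.py file1.nc4 [file2.nc4 ...]
    for name in sys.argv[1:]:
        print(name, "->", sniff_format(name))
```

A file that reports "unknown" here is likely truncated or not a netCDF file at all. A file that reports a valid format can still be damaged further in, so this is only a first filter.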

gmao-rreichle commented 8 months ago

@mathomp4, I think the more useful part of the backtrace is probably around line 2567 in GEOSldas_err_txt; see below for an excerpt. The run is trying to read a MERRA-2 file. Here's the corresponding log entry from the successful CONUS run with standard optimization:

opening file: ../input/met_forcing/MERRA2_land_forcing//MERRA2_400/diag/Y2013/M12/MERRA2_400.tavg1_2d_rad_Nx.20131231.nc4

Excerpt from GEOSldas_err_txt of failed run:

=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
GEOSldas.x         00000000012F695E  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B3489155CE0  Unknown               Unknown  Unknown
GEOSldas.x         00000000012F6D4F  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B3489155CE0  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B3480991C18  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34808F73F8  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34809B4618  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34806F2AEB  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348076EEC5  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348036EF33  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348036EF58  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348036C5A3  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B34803CD5EF  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348030BEFF  Unknown               Unknown  Unknown
libMAPL.pfio.so    00002B348030AEAE  Unknown               Unknown  Unknown
libMAPL.so         00002B347B348372  Unknown               Unknown  Unknown
libMAPL.so         00002B347B2F63A9  Unknown               Unknown  Unknown
GEOSldas.x         00000000005CF358  ldas_forcemod_mp_        5376  LDAS_Forcing.F90
GEOSldas.x         000000000058468E  ldas_forcemod_mp_        3741  LDAS_Forcing.F90
GEOSldas.x         00000000004C5415  ldas_forcemod_mp_         332  LDAS_Forcing.F90
GEOSldas.x         0000000000487C7D  geos_metforcegrid         708  GEOS_MetforceGridComp.F90
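
In a long err file, the frames that carry a resolved source file and line number are usually the ones worth reading first. The sketch below pulls those frames out of an Intel-style traceback like the excerpt above. It assumes the five-column layout shown (Image, PC, Routine, Line, Source) with no spaces inside a column; `resolved_frames` is a hypothetical helper name, and the sample text is copied from the excerpt.

```python
SAMPLE = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libMAPL.so         00002B347B2F63A9  Unknown               Unknown  Unknown
GEOSldas.x         00000000005CF358  ldas_forcemod_mp_        5376  LDAS_Forcing.F90
GEOSldas.x         000000000058468E  ldas_forcemod_mp_        3741  LDAS_Forcing.F90
GEOSldas.x         0000000000487C7D  geos_metforcegrid         708  GEOS_MetforceGridComp.F90
"""


def resolved_frames(traceback_text):
    """Return (source, line, routine) for frames with a numeric line number."""
    frames = []
    for line in traceback_text.splitlines():
        fields = line.split()
        # Skip the header, the error banner, and frames marked "Unknown".
        if len(fields) == 5 and fields[3].isdigit():
            _image, _pc, routine, lineno, source = fields
            frames.append((source, int(lineno), routine))
    return frames


if __name__ == "__main__":
    for source, lineno, routine in resolved_frames(SAMPLE):
        print(f"{source}:{lineno} ({routine})")
```

The first frame returned is the innermost one with source information, which for this excerpt is the nf90_open() call site in LDAS_Forcing.F90.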

mathomp4 commented 7 months ago

@gmao-rreichle This might be a moot issue. We've discovered some other issues with HDF5 1.14 in some of our testing. So I might be moving back our HDF5 to 1.10 for now.

Weirdly, the issues we see in Baselibs with HDF5 1.14 don't seem to be happening with Spack + 1.14, so I'm...perplexed.

weiyuan-jiang commented 7 months ago

I guess this issue is solved by this

mathomp4 commented 7 months ago

@weiyuan-jiang No. That was an attempt to work around it. I'm currently trying to build Baselibs 7.20.0 everywhere and then I'll make a new ESMA_env that reverts to HDF5 1.10.