Open gmao-rreichle opened 8 months ago
Hmm. There definitely was no change in netcdf-fortran in the newer Baselibs. And while HDF5 did update, if that caused it, every netcdf open would fall apart.
Such an odd traceback:
3 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T__init_native_float_types+0xfc8) [0x2adbc3395c18]
4 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T_init+0x98) [0x2adbc32fb3f8]
5 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5VL_init_phase2+0x78) [0x2adbc33b8618]
6 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5_init_library+0x26b) [0x2adbc30f6aeb]
7 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5Eset_auto2+0x205) [0x2adbc3172ec5]
8 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(+0xcb9f33) [0x2adbc2d72f33]
9 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc4_hdf5_initialize+0x1d) [0x2adbc2d72f58]
10 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(NC_HDF5_initialize+0x2b) [0x2adbc2d705a3]
11 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_initialize+0xdd) [0x2adbc2dd15ef]
12 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(NC_open+0x8e) [0x2adbc2d0feff]
13 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_open+0x5d) [0x2adbc2d0eeae]
14 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(nf_open_+0xa2) [0x2adbbdd4c372]
15 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(netcdf_mp_nf90_open_+0xf9) [0x2adbbdcfa3a9]
It's almost like the file is odd. I think we need to know which file the run was trying to open and take a look at it. I wonder if it's something that HDF5 1.10 let through but that HDF5 1.14 is more exacting about?
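To make the call chain in the traceback above easier to read, here is a minimal sketch that pulls the library and symbol name out of each frame. It assumes the glibc-style frame layout shown above (`index path(symbol+offset) [address]`); the names `FRAME_RE`, `parse_frames`, and `SAMPLE` are made up for illustration, and `SAMPLE` holds a few frames copied from the traceback.

```python
import re

# Matches backtrace frames of the form seen in the traceback, e.g.
#   5 /path/to/lib/libMAPL.pfio.so(H5VL_init_phase2+0x78) [0x2adbc33b8618]
# The symbol may be empty, as in frame 8: libMAPL.pfio.so(+0xcb9f33).
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<path>\S+?)"
    r"\((?P<symbol>[^+)]*)\+0x[0-9a-fA-F]+\)\s+\[0x[0-9a-fA-F]+\]"
)


def parse_frames(text):
    """Return a list of (frame index, library basename, symbol) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # skip anything that is not a backtrace frame
        library = match.group("path").rsplit("/", 1)[-1]
        symbol = match.group("symbol") or "<unnamed>"
        frames.append((int(match.group("index")), library, symbol))
    return frames


# A few frames copied from the traceback in this thread.
SAMPLE = """\
3 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5T__init_native_float_types+0xfc8) [0x2adbc3395c18]
6 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(H5_init_library+0x26b) [0x2adbc30f6aeb]
8 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(+0xcb9f33) [0x2adbc2d72f33]
13 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.pfio.so(nc_open+0x5d) [0x2adbc2d0eeae]
15 /discover/nobackup/rreichle/SystemTests/runs/LDAS_DEBUGCONUS/model/CURRENT/build//lib/libMAPL.so(netcdf_mp_nf90_open_+0xf9) [0x2adbc2d0eeae]
"""

if __name__ == "__main__":
    # Print outermost caller first, innermost frame last.
    for index, library, symbol in reversed(parse_frames(SAMPLE)):
        print(f"{index:>2}  {library:<18} {symbol}")
```

Read this way, the innermost named frames (`H5_init_library`, `H5T_init`, `H5T__init_native_float_types`) are in HDF5's library initialization, reached from the first `nc_open` call, so the traceback by itself does not show that the contents of the file had been read yet.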
@mathomp4, I think the perhaps more useful part of the backtrace is ~line 2567 in GEOSldas_err_txt, see below for an excerpt. The run is trying to read a MERRA-2 file. Here's the corresponding log entry from the successful CONUS run with standard optimization:
opening file: ../input/met_forcing/MERRA2_land_forcing//MERRA2_400/diag/Y2013/M12/MERRA2_400.tavg1_2d_rad_Nx.20131231.nc4
Excerpt from GEOSldas_err_txt of failed run:
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
GEOSldas.x 00000000012F695E Unknown Unknown Unknown
libpthread-2.22.s 00002B3489155CE0 Unknown Unknown Unknown
GEOSldas.x 00000000012F6D4F Unknown Unknown Unknown
libpthread-2.22.s 00002B3489155CE0 Unknown Unknown Unknown
libMAPL.pfio.so 00002B3480991C18 Unknown Unknown Unknown
libMAPL.pfio.so 00002B34808F73F8 Unknown Unknown Unknown
libMAPL.pfio.so 00002B34809B4618 Unknown Unknown Unknown
libMAPL.pfio.so 00002B34806F2AEB Unknown Unknown Unknown
libMAPL.pfio.so 00002B348076EEC5 Unknown Unknown Unknown
libMAPL.pfio.so 00002B348036EF33 Unknown Unknown Unknown
libMAPL.pfio.so 00002B348036EF58 Unknown Unknown Unknown
libMAPL.pfio.so 00002B348036C5A3 Unknown Unknown Unknown
libMAPL.pfio.so 00002B34803CD5EF Unknown Unknown Unknown
libMAPL.pfio.so 00002B348030BEFF Unknown Unknown Unknown
libMAPL.pfio.so 00002B348030AEAE Unknown Unknown Unknown
libMAPL.so 00002B347B348372 Unknown Unknown Unknown
libMAPL.so 00002B347B2F63A9 Unknown Unknown Unknown
GEOSldas.x 00000000005CF358 ldas_forcemod_mp_ 5376 LDAS_Forcing.F90
GEOSldas.x 000000000058468E ldas_forcemod_mp_ 3741 LDAS_Forcing.F90
GEOSldas.x 00000000004C5415 ldas_forcemod_mp_ 332 LDAS_Forcing.F90
GEOSldas.x 0000000000487C7D geos_metforcegrid 708 GEOS_MetforceGridComp.F90
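As a quick first look at the MERRA-2 file named above, one could check whether it starts with the signature of the container format it is expected to use. The sketch below uses only the Python standard library and inspects just the leading bytes: the 8-byte HDF5 signature (used by netCDF-4 files) or the `CDF` magic of the classic netCDF formats. The function name `identify_container` is hypothetical, the check does not validate anything past the header, and it only looks at offset 0 (HDF5 also permits the signature at later offsets when a user block is present). The demo writes a stand-in temporary file and does not touch the real forcing data.

```python
import os
import tempfile

# 8-byte signature at the start of an HDF5 file (netCDF-4 files are HDF5 files).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# 4-byte magic numbers of the classic netCDF formats.
CLASSIC_MAGICS = {
    b"CDF\x01": "netCDF classic",
    b"CDF\x02": "netCDF 64-bit offset",
    b"CDF\x05": "netCDF 64-bit data (CDF-5)",
}


def identify_container(path):
    """Guess the container format of a file from its first 8 bytes only."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head == HDF5_SIGNATURE:
        return "HDF5 (netCDF-4)"
    if head[:4] in CLASSIC_MAGICS:
        return CLASSIC_MAGICS[head[:4]]
    return "unknown"


if __name__ == "__main__":
    # Stand-in file for illustration; replace the path with the forcing file
    # from the log to check the real input.
    with tempfile.TemporaryDirectory() as workdir:
        fake = os.path.join(workdir, "example.nc4")
        with open(fake, "wb") as handle:
            handle.write(HDF5_SIGNATURE + b"\x00" * 8)
        print(fake, "->", identify_container(fake))
```

A file that passes this check can still be damaged further in, so this is only a cheap way to rule out a truncated or mislabeled file before digging into the library versions.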
@gmao-rreichle This might be a moot issue. We've discovered some other issues with HDF5 1.14 in some of our testing. So I might be moving back our HDF5 to 1.10 for now.
Weirdly, the issues we see in Baselibs with HDF5 1.14 don't seem to be happening with Spack + 1.14, so I'm...perplexed.
I guess this issue is solved by this
@weiyuan-jiang No. That was an attempt to work around it. I'm currently trying to build Baselibs 7.20.0 everywhere, and then I'll make a new ESMA_env that reverts to HDF5 1.10.
The LDAS_DEBUGCONUS/model test crashes with a floating point exception when using ESMA_env v4.26.0. The test runs ok with ESMA_env v4.23.0. All other tests (incl. GNUDEBUGCONUS) are ok. Note that ESMA_env v4.26.0 uses a new version of HDF5.
The GEOSldas "err" and "log" files from the run that crashed are GEOSldas_err_txt.txt and GEOSldas_log_txt.txt. The log file suggests that the floating point exception occurs when opening a GEOS nc file (Line 5376 of LDAS_Forcing.F90) using nf90_open().
I overlooked this problem when testing for #713, where I probably only ran the standard tests and not the debug tests.
I suspect the problem is not within ESMA_env v4.26.0 but rather poor coding in LDAS that is exposed by the DEBUG build.
cc: @mathomp4 @weiyuan-jiang @biljanaorescanin