ACCESS-NRI / ACCESS-ESM1.5

The ACCESS Earth System Model (ACCESS-ESM) is a fully coupled global climate model that includes atmosphere, land, ocean, sea ice, ocean biogeochemistry and land biogeochemistry components, linked together by a coupler.
Apache License 2.0

Initial Spack env fails with segfault in MOM5 #6

Closed penguian closed 1 day ago

penguian commented 3 weeks ago

The ACCESS-ESM1.5 pre-industrial configuration defined by access-esm1.5-configs fails with a SIGSEGV segmentation violation in all 180 MOM5 ranks when run with the executables created by the initial Spack environment (defined by spack.yaml on the 2-spack-yaml branch), as per testing related to access-esm1.5-configs #16. The log first reports signal 8 (floating-point invalid operation) and then the segmentation fault; both backtraces point to the HDF5 H5T__init_native_float_types() function, reached when opening a NetCDF4 file.

[gadi-cpu-clx-1435:3643495:0:3643495] Caught signal 8 (Floating point exception: floating-point invalid operation)
[...]
[gadi-cpu-clx-1434:2689398:0:2689398] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:1270086) ====
==== backtrace (tid:1270077) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000003ac858 H5T__init_native_float_types()  ???:0
 2 0x0000000000310908 H5T_init()  ???:0
 3 0x00000000003cff28 H5VL_init_phase2()  ???:0
 4 0x00000000000659c2 H5_init_library()  ???:0
 5 0x0000000000132ad5 H5Eset_auto2()  ???:0
 6 0x00000000000bbd8c nc4_hdf5_initialize()  ???:0
 7 0x00000000000c504c NC_HDF5_initialize()  ???:0
 8 0x0000000000028da8 nc_initialize()  ???:0
 9 0x000000000002ddfa NC_open()  ???:0
10 0x000000000002de3b nc__open()  ???:0
11 0x00000000000150e1 nf__open_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-netcdf-fortran-4.6.1-22f4qcf67piiovm4vtfrl5g54eb4zfzr/spack-src/fortran/nf_control.F90:228
12 0x000000000164a862 mpp_io_mod_mp_mpp_open_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/mpp/include/mpp_io_connect.inc:510
13 0x000000000143af9c fms_io_mod_mp_get_file_unit_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms_io.F90:5440
14 0x0000000001460d57 fms_io_mod_mp_field_exist_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms_io.F90:5644
15 0x0000000001466bbc fms_io_mod_mp_fms_io_init_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms_io.F90:524
16 0x0000000001400f45 fms_mod_mp_fms_init_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms.F90:335
17 0x0000000000474f4a MAIN__()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/access_coupler/ocean_solo.F90:219
18 0x0000000000410262 main()  ???:0
19 0x000000000003ad85 __libc_start_main()  ???:0
20 0x000000000041016e _start()  ???:0
[...]
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
fms_ACCESS-CM.x    0000000001B69979  Unknown               Unknown  Unknown
libpthread-2.28.s  000014C0865CACF0  Unknown               Unknown  Unknown
fms_ACCESS-CM.x    0000000001B69D52  Unknown               Unknown  Unknown
libpthread-2.28.s  000014C0865CACF0  Unknown               Unknown  Unknown
libhdf5.so.310.3.  000014C085486858  H5T__init_native_     Unknown  Unknown
libhdf5.so.310.3.  000014C0853EA908  H5T_init              Unknown  Unknown
libhdf5.so.310.3.  000014C0854A9F28  H5VL_init_phase2      Unknown  Unknown
libhdf5.so.310.3.  000014C08513F9C2  H5_init_library       Unknown  Unknown
libhdf5.so.310.3.  000014C08520CAD5  H5Eset_auto2          Unknown  Unknown
libnetcdf.so.19.2  000014C087B1AD8C  nc4_hdf5_initiali     Unknown  Unknown
libnetcdf.so.19.2  000014C087B2404C  NC_HDF5_initializ     Unknown  Unknown
libnetcdf.so.19.2  000014C087A87DA8  nc_initialize         Unknown  Unknown
libnetcdf.so.19.2  000014C087A8CDFA  NC_open               Unknown  Unknown
libnetcdf.so.19.2  000014C087A8CE3B  nc__open              Unknown  Unknown
libnetcdff.so.7.2  000014C0875DA0E1  nf__open_             Unknown  Unknown
fms_ACCESS-CM.x    000000000164A862  mpp_io_mod_mp_mpp         510  mpp_io_connect.inc
fms_ACCESS-CM.x    000000000143AF9C  fms_io_mod_mp_get        5440  fms_io.F90
fms_ACCESS-CM.x    0000000001460D57  fms_io_mod_mp_fie        5644  fms_io.F90
fms_ACCESS-CM.x    0000000001466BBC  fms_io_mod_mp_fms         524  fms_io.F90
fms_ACCESS-CM.x    0000000001400F45  fms_mod_mp_fms_in         335  fms.F90
fms_ACCESS-CM.x    0000000000474F4A  MAIN__                    219  ocean_solo.F90
fms_ACCESS-CM.x    0000000000410262  Unknown               Unknown  Unknown
libc-2.28.so       000014C08622DD85  __libc_start_main     Unknown  Unknown
fms_ACCESS-CM.x    000000000041016E  Unknown               Unknown  Unknown
[...]

penguian commented 3 weeks ago

The segfault is possibly caused by a known bug that was introduced in hdf5-1.14.3 and fixed in hdf5-1.14.4. See https://github.com/HDFGroup/hdf5/issues/4381 and https://github.com/HDFGroup/hdf5/issues/3831.

penguian commented 3 weeks ago

The following change in packages/mom5/package.py results in a successful ACCESS-ESM1.5 pre-industrial run:

[pcl851@gadi-login-09 spack-packages]$ git diff
diff --git a/packages/mom5/package.py b/packages/mom5/package.py
index a36c149..6309dde 100644
--- a/packages/mom5/package.py
+++ b/packages/mom5/package.py
@@ -45,6 +45,9 @@ class Mom5(MakefilePackage):
         depends_on("libaccessom2~deterministic", when="~deterministic")
     with when("@access-esm1.5"):
         depends_on("oasis3-mct@access-esm1.5")
+        # Avoid segfault in HDF5 1.14.3
+        # https://github.com/HDFGroup/hdf5/issues/4381
+        depends_on("hdf5@:1.14.2,1.14.4:")

     phases = ["edit", "build", "install"]
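
The version constraint in the diff above, "hdf5@:1.14.2,1.14.4:", is a comma-separated list of inclusive ranges, so it admits every HDF5 release except those between the two bounds (here, 1.14.3). The following is a minimal standalone sketch of that logic, not Spack's own implementation: the helper names parse_version, in_range and satisfies are hypothetical, and versions are compared as plain integer tuples, which ignores Spack's prefix matching and non-numeric version components.

```python
# Simplified model of a Spack-style version constraint such as
# ":1.14.2,1.14.4:". All helper names here are hypothetical; this is
# not Spack's API, only an illustration of the range semantics.

HDF5_CONSTRAINT = ":1.14.2,1.14.4:"  # taken from the depends_on() in the diff


def parse_version(text):
    """Turn a dotted numeric version like '1.14.3' into (1, 14, 3)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, spec_range):
    """Check a version tuple against one range: ':hi', 'lo:' or 'lo:hi'.

    Both bounds are inclusive. A range without a colon is treated as an
    exact match, which is a simplification made for this sketch.
    """
    if ":" not in spec_range:
        return version == parse_version(spec_range)
    low, high = spec_range.split(":")
    if low and version < parse_version(low):
        return False
    if high and version > parse_version(high):
        return False
    return True


def satisfies(version_text, constraint):
    """Return True if the version falls in any comma-separated range."""
    version = parse_version(version_text)
    return any(in_range(version, r) for r in constraint.split(","))


if __name__ == "__main__":
    for candidate in ("1.14.2", "1.14.3", "1.14.4"):
        print(candidate, satisfies(candidate, HDF5_CONSTRAINT))
```

Run as a script, this prints True for 1.14.2 and 1.14.4 and False for 1.14.3, which is the exclusion the package.py change is intended to achieve.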

CodeGat commented 1 day ago

Can we close this issue then @penguian ?

penguian commented 1 day ago

Closed by #5 (when merged).