COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

Investigate possible floating point exception in HDF5 library #153

Open anton-seaice opened 2 months ago

anton-seaice commented 2 months ago

When using the 0.3.0 Debug build of OM3 (spack 0.21.2), @micaeljtoliveira hit this error while running the MOM6-CICE6 1deg_jra55do_iaf config:

forrtl: error (65): floating invalid
Image             PC               Routine           Line       Source
libpthread-2.28.s 000014DFE4FC4CF0 Unknown              Unknown Unknown
libhdf5.so.310.3. 000014DFE2BA1B6B H5T__init_native_    Unknown Unknown
libhdf5.so.310.3. 000014DFE2B06FA8 H5T_init             Unknown Unknown
libhdf5.so.310.3. 000014DFE2BC5130 H5VL_init_phase2     Unknown Unknown
libhdf5.so.310.3. 000014DFE2856673 H5_init_library      Unknown Unknown
libhdf5.so.310.3. 000014DFE2925945 H5Eset_auto2         Unknown Unknown
libnetcdf.so.19   000014DFE7CB5ABC nc4_hdf5_initiali    Unknown Unknown
libnetcdf.so.19   000014DFE7CBDFEC NC_HDF5_initializ    Unknown Unknown
libnetcdf.so.19   000014DFE7C1ACB8 nc_initialize        Unknown Unknown
libnetcdf.so.19   000014DFE7C1E58A NC_open              Unknown Unknown
libnetcdf.so.19   000014DFE7C1E297 nc_open              Unknown Unknown
libnetcdff.so.7.2 000014DFE8011BC2 nf_open_             Unknown Unknown
libnetcdff.so.7.2 000014DFE8055329 netcdf_mp_nf90_op    Unknown Unknown
access-om3-MOM6-C 00000000072D2BB5 ice_read_write_mp       1072 ice_read_write.F90
access-om3-MOM6-C 000000000713E142 ice_grid_mp_init_        342 ice_grid.F90
access-om3-MOM6-C 00000000075B984B cice_initmod_mp_c         57 CICE_InitMod.F90
access-om3-MOM6-C 0000000006A87EA2 ice_comp_nuopc_mp        589 ice_comp_nuopc.F90
access-om3-MOM6-C 0000000000E79684 _ZN5ESMCI6FTable1       2167 ESMCI_FTable.C
access-om3-MOM6-C 0000000000E7D7BA ESMCI_FTableCallE        824 ESMCI_FTable.C
access-om3-MOM6-C 0000000001BA3BBF _ZN5ESMCI3VMK5ent       2321 ESMCI_VMKernel.C
access-om3-MOM6-C 0000000002274FC2 _ZN5ESMCI2VM5ente       1216 ESMCI_VM.C
access-om3-MOM6-C 0000000000E7AAC7 c_esmc_ftablecall        981 ESMCI_FTable.C
access-om3-MOM6-C 0000000000C3AD91 esmf_compmod_mp_e       1223 ESMF_Comp.F90
access-om3-MOM6-C 000000000132F5A9 esmf_gridcompmod_       1412 ESMF_GridComp.F90
access-om3-MOM6-C 0000000000B4CD64 nuopc_driver_mp_l       2886 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000B1490F nuopc_driver_mp_i       1318 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000E79684 _ZN5ESMCI6FTable1       2167 ESMCI_FTable.C
access-om3-MOM6-C 0000000000E7D7BA ESMCI_FTableCallE        824 ESMCI_FTable.C
access-om3-MOM6-C 0000000001BA3BBF _ZN5ESMCI3VMK5ent       2321 ESMCI_VMKernel.C
access-om3-MOM6-C 0000000002274FC2 _ZN5ESMCI2VM5ente       1216 ESMCI_VM.C
access-om3-MOM6-C 0000000000E7AAC7 c_esmc_ftablecall        981 ESMCI_FTable.C
access-om3-MOM6-C 0000000000C3AD91 esmf_compmod_mp_e       1223 ESMF_Comp.F90
access-om3-MOM6-C 000000000132F5A9 esmf_gridcompmod_       1412 ESMF_GridComp.F90
access-om3-MOM6-C 0000000000B4CD64 nuopc_driver_mp_l       2886 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000B14B62 nuopc_driver_mp_i       1323 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000AF9D7A nuopc_driver_mp_i        481 NUOPC_Driver.F90
access-om3-MOM6-C 0000000000E79684 _ZN5ESMCI6FTable1       2167 ESMCI_FTable.C
access-om3-MOM6-C 0000000000E7D7BA ESMCI_FTableCallE        824 ESMCI_FTable.C
access-om3-MOM6-C 0000000001BA3BBF _ZN5ESMCI3VMK5ent       2321 ESMCI_VMKernel.C
access-om3-MOM6-C 0000000002274FC2 _ZN5ESMCI2VM5ente       1216 ESMCI_VM.C
access-om3-MOM6-C 0000000000E7AAC7 c_esmc_ftablecall        981 ESMCI_FTable.C
access-om3-MOM6-C 0000000000C3AD91 esmf_compmod_mp_e       1223 ESMF_Comp.F90
access-om3-MOM6-C 000000000132F5A9 esmf_gridcompmod_       1412 ESMF_GridComp.F90
access-om3-MOM6-C 0000000000431FE8 MAIN__                   128 esmApp.F90
access-om3-MOM6-C 000000000043124D Unknown              Unknown Unknown
libc-2.28.so      000014DFE4C27D85 __libc_start_main    Unknown Unknown
access-om3-MOM6-C 000000000043116E Unknown              Unknown Unknown

The run did not fail with the Release build.

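The Debug build is compiled with -fpe0, which traps floating-point exceptions instead of letting them pass silently. The trace shows the exception is raised during HDF5 library initialisation, so we believe there is a bug within the HDF5 library that only becomes visible when trapping is enabled.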
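
To illustrate why the same operation can be harmless in a Release build and fatal in a Debug build, here is a minimal sketch using the standard `ieee_exceptions` module. It is not the HDF5 code path from the trace. The program name and variables are made up for illustration, and it assumes the compiler honours the IEEE halting modes (some compilers need a strict floating-point model flag for that).

```fortran
program fpe_halting_sketch
  ! Illustrative only: shows how trapping changes the outcome of an
  ! invalid floating-point operation. Not taken from HDF5 or access-om3.
  use, intrinsic :: ieee_exceptions
  implicit none

  real, volatile :: zero   ! volatile so the division is not folded at compile time
  real :: r
  logical :: raised

  zero = 0.0

  ! Mimic a build without trapping: do not halt on invalid operations.
  if (ieee_support_halting(ieee_invalid)) then
    call ieee_set_halting_mode(ieee_invalid, .false.)
  end if
  call ieee_set_flag(ieee_invalid, .false.)

  r = zero / zero          ! invalid operation: sets the flag, execution continues
  call ieee_get_flag(ieee_invalid, raised)
  write(6,*) 'halting off: result = ', r, ' invalid flag raised = ', raised

  ! Mimic roughly what -fpe0 enables for invalid operations: halt on them.
  if (ieee_support_halting(ieee_invalid)) then
    call ieee_set_flag(ieee_invalid, .false.)
    call ieee_set_halting_mode(ieee_invalid, .true.)
    r = zero / zero        ! expected to abort here if halting is honoured
    write(6,*) 'only reached if halting is not honoured: ', r
  end if
end program fpe_halting_sketch
```

If halting behaves as assumed, the first division only sets a flag while the second aborts the program, which mirrors the Release versus Debug difference reported here.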

A small (not quite minimal) example that reproduces the problem (compile with mpifort -fpe0):

program nc_open_example
  use netcdf

  implicit none

  integer :: status, fid
  character(len=500) :: filename_nc3,  filename_nc4

  ! Two copies of the CICE grid file (named nc3 and nc4 by the author)
  filename_nc4 = '/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/grid_2024.04.04.nc '
  filename_nc3 = '/g/data/ik11/inputs/access-om3/0.x.0/1deg/cice/grid.nc '

  ! Open each file read-only and print the resulting status message
  status = nf90_open(trim(filename_nc3), NF90_NOWRITE, fid)

  write(6,*) nf90_strerror(status)

  status = nf90_open(trim(filename_nc4), NF90_NOWRITE, fid)

  write(6,*) nf90_strerror(status)

  ! Same again, via a relative path in the working directory
  status = nf90_open('link_to_grid.nc', NF90_NOWRITE, fid)

  write(6,*) nf90_strerror(status)

end program nc_open_example

dougiesquire commented 2 months ago

This has been fixed in hdf5 1.14.4 - see https://github.com/HDFGroup/hdf5/pull/3837

dougiesquire commented 2 months ago

The executables in

/g/data/ik11/spack/0.21.2/opt/linux-rocky8-cascadelake/intel-2021.10.0/access-om3-d6813d6b9e1df560ac3f6ba6a605daab9cfd9569_main-q4wfaqb

are built against hdf5@develop-1.14 which includes the above fix.

anton-seaice commented 1 month ago

There are no DEBUG builds in that folder - I guess I would need to do a separate debug build using those modules from that path?

anton-seaice commented 1 week ago

> There are no DEBUG builds in that folder - I guess I would need to do a separate debug build using those modules from that path?

I may have misunderstood: when building through spack, the file name doesn't include Release/Debug as it does when building through build.sh. So possibly your executable would have fixed the problem.

This bug may no longer be relevant. With the ACCESS-NRI build, specifying build_type=Debug doesn't flow down to the dependencies, i.e. the Debug flags are only on when compiling the access-om3 code (the bits in this repo and its submodules), not when compiling hdf5.