Closed dqwu closed 1 year ago
Are E3SM cases running with PIO_IOTYPE_NETCDF4P?
"The error is due to H5DSattach_scale() calls failing in the NetCDF library." Should an issue be opened with NetCDF?
The default I/O type used by E3SM is pnetcdf, and we can test netcdf4p by running xmlchange PIO_TYPENAME=netcdf4p.
So we must have NetCDF 4.9.0 installed to avoid this?
So as long as NetCDF 4.8.1 is not used, we can avoid this issue. If E3SM wants to upgrade the NetCDF modules for some machines in config_machines.xml, it would be best to skip NetCDF 4.8.1.
@rljacob @jayeshkrishna It seems that some E3SM machines are currently using NetCDF 4.8.1: pm-cpu, pm-gpu, alvarez, crusher, chicoma-cpu, ascent, fugaku.
IMO, although it would be good to have pio_iotype_netcdf4p working for E3SM, it is not a high priority (we use either PnetCDF or NetCDF serial for all E3SM runs). So when upgrading the library versions, please ensure that NetCDF serial works for all E3SM cases.
fwiw, I have a PR to update alvarez to cray-netcdf-hdf5parallel/4.9.0.1. Assuming pm-cpu will follow.
With NetCDF 4.8.1 or later versions, some E3SM cases that run with PIO_IOTYPE_NETCDF4P fail with HDF5 errors from nc_enddef() (when creating a NetCDF4 file). The error message from a run on one of the ANL GCE nodes (ne4 F case, NetCDF 4.9.0) is shown below:
The two NetCDF issues below have simple NETCDF4 test programs to reproduce the HDF5 errors: https://github.com/Unidata/netcdf-c/issues/2165 https://github.com/Unidata/netcdf-c/issues/2251
The error is due to H5DSattach_scale() calls failing in the NetCDF library. It is returned from netcdf-c/libhdf5/nc4hdf.c, where the high-level dimension scale (DS) API H5DSattach_scale is called multiple times inside a loop:
According to HDF5 developers, HDF5 does not test any of the HL DS APIs like H5DSattach_scale in a parallel setting and these APIs are intended to be called by a single process (a single process creating/opening the file and calling the API).
In some cases, after enough iterations of the loop above, HDF5 might get out of step between the ranks (see https://github.com/Unidata/netcdf-c/issues/1822), which causes the error.
Workaround:
NetCDF 4.9.0 introduced the NC_NODIMSCALE_ATTACH flag (used when creating files), which makes dimscale attachment to variables optional; see https://github.com/Unidata/netcdf-c/pull/2161
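Because the flag only exists from NetCDF 4.9.0 onwards, it may help to check which library version a run is linked against. The following is a minimal sketch, assuming the version string has the form returned by nc_inq_libvers() (for example "4.9.0 of ..."). The helper name sketch_has_nodimscale_attach is hypothetical and is not part of NetCDF or PIO.

```c
#include <stdio.h>

/* Hypothetical helper (not a NetCDF or PIO function).
 * Returns 1 if the version string starts with a version >= 4.9,
 * 0 if it is older, and -1 if the string cannot be parsed.
 * Assumption: the string begins with "major.minor", as in the
 * output of nc_inq_libvers(). */
int sketch_has_nodimscale_attach(const char *libvers)
{
    int major = 0, minor = 0;

    if (libvers == NULL || sscanf(libvers, "%d.%d", &major, &minor) != 2)
        return -1;

    return (major > 4) || (major == 4 && minor >= 9);
}
```

In real code the argument would typically be nc_inq_libvers(). A compile-time check with #ifdef NC_NODIMSCALE_ATTACH is usually enough when the headers and the library come from the same installation.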
As a workaround, we can apply this new NetCDF option when creating files using PIO_IOTYPE_NETCDF4P to avoid calling H5DSattach_scale.
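A minimal sketch of how the flag could be added to the file creation mode is given below. It is an illustration under stated assumptions and is not the actual PIO change. The enum and the function sketch_create_mode are hypothetical. The two #define values are stand-ins so that the sketch compiles without netcdf.h. Real code should include <netcdf.h>, use its definitions, and guard the flag with #ifdef NC_NODIMSCALE_ATTACH so that it still builds against releases older than 4.9.0.

```c
/* Stand-in values so the sketch compiles without netcdf.h.
 * Assumption: real code includes <netcdf.h> and uses the library's
 * own definitions instead of these placeholders. */
#ifndef NC_NETCDF4
#define NC_NETCDF4 0x1000
#endif
#ifndef NC_NODIMSCALE_ATTACH
#define NC_NODIMSCALE_ATTACH 0x40000
#endif

/* Hypothetical I/O type enum for illustration; not PIO's actual enum. */
typedef enum {
    SKETCH_IOTYPE_PNETCDF,
    SKETCH_IOTYPE_NETCDF,
    SKETCH_IOTYPE_NETCDF4C,
    SKETCH_IOTYPE_NETCDF4P
} sketch_iotype;

/* Build the creation mode for a new file.
 * Only the parallel NetCDF4 I/O type gets NC_NODIMSCALE_ATTACH, so that
 * the library does not call H5DSattach_scale when the file is written
 * by multiple ranks. */
int sketch_create_mode(sketch_iotype iotype, int base_mode)
{
    int mode = base_mode;

    if (iotype == SKETCH_IOTYPE_NETCDF4C || iotype == SKETCH_IOTYPE_NETCDF4P)
        mode |= NC_NETCDF4;

    if (iotype == SKETCH_IOTYPE_NETCDF4P)
        mode |= NC_NODIMSCALE_ATTACH;

    return mode;
}
```

The resulting mode would then be passed to the parallel create call, for example nc_create_par(path, mode, comm, info, &ncid). Serial NetCDF4 files keep their dimscale attachments in this sketch, since the failure is only reported for the parallel case.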