E3SM-Project / scorpio

A high-level Parallel I/O Library for structured grid applications

PIO_IOTYPE_NETCDF4P requires NC_NODIMSCALE_ATTACH option #501

Closed dqwu closed 1 year ago

dqwu commented 1 year ago

With NetCDF 4.8.1 or later, some E3SM cases run with PIO_IOTYPE_NETCDF4P fail with HDF5 errors from nc_enddef() when creating a NetCDF4 file. The error message from a run on one of the ANL GCE nodes (ne4 F case, NetCDF 4.9.0) is shown below:

[4] PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Problem with HDF5 dimscales. (file = F2010_ne4_oQU240_netcdf4p.eam.h0.0001-01-01-00000.nc) (/scratch/wuda/E3SM/externals/scorpio/src/clib/pioc_support.c: 4341)

The two NetCDF issues below include simple NETCDF4 test programs that reproduce the HDF5 errors:
https://github.com/Unidata/netcdf-c/issues/2165
https://github.com/Unidata/netcdf-c/issues/2251

The error is due to H5DSattach_scale() calls failing in the NetCDF library. The error is returned in the NetCDF library in netcdf-c/libhdf5/nc4hdf.c, where the High-level DS API H5DSattach_scale is called multiple times inside a loop:

 if (H5DSattach_scale(hdf5_var->hdf_datasetid, dsid, d) < 0) 
     return NC_EHDFERR; 

According to the HDF5 developers, HDF5 does not test any of the HL DS APIs, such as H5DSattach_scale, in a parallel setting. These APIs are intended to be called by a single process (one process that creates or opens the file and calls the API).

In some cases, with enough iterations of the loop above, the ranks can get out of step inside HDF5 (see https://github.com/Unidata/netcdf-c/issues/1822), which causes the error.

Workaround:

NetCDF 4.9.0 introduced the NC_NODIMSCALE_ATTACH flag, passed when creating a file, which makes dimscale attachment to variables optional; see https://github.com/Unidata/netcdf-c/pull/2161

As a workaround, we can apply this new NetCDF option when creating files using PIO_IOTYPE_NETCDF4P to avoid calling H5DSattach_scale.
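A minimal sketch of how the flag could be applied, assuming the NetCDF headers are included before this fragment. The helper name `netcdf4p_create_mode` is hypothetical (it is not a SCORPIO or NetCDF function); `NC_NODIMSCALE_ATTACH`, `NC_NETCDF4`, `NC_CLOBBER` and `nc_create_par()` are the NetCDF names. The `#ifdef` guard is there so the same code still builds against headers older than 4.9.0, where the flag does not exist.

```c
/* Sketch only. In a real build, include <netcdf.h> and <netcdf_par.h>
 * before this helper so that the NC_* macros are visible. */

/* Hypothetical helper: OR NC_NODIMSCALE_ATTACH into a create mode when
 * the installed NetCDF headers define it, and return the mode unchanged
 * otherwise. */
int netcdf4p_create_mode(int base_cmode)
{
    int cmode = base_cmode;
#ifdef NC_NODIMSCALE_ATTACH
    /* Ask NetCDF not to attach dimscales, which avoids the
     * H5DSattach_scale() calls that fail in parallel runs. */
    cmode |= NC_NODIMSCALE_ATTACH;
#endif
    return cmode;
}

/* Intended use with the parallel create call (path and comm are
 * placeholders supplied by the caller):
 *
 *   int ncid;
 *   int ret = nc_create_par(path,
 *                           netcdf4p_create_mode(NC_NETCDF4 | NC_CLOBBER),
 *                           comm, MPI_INFO_NULL, &ncid);
 */
```

With headers that lack the flag, the helper is a no-op, so the create call behaves exactly as before.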

rljacob commented 1 year ago

E3SM cases are running with PIO_IOTYPE_NETCDF4P ?

rljacob commented 1 year ago

"The error is due to H5DSattach_scale() calls failing in the NetCDF library." Should an issue be opened with NetCDF?

dqwu commented 1 year ago

> "The error is due to H5DSattach_scale() calls failing in the NetCDF library." Should an issue be opened with NetCDF?

See https://github.com/Unidata/netcdf-c/issues/2251

dqwu commented 1 year ago

> E3SM cases are running with PIO_IOTYPE_NETCDF4P ?

The default IO type used by E3SM is pnetcdf. We can test netcdf4p by running xmlchange PIO_TYPENAME=netcdf4p

rljacob commented 1 year ago

So we must have NetCDF 4.9.0 installed to avoid this?

dqwu commented 1 year ago

> So we must have NetCDF 4.9.0 installed to avoid this?

As long as NetCDF 4.8.1 is not used, we can avoid this issue (4.8.1 hits the failure but predates the NC_NODIMSCALE_ATTACH flag introduced in 4.9.0). If E3SM upgrades the NetCDF modules for some machines in config_machines.xml, we suggest skipping NetCDF 4.8.1.
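As a rough illustration, a runtime check for the problematic version could look like the sketch below. The helper name `is_netcdf_4_8_1` is hypothetical, and the sketch assumes the version string (for example the one returned by `nc_inq_libvers()`) begins with `major.minor.patch`.

```c
#include <stdio.h>

/* Hypothetical helper: return 1 if the version string starts with
 * 4.8.1, and 0 otherwise (including when it cannot be parsed). */
int is_netcdf_4_8_1(const char *libvers)
{
    int major = 0, minor = 0, patch = 0;

    /* Expect a leading "major.minor.patch"; anything after it is ignored. */
    if (libvers == NULL ||
        sscanf(libvers, "%d.%d.%d", &major, &minor, &patch) != 3)
        return 0;

    return major == 4 && minor == 8 && patch == 1;
}
```

A caller could pass the library's version string at startup and warn before selecting netcdf4p when the helper returns 1.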

dqwu commented 1 year ago

@rljacob @jayeshkrishna It seems that some E3SM machines are currently using NetCDF 4.8.1: pm-cpu, pm-gpu, alvarez, crusher, chicoma-cpu, ascent, fugaku

jayeshkrishna commented 1 year ago

IMO, although it would be good to have pio_iotype_netcdf4p working for E3SM, it is not a high priority (we use either PnetCDF or serial NetCDF for all E3SM runs). So when upgrading the library versions, please ensure that serial NetCDF works for all E3SM cases.

ndkeen commented 1 year ago

fwiw, I have a PR to update alvarez to cray-netcdf-hdf5parallel/4.9.0.1. Assuming pm-cpu will follow.