Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

netcdf-hdf5 parallel define size limit? #1886

Open MicroTed opened 3 years ago

MicroTed commented 3 years ago

config.log.mvapich.txt

Environment Information

Summary of Issue

I've been testing 4.7.4 with our cloud simulation model for compressed HDF5 parallel I/O (MPI-IO). (HDF5 is 1.10.7, netCDF-Fortran is 4.5.3, pnetcdf is 1.9.0.) It seems to work great until I switch to an option that takes the number of 3D arrays from about 32 to about 250. Then it returns

DEFINE_NETCDF: Error ending define mode -101 NetCDF: HDF error
DEFINE_NETCDF: Error Closing File: NetCDF: HDF error

It does create a file (about 2MB) that is unreadable by ncdump.

The same error occurs when variable compression is not enabled, so I suppose that zlib compression is not the issue. There is no problem creating a 64-bit offset (pnetcdf) file or a serial-write file. (Note that I create pnetcdf files using the netcdf interface, but write the data with pnetcdf functions.) It doesn't seem to be a new issue, as it also occurs with my older installation of hdf5-1.8.9 + netcdf 4.3.1.1

The file size with pnetcdf is only 713MB for my test configuration (grid dimensions of 120x120x51). I think the -101 (NC_EHDFERR) means an error at the HDF5 level?

I'm setting the file mode with consecutive 'ior' statements:

    cmode = IOR(nf90_netcdf4, nf90_classic_model)
    cmode = IOR(cmode, nf90_mpiio)

    status = nf90_create(filename(ibeg:iend), cmode, ncid, comm = my_comm, info = my_info)

Any idea if this would be in the HDF5 part, or something passed to it that it doesn't like? The same file creates fine as serial (i.e., without the mpiio flag).

Thanks! -- Ted Mansell

Steps to reproduce the behavior

WardF commented 3 years ago

Hi Ted, you are correct that this appears to be something at the HDF5 level. I'll take a look; would you be comfortable providing your code (or a C equivalent)? If that's not practical, I will try to recreate the issue myself so that I can step through with the debugger and see exactly what is happening at the HDF5 level.

MicroTed commented 3 years ago

Hi Ward, well, I made a simple program to write a bunch of 3D arrays, and that runs fine. So the next step may be to add attributes and the other things the cloud model uses.

edwardhartnett commented 3 years ago

I can confirm that there is no size limit, and NOAA produces some massive data files with many variables, using compression and parallel I/O. There is a test program which demonstrates and tests parallel I/O with compression: nc_perf/tst_gfs_data_1.c. This program is only built and run if netcdf-c is configured with --enable-benchmarks.
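A sketch of building and running that test (the parallel-tests flag, job size, and launcher are my assumptions, not from the comment above):

```shell
# Build netcdf-c with the benchmark tests enabled.
./configure --enable-benchmarks --enable-parallel-tests
make -j4

# Run the parallel-compression test; launcher and rank count are assumptions.
cd nc_perf && mpiexec -n 4 ./tst_gfs_data_1
```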

MicroTed commented 3 years ago

Right, Ed. I started on a new test code to recreate what the model sets up, but other things came up. I hope to get back to it soon.

MicroTed commented 3 years ago

I set this aside for a while, but I'm back on it, now with WRF as well. I made a WRF I/O option for NETCDFPAR (parallel I/O using the netCDF-4 interface), separate from the default netcdf. It works. One of the issues I ran into was different processors trying to set different values for chunking, which was also happening in COMMAS.

Unfortunately, that didn't solve the original define error. The same error shows up even without setting chunking explicitly and with compression turned off. (This is the case with about 250 3D arrays.)

edwardhartnett commented 2 years ago

Can you send a small test program that demonstrates the problem?

MicroTed commented 2 years ago

@edwardhartnett I tried to reproduce the issue with a simple code and lots of arrays, but it did not fail. I haven't gotten back to this in a while. My guess is there is something I'm missing in my model code where a collective MPI action has conflicts, but I really don't know.

edwardhartnett commented 2 years ago

OK, maybe you could close this issue and re-open it if this turns out to be related to netCDF.

MicroTed commented 2 years ago

Sure, good suggestion. I will do that.

MicroTed commented 2 years ago

OK, I made some progress on this and have a relatively small example that demonstrates the problem; it can be found here (the README has info on setup and running to get the enddef error):

https://github.com/MicroTed/ncpartest.git

Basically, it sets up a grid structure with a number of 0D, 1D, 2D, and 3D variables, which may or may not be time dependent. The dimensions are time (TIME), x axes (XC, XE), y axes (YC, YE), and z axes (ZC, ZE) (C and E denote the center and edge positions of the grid points on a Cartesian grid).

    dimensions:
        TIME = UNLIMITED ; // (0 currently)
        XC = 80 ;
        XE = 81 ;
        YC = 40 ;
        YE = 41 ;
        ZC = 40 ;
        ZE = 41 ;

And then the corresponding coordinates are defined with the same name:

    int TIME(TIME) ;
    float XC(XC) ;
    float XE(XE) ;
    etc.

So if these coordinate variables are NOT defined, then it works with any number of total variables. But there's a caveat: if the coordinate variables ARE defined and I turn off setting attributes for each variable, it works again. I don't know if there is any connection there -- any thoughts on that?

Edit: Tested with hdf5 1.10.7, netcdf 4.7.4, pnetcdf 1.9.0

MicroTed commented 1 year ago

In addition to testing with ifort, I also ran this test with gfortran-12, netcdf 4.8.0, hdf5 1.10.7, and got an extra error (overflow):

    NetCDF: HDF error
    Note: The following floating-point exceptions are signalling: IEEE_OVERFLOW_FLAG

Further test with hdf5-1.13.2, netcdf-c 4.9.0, netcdf-fortran 4.6.0 (and mpich 4.0.2):

    NetCDF: Problem with HDF5 dimscales.
    Note: The following floating-point exceptions are signalling: IEEE_OVERFLOW_FLAG
    STOP 2

MicroTed commented 1 year ago

@edwardhartnett @WardF My workaround now is to reduce the number of variable attributes. Curiously, I can add all the string/character attributes, but adding just one integer attribute trips the error. Those integers are not really needed for anything at this time, fortunately. At least it can work!