Open MicroTed opened 3 years ago
Hi Ted, you are correct that this appears to be something at the HDF5 level. I'll take a look; it would help if you were comfortable providing your code (or a C equivalent). If that's not practical, I will try to recreate this issue so that I can step through the debugger and see what exactly is happening at the HDF5 level.
Hi, Ward. Well, I made a simple program to write a bunch of 3D arrays, and that runs fine. So the next step may be to add attributes and other things that are used in the cloud model.
I can confirm that there is no size limit, and NOAA produces some massive data files with many variables, using compression and parallel I/O. There is a test program which demonstrates and tests the parallel I/O with compression: nc_perf/tst_gfs_data_1.c. This program is only built and run if netcdf-c is configured with --enable-benchmarks.
Right, Ed. I started on a new test code to recreate what the model sets up, but other things came up. I hope to get back to it soon.
I set this aside for a while, but I'm back on it, now with WRF as well. I made a WRF I/O option for NETCDFPAR (parallel I/O using the netCDF-4 interface), separate from the default netcdf. It works. One of the issues I ran into was different processors trying to set different values for chunking, which was also happening in COMMAS.
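For what it's worth, in parallel mode the chunking call is collective, so every rank must pass identical chunk sizes. One way to guarantee agreement is to derive the chunks from nothing but the (globally agreed) dimension lengths. A minimal C sketch of that idea follows; the helper name and the 64-element target cap are illustrative assumptions, not part of the netCDF API:

```c
#include <assert.h>
#include <stddef.h>

/* Derive per-dimension chunk sizes from the dimension lengths alone, so
 * that every MPI rank computes identical values to pass to a collective
 * call such as nc_def_var_chunking().  The helper name and the "target"
 * cap are illustrative assumptions, not netCDF API. */
static void pick_chunks(const size_t *dimlen, int ndims, size_t target,
                        size_t *chunk)
{
    for (int i = 0; i < ndims; i++)
        chunk[i] = (dimlen[i] < target) ? dimlen[i] : target;
}
```

Because the result depends only on values every rank already shares, no rank can disagree with another about the chunk sizes.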
Unfortunately, that didn't solve the original define error. The same error shows up, even without setting chunking explicitly and also turning off compression. (This is the case with about 250 3D arrays.)
Can you send a small test program that demonstrates the problem?
@edwardhartnett I tried to reproduce the issue with a simple code and lots of arrays, but it did not fail. I haven't gotten back to this in a while. My guess is there is something I'm missing in my model code where a collective MPI action has conflicts, but I really don't know.
OK, maybe you could close this issue and re-open it if this turns out to be related to netCDF.
Sure, good suggestion. I will do that.
OK, I made some progress on this and have a relatively small example that demonstrates my problem; it can be found here (the readme has info on setup and running to get the enddef error):
https://github.com/MicroTed/ncpartest.git
Basically, it sets up a grid structure with a number of 0d, 1d, 2d, and 3d variables, which may or may not be time dependent. The dimensions are time (TIME), x axes (XC, XE), y axes (YC, YE) and z axes (ZC, ZE) (the C and E are for the center and edge positions of the grid points, on a Cartesian grid).
dimensions:
  TIME = UNLIMITED ; // (0 currently)
  XC = 80 ;
  XE = 81 ;
  YC = 40 ;
  YE = 41 ;
  ZC = 40 ;
  ZE = 41 ;
And then the corresponding coordinates are defined with the same name:
int TIME(TIME) ;
float XC(XC) ;
float XE(XE) ;
etc.
So if these coord variables are NOT defined, then it works with any number of total variables. But there's a caveat: if the coord vars ARE defined and I turn off setting attributes for each variable, it works again. I don't know if there is any connection there -- any thoughts on that?
Edit: Tested with hdf5 1.10.7, netcdf 4.7.4, pnetcdf 1.9.0
In addition to testing with ifort, I also ran this test with gfortran-12, netcdf 4.8.0, hdf5 1.10.7, and got an extra error (overflow):
NetCDF: HDF error
Note: The following floating-point exceptions are signalling: IEEE_OVERFLOW_FLAG
Further test with hdf5-1.13.2, netcdf-c 4.9.0, netcdf-fortran 4.6.0 (and mpich 4.0.2):
NetCDF: Problem with HDF5 dimscales.
Note: The following floating-point exceptions are signalling: IEEE_OVERFLOW_FLAG
STOP 2
@edwardhartnett @WardF My workaround now is to reduce the number of variable attributes. Curiously, I can add all the string/character attributes, but adding just one integer attribute trips the error. Those integers are not really needed for anything at this time, fortunately. At least it can work!
config.log.mvapich.txt
Environment Information
configure
Summary of Issue
I've been testing 4.7.4 with our cloud simulation model for the compressed HDF5 parallel I/O (mpiio). (HDF5 is 1.10.7, netCDF-Fortran is 4.5.3, pnetcdf 1.9.0.) It seems to work great until I switch to an option that takes the number of 3D arrays from about 32 to about 250. Then it returns
DEFINE_NETCDF: Error ending define mode -101 NetCDF: HDF error
DEFINE_NETCDF: Error Closing File: NetCDF: HDF error
It does create a file (about 2MB) that is unreadable by ncdump.
The same error occurs when variable compression is not enabled, so I suppose that zlib compression is not the issue. There is no problem creating a 64-bit offset (pnetcdf) file or a serial-write file. (Note that I create pnetcdf files using the netCDF interface, but write the data with pnetcdf functions.) It doesn't seem to be a new issue, as it also occurs with my older installation of hdf5 1.8.9 + netcdf 4.3.1.1.
The file size with pnetcdf is only 713MB for my test configuration (grid dimensions of 120x120x51). I think the -101 means an error in the HDF level?
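On the -101: that is NC_EHDFERR, netCDF's generic "error at HDF5 layer" status, which nc_strerror() renders as "NetCDF: HDF error" -- so the failure is indeed surfacing from inside HDF5. A small C sketch of checking for it; the constant's value matches netcdf.h, but the describe() helper is illustrative, not library code:

```c
#include <assert.h>
#include <string.h>

/* NC_EHDFERR (-101) is netCDF's "Error at HDF5 layer" status; the real
 * library maps it to "NetCDF: HDF error" via nc_strerror().  This tiny
 * lookup is an illustrative sketch, not library code. */
#define NC_EHDFERR (-101)

static const char *describe(int status)
{
    if (status == NC_EHDFERR)
        return "NetCDF: HDF error";
    return "unknown status";
}
```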
I'm setting the file mode with consecutive 'ior' statements:
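The consecutive ior() calls amount to a bitwise OR of creation flags, e.g. combining the netCDF-4 and MPI-IO flags into one cmode. A C sketch of the same composition; the two numeric values below are copied in as assumptions only so the snippet stands alone, and real code should take the definitions from netcdf.h instead:

```c
#include <assert.h>

/* Creation-mode flags combined by bitwise OR, the C analogue of
 * Fortran's ior(NF90_NETCDF4, NF90_MPIIO).  The numeric values are
 * reproduced here as assumptions so the sketch is self-contained;
 * real code should #include <netcdf.h> and use its definitions. */
#define NC_NETCDF4 0x1000
#define NC_MPIIO   0x2000

static int parallel_cmode(void)
{
    return NC_NETCDF4 | NC_MPIIO;  /* both flags set in one cmode */
}
```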
Any idea if this would be in the HDF5 part or something passed to it that it doesn't like? The same file creates fine as serial (i.e., without the mpiio flag).
Thanks! -- Ted Mansell
Steps to reproduce the behavior