Closed — dqwu closed this issue 2 years ago
We have had some issues with intel and hdf5-1.10.x hanging.
One option that has solved a couple of our hangs is to use mpirun --mca io ompio
(or mpiexec). This has fixed some issues we were seeing with openmpi-4.0.3 with both gcc and intel.
OK, first we need a PR that makes this a test in netcdf-c.
Then we need a HDF5-only test which demonstrates the issue.
Then we can go to the HDF5 team with a bug report.
OK, I have made a test for this, tst_parallel7.c. It passes for me on my machine.
Let's put the test into the build to make sure it passes for everyone else. If it fails somewhere, then we have learned something. If it passes everywhere, then maybe this bug has been fixed...
(But let's wait until after the next netcdf-c release to merge this test.)
Something that you are doing in the code @dqwu which I did not expect users to do...
How come you don't change the variable mode to NC_COLLECTIVE when defining the metadata? I would have expected to see the nc_var_par_access() call before the nc_enddef(). By doing it this way, the access mode stays NC_INDEPENDENT for most variables but is NC_COLLECTIVE for others.
You are right, I forgot to call nc_var_par_access() before the nc_enddef() in my example code. However, this hanging issue is still reproducible with that minor change.
OK, does the issue occur with the current master and HDF5-1.12.2? Or just with historical versions of HDF5 and/or netcdf-c?
I can still reproduce this issue on my laptop with NetCDF 4.7.1 built with HDF5 1.10.5. I will try HDF5 1.12 later.
@edwardhartnett This issue is reproducible with NetCDF 4.7.1 built with HDF5 1.10.6 but not reproducible with NetCDF 4.7.1 built with HDF5 1.10.7. So it is likely a bug in HDF5 1.10.6 and older that has been fixed in HDF5 1.10.7 and newer.
OK, then let's close this issue. There's not going to be anything we can do to make this work in old releases of HDF5. ;-)
FYI, I am not sure whether it is related to this bug fix in the HDF5-1.10.7 release:
Bug Fixes since HDF5-1.10.6 release
==================================
Library
-------
- Fix bug and simplify collective metadata write operation when some ranks
have no entries to contribute. This fixes parallel regression test
failures with IBM SpectrumScale MPI on the Summit system at ORNL.
(QAK - 2020/09/02)
@dqwu could you please close this issue?
Closing this issue, since it has been fixed in HDF5 1.10.7.
The following simple NETCDF4/HDF5 test program can reproduce a hanging issue on some supercomputers.
Cori@NERSC: /opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0, /opt/cray/pe/hdf5-parallel/1.10.5.2/INTEL/19.0
Theta@ALCF: /opt/cray/pe/netcdf-hdf5parallel/4.7.3.3/INTEL/19.1, /opt/cray/pe/hdf5-parallel/1.10.6.1/INTEL/19.1
This test case is run with 16 MPI tasks, and only ranks 0 through 14 have data to write (start/count are set to 0 for rank 15). It hangs forever when writing varid 76.
It can also be reproduced on my laptop with NetCDF 4.7.1 built with HDF5 1.10.5.
This issue might be related to the HDF5 version, since it is not reproducible on the ANL workstation compute001 with NetCDF 4.7.3 built with HDF5 1.8.16.