Open GeorgeVandenberghe-NOAA opened 3 years ago
Do the HDF5 parallel I/O tests pass?
Also what version of HDF5 was used?
Don't know. HDF5 parallel tests have problems on some of the NOAA systems where the subset used by NetCDF passes. I'm treating the HDF5 parallel tests as an expensive-to-run-down source of false positives. I have reported a few to HDF5 support and they've acknowledged problems. HDF5 serial tests all pass.
I am going to let Cray deal with "expensive to run down" because I am supposed to be decomposing emc-post.
--
George W Vandenberghe
IMSG at NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
@.***
301-683-3769(work) 3017751547(cell)
Do you know the version of HDF5?
What I suspect is that it's 1.10.7, and I'm wondering if 1.12.0 has been tried, and if it will work.
HDF5/1.10.6. Someone else tried 1.12.0 and it failed, but I haven't tried it. I will add the HDF5 version to the issue; that was another miss on my part.
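For anyone adding version details to a report like this, a minimal sketch of collecting them from the installed tools follows. It assumes nc-config and one of the HDF5 compiler wrappers (h5pcc or h5cc) are on PATH; the report helper is a hypothetical name used only for illustration, and the exact wording of the wrapper's configuration summary may vary between HDF5 builds.

```shell
#!/bin/sh
# Sketch: print netCDF and HDF5 version details for a bug report.
# Assumption: nc-config and h5pcc/h5cc are on PATH; a note is printed if not.

report() {
    # $1 = label, remaining arguments = command to run (hypothetical helper)
    label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%s: %s not found on PATH\n' "$label" "$1"
    fi
}

report "netCDF version" nc-config --version
report "netCDF parallel4" nc-config --has-parallel4

# The HDF5 wrappers print a build summary; keep only the version line.
for wrapper in h5pcc h5cc; do
    if command -v "$wrapper" >/dev/null 2>&1; then
        "$wrapper" -showconfig 2>/dev/null | grep 'HDF5 Version'
        break
    fi
done
```

On systems that use environment modules, the loaded module names (for example hdf5/1.10.6) are usually worth quoting alongside this output.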
This is where the HDF5 tests hang on WCOSS2 using hdf5/1.12.0:
MPIO GB file read test MPItest.h5
Test if MPI_File_get_size works correctly with MPItest.h5
Proc 5: hostname=nid001000
MPIO GB file write test MPItest.h5
MPIO GB file read test MPItest.h5
Test if MPI_File_get_size works correctly with MPItest.h5
Proc 3: hostname=nid001000
MPIO GB file write test MPItest.h5
MPIO GB file read test MPItest.h5
Test if MPI_File_get_size works correctly with MPItest.h5
0.02user 0.02system 0:00.58elapsed 7%CPU (0avgtext+0avgdata 22480maxresident)k
0inputs+0outputs (0major+8737minor)pagefaults 0swaps
Testing: t_bigio
^Z
[1]+ Stopped make check
George.Vandenberghe@nid001000:~/ns/l/lib/netp12/hdf5-1.12.0> ps
PID TTY TIME CMD
The Ctrl-Z came after a long stretch with no progress in t_bigio.
OK, there is a new version of HDF5, 1.12.1. Does it work? ;-)
@GeorgeVandenberghe-NOAA it would be great to test whether HDF5-1.12.1 fixes your problem.
Also, it would be great to test a build of the current master of the netcdf-c repo, and see how it does. The next release is coming soon, and there have been many changes...
We are treating this as a WCOSS2 vendor problem with their MPI implementation. We are building with HDF5 1.10.6 but 1.12.1 was tested and verified to both reproduce the problem and otherwise work. Parallel NetCDF is mostly working now on WCOSS2 and only hangs with unusual rank distributions. Improvement has occurred with occasional pre-acceptance MPI upgrades.
I think we tested HDF5 1.12.0. The test was last November, when we were trying to package a test case. We are currently at 1.10.6 on WCOSS2.
HDF5/1.12 didn't. Haven't tried 1.12.1. The problem is now intermittent and fairly rare and Cray is treating it as a bug in their MPICH implementation.
Do we have a cmake build for NetCDF yet? (I thought the old autoconf and make build worked fine but ... )
Yes, there is a CMake build of netcdf. However, it should make no difference to your MPI problems...
The current way to reproduce it is to configure with ./configure --prefix=$PREFIX --enable-netcdf-4 --disable-dap --enable-parallel4 --enable-parallel-tests --disable-shared, then build and install normally. After that, submit an interactive batch job that requests a few nodes, cd to the build directory, and run make check. Running make check for parallel jobs varies from system to system; this is how it's done on WCOSS2
The compiler and MPI are intel/19.1.3.304 and cray-mpich/8.1.2; HDF5 is 1.10.6.
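The reproduction steps above can be sketched as a small script. This is only an illustration under stated assumptions: the run helper, the DRY_RUN switch, and the default PREFIX are hypothetical additions, and the WCOSS2 batch submission is left out because it is site-specific and not given in this thread. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps; run from the top of the netcdf-c source tree.
# Assumptions (not from the thread): the PREFIX default and the DRY_RUN switch.
PREFIX=${PREFIX:-$HOME/netcdf-test}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, anything else = execute

run() {
    # Echo the command, then execute it unless this is a dry run.
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# Configure with the options quoted in the report, then build and install.
run ./configure --prefix="$PREFIX" --enable-netcdf-4 --disable-dap \
    --enable-parallel4 --enable-parallel-tests --disable-shared &&
run make &&
run make install &&
# On a cluster this step belongs inside an interactive batch job that
# requests a few nodes; how that job is requested is site-specific.
run make check
```

Setting DRY_RUN=0 executes the steps for real; the make check step then needs to run from within the batch allocation so the parallel tests have an MPI launcher available.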