Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

Netcdf-c/4.7.4 parallel MPI tests fail with hangs on the NOAA WCOSS2 platform. #1964

Open GeorgeVandenberghe-NOAA opened 3 years ago

GeorgeVandenberghe-NOAA commented 3 years ago

The current way to reproduce it is to configure with ./configure --prefix=$PREFIX --enable-netcdf-4 --disable-dap --enable-parallel4 --enable-parallel-tests --disable-shared, build and install normally, then submit an interactive batch job that requests a few nodes, cd to the build directory, and run make check. How make check is run for parallel jobs varies from system to system; this is how it's done on WCOSS2.

The compiler and MPI are intel/19.1.3.304 and cray-mpich/8.1.2. HDF5 is 1.10.6.
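
The reproduction steps above can be sketched as a small shell script. This is only an illustration: the function names and the default PREFIX are assumptions, not part of the report, and the batch submission step is left out because it is site-specific and not given in the issue.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the report.
# PREFIX and its default are placeholders (assumptions), not values from the issue.
PREFIX="${PREFIX:-$HOME/netcdf-install}"

# Print the configure flags quoted in the report, one per line.
configure_flags() {
    printf '%s\n' \
        "--prefix=$PREFIX" \
        --enable-netcdf-4 \
        --disable-dap \
        --enable-parallel4 \
        --enable-parallel-tests \
        --disable-shared
}

# Configure, build, and install; run from the top of the netcdf-c source tree.
build_netcdf() {
    # Word splitting on the flag list is intentional; assumes PREFIX has no spaces.
    ./configure $(configure_flags) && make && make install
}

# Run the test suite. Per the report, do this from the build directory inside
# an interactive batch job that has a few nodes allocated.
run_checks() {
    make check
}
```

Calling build_netcdf on a login node and then run_checks inside the batch job mirrors the order described in the report.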

edwardhartnett commented 3 years ago

Do the HDF5 parallel I/O tests pass?

Also what version of HDF5 was used?

GeorgeVandenberghe-NOAA commented 3 years ago

Don't know. The HDF5 parallel tests have problems on some of the NOAA systems even where the subset used by NetCDF passes. I'm treating the HDF5 parallel tests as an expensive-to-run-down source of false positives. I have reported a few to HDF5 support and they've acknowledged problems. The HDF5 serial tests all pass.

I am going to let Cray deal with "expensive to run down" because I am supposed to be decomposing emc-post.

--

George W Vandenberghe

IMSG at NOAA/NWS/NCEP/EMC

5830 University Research Ct., Rm. 2141

College Park, MD 20740

@.***

301-683-3769(work) 3017751547(cell)

edwardhartnett commented 3 years ago

Do you know the version of HDF5?

What I suspect is that it's 1.10.7, and I'm wondering if 1.12.0 has been tried, and if it will work.

GeorgeVandenberghe-NOAA commented 3 years ago

HDF5/1.10.6. Someone else tried 1.12.0 and it failed, but I haven't tried it myself. I will add the HDF5 version to the issue; that was another miss on my part.

GeorgeVandenberghe-NOAA commented 3 years ago

This is where the HDF5 tests hang on WCOSS2 using hdf5/1.12.0:

MPIO GB file read test MPItest.h5
Test if MPI_File_get_size works correctly with MPItest.h5
Proc 5: hostname=nid001000
MPIO GB file write test MPItest.h5
MPIO GB file read test MPItest.h5
Test if MPI_File_get_size works correctly with MPItest.h5
Proc 3: hostname=nid001000
MPIO GB file write test MPItest.h5
MPIO GB file read test MPItest.h5
Test if MPI_File_get_size works correctly with MPItest.h5
0.02user 0.02system 0:00.58elapsed 7%CPU (0avgtext+0avgdata 22480maxresident)k
0inputs+0outputs (0major+8737minor)pagefaults 0swaps

Finished testing t_mpi

make[4]: Leaving directory '/lfs/h1/emc/nceplibs/noscrub/gwv/l/lib/netp12/hdf5-1.12.0/testpar'
make[4]: Entering directory '/lfs/h1/emc/nceplibs/noscrub/gwv/l/lib/netp12/hdf5-1.12.0/testpar'

Testing: t_bigio
^Z
[1]+ Stopped make check
George.Vandenberghe@nid001000:~/ns/l/lib/netp12/hdf5-1.12.0> ps
PID TTY TIME CMD

The Ctrl-Z came after a long time with no progress.
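
A hang like the one above can be turned into an explicit failure by running the test under a time limit instead of waiting and pressing Ctrl-Z. The sketch below is not from the report: it assumes GNU coreutils timeout is available, and run_with_limit is a hypothetical helper name. GNU timeout exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# Run a command under a time limit so that a hang is reported as a failure.
# Assumes GNU coreutils `timeout`; run_with_limit is a hypothetical helper.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    # GNU timeout returns 124 when the command was stopped at the limit.
    if [ "$status" -eq 124 ]; then
        echo "HANG: '$*' exceeded ${limit}s" >&2
    fi
    return "$status"
}
```

For example, run_with_limit 1800 make check would give the test suite 30 minutes; the 1800-second value is arbitrary and would need tuning for the system.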

edwardhartnett commented 2 years ago

OK, there is a new version of HDF5, 1.12.1. Does it work? ;-)

edwardhartnett commented 2 years ago

@GeorgeVandenberghe-NOAA it would be great to test whether HDF5-1.12.1 fixes your problem.

Also, it would be great to test a build of the current master of the netcdf-c repo, and see how it does. The next release is coming soon, and there have been many changes...

GeorgeVandenberghe-NOAA commented 1 year ago

We are treating this as a WCOSS2 vendor problem with their MPI implementation. We are building with HDF5 1.10.6 but 1.12.1 was tested and verified to both reproduce the problem and otherwise work. Parallel NetCDF is mostly working now on WCOSS2 and only hangs with unusual rank distributions. Improvement has occurred with occasional pre-acceptance MPI upgrades.

GeorgeVandenberghe-NOAA commented 1 year ago

I think we tested HDF5 1.12.0; the test was last November, when trying to package a test case. We are currently at 1.10.6 on WCOSS2.

GeorgeVandenberghe-NOAA commented 1 year ago

HDF5/1.12 didn't fix it. I haven't tried 1.12.1. The problem is now intermittent and fairly rare, and Cray is treating it as a bug in their MPICH implementation.

Do we have a cmake build for NetCDF yet? (I thought the old autoconf and make build worked fine but ... )

edwardhartnett commented 1 year ago

Yes, there is a CMake build of netcdf. However, it should make no difference to your MPI problems...