Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

MPI-related test failure using mpich 4.2.0, gcc 13.2.0 #3015

Open WardF opened 1 month ago

WardF commented 1 month ago

Update: For clarity, the tests pass when using mpich 4.0, gcc 11.4.0.


I'm observing a failure using mpicc and running nc_test4/run_par_test.sh.

The failure occurs when mpicc wraps gcc 13.x (with mpich 4.2.0), but not on systems where mpicc wraps gcc 11.x (with mpich 4.0). On my end this is most easily observed by comparing Ubuntu 22.04 with 24.04. I've created two Docker images that can be used to observe this. They can be run as follows:

$ docker run --rm -it docker.unidata.ucar.edu/h5par:2204 

and

$ docker run --rm -it docker.unidata.ucar.edu/h5par:2404

You can enter the environment by appending bash to the end of either docker command.
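For anyone wanting to poke around in both environments, a small sketch of the "append bash" variant follows. The helper names `image_for` and `enter_h5par` are made up for illustration; the image names are the ones given above, and nothing here has been checked beyond that.

```shell
# Hypothetical helper: map a tag (2204 or 2404) to the image name used above.
image_for() {
  case "$1" in
    2204|2404) printf 'docker.unidata.ucar.edu/h5par:%s\n' "$1" ;;
    *) echo "unknown tag: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: open an interactive shell in the chosen image
# by appending bash to the docker command, as described above.
enter_h5par() {
  image=$(image_for "$1") || return 1
  docker run --rm -it "$image" bash
}

# Usage (not run automatically):
#   enter_h5par 2204
#   enter_h5par 2404
```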

It seems that the issue is related to the different version of mpicc, but I'm trying to sort through what exactly is going on. Any suggestions would be appreciated.
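One way to narrow down what differs between the two environments is to record the toolchain versions in each and diff the results. The sketch below assumes `mpicc`, `mpichversion` and `nc-config` may or may not be on the PATH; the `report` helper is a made-up name, and a missing tool is reported rather than treated as an error.

```shell
# Hypothetical helper: print the first line of a tool's version output,
# or note that the tool is not installed in this environment.
report() {
  tool=$1; shift
  if command -v "$tool" >/dev/null 2>&1; then
    printf '%s: %s\n' "$tool" "$("$tool" "$@" 2>&1 | head -n 1)"
  else
    printf '%s: not found\n' "$tool"
  fi
}

report mpicc --version      # underlying compiler as seen through the wrapper
report mpichversion         # mpich release, if mpich is the MPI in use
report nc-config --version  # netCDF-C release, if already installed
```

Running this in each container and comparing the output should show whether anything besides the compiler and mpich versions changed.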

The error specifically is as follows:

153: Testing simple parallel I/O with 16 processors...
153: 
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 284
153: Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 91
153: 
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...
153: *** Testing more advanced parallel access.
153: *** Testing parallel IO for raw-data with MPI-IO (driver)...Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 284
153: Sorry! Unexpected result, /root/hdf5-1.14.3/netcdf-c/nc_test4/tst_parallel3.c, line: 91
153:
(...)
WardF commented 1 month ago

@edwardhartnett @jhendersonHDF if anything leaps out at you, feel free to chime in, it might save some time as I dig through this! And if not, no worries XD. Thanks!

WardF commented 1 month ago

Additional notes:

On Ubuntu 24.04, installing libhdf5-mpi-dev installs openmpi and related tools. This version of libhdf5 works just fine, although the nc_test4/run_par_test.sh script requires that --oversubscribe be passed to mpiexec -n 16 ./tst_parallel3. Otherwise, mpiexec complains if the machine has fewer than 16 cores/processors/what-have-you.

Using mpich and a custom-built libhdf5, we cannot oversubscribe. However, this is not an issue, because invoking mpiexec -n 2 ./tst_parallel3 results in the same issue as if we passed 4, or 8, or 16. Running tst_parallel3 directly works, but of course it is bypassing MPI entirely.
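To check the process-count observation quickly, something like the following loop could be used. It is only a sketch: `run_counts` and the `MPIEXEC_EXTRA` variable are made-up names, the log file names are arbitrary, and a run is treated as failed if `mpiexec` exits non-zero or the log contains the "Unexpected result" text from the error above.

```shell
# Hypothetical helper: run one test binary under several process counts.
# MPIEXEC_EXTRA is an assumed variable for launcher-specific flags,
# e.g. MPIEXEC_EXTRA=--oversubscribe with openmpi; leave it unset for mpich.
run_counts() {
  bin=$1; shift
  for n in "$@"; do
    log="par_np$n.log"
    status=0
    # MPIEXEC_EXTRA is intentionally unquoted so several flags split into words.
    mpiexec ${MPIEXEC_EXTRA:-} -n "$n" "$bin" >"$log" 2>&1 || status=$?
    if [ "$status" -ne 0 ] || grep -q 'Unexpected result' "$log"; then
      echo "np=$n: FAIL (exit $status, see $log)"
    else
      echo "np=$n: pass"
    fi
  done
}

# Usage, from the nc_test4 build directory (not run automatically):
#   run_counts ./tst_parallel3 2 4 8 16
```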

Installing libhdf5-mpich-dev shows the same behavior as the custom-built version of libhdf5. This suggests the issue is specific to mpich rather than inherent to MPI.