HDFGroup / hdf5

Official HDF5® Library Repository
https://www.hdfgroup.org/

t_mpi aborts on Fedora Rawhide with mpich on s390x #3730

Open opoplawski opened 11 months ago

opoplawski commented 11 months ago

Describe the bug

Test log for t_mpi 
============================
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test files in a
different directory or to add file type prefix. e.g.,
   HDF5_PARAPREFIX=pfs:/PFS/user/me
   export HDF5_PARAPREFIX
*** End of Hint ***
===================================
MPI functionality tests
===================================
Abort(676932623) on node 2 (rank 2 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(84).......................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(219).................: 
MPIDI_POSIX_mpi_bcast_release_gather(132)..: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Command exited with non-zero status 15

Expected behavior

No test failure

Platform (please complete the following information)

build-s390x.log

derobins commented 11 months ago

@jhendersonHDF, @lrknox - s390x is big-endian. Do we ever see t_mpi failures on our Power system? It's probably too late to investigate this for 1.14.3, but we could build a recent version of MPICH there and test for 1.14.4.

opoplawski commented 6 months ago

The test is still failing with the latest hdf5_1_14 branch, but it now seems to produce a different error message:

make[4]: Entering directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
============================
Testing: t_mpi 
============================
Test log for t_mpi 
============================
Command exited with non-zero status 15