HDFGroup / hdf5

Official HDF5® Library Repository
https://www.hdfgroup.org/

Parallel test failure with 1.14.X and mpich #2474

Open opoplawski opened 1 year ago

opoplawski commented 1 year ago

Describe the bug
Looking at updating the Fedora hdf5 package to 1.14.0. Seeing the following test failure:

Testing write to shared filtered chunks w/ two unlimited dimensions
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 415991 RUNNING AT a300054fffa142a9bb1b7ab6e1821ce2
=   EXIT CODE: 14
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Command exited with non-zero status 15

Platform (please complete the following information)

Test builds are happening here: https://copr.fedorainfracloud.org/coprs/orion/hdf5-1.14/builds and full logs can be found there.

byrnHDF commented 1 year ago

Looks like t_cache_image test hung.

lrknox commented 1 year ago

> Looks like t_cache_image test hung.

t_cache_image timed out in one run, t_filters_parallel in another.

jhendersonHDF commented 1 year ago

@opoplawski Would it be possible to try the same builds using at least MPICH 4.0.3 to see if there were any bugs fixed that may have caused these issues?

fortnern commented 1 year ago

We've noticed a bug in MPICH 4.0-4.0.3 when using the default "ch4:ofi" device. Last we checked, it had been fixed in the 4.1 branch. If you need to use 4.0-4.0.3, you can work around it by building MPICH with "--with-device=ch4:ucx" or "--with-device=ch3".
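For anyone applying the workaround above when building MPICH from source, the configure invocation would look roughly like the sketch below (the tarball version and install prefix are illustrative, not from this thread):

```shell
# Sketch: build MPICH 4.0.x with an alternate device to avoid the ch4:ofi bug.
# Version number and --prefix are illustrative; adjust to your environment.
tar xzf mpich-4.0.3.tar.gz && cd mpich-4.0.3
./configure --prefix=/opt/mpich-ucx --with-device=ch4:ucx   # or: --with-device=ch3
make -j"$(nproc)" && make install
```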

Edit: Just confirmed the 4.1 release passes with ch4:ofi

opoplawski commented 1 year ago

Looks like the Fedora mpich package is built with --with-device=ch3:nemesis
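As a side note, the device an installed MPICH was built with can be checked via the `mpichversion` utility that ships with MPICH; the sample output line below is a hypothetical illustration of how to extract just the device string:

```shell
# `mpichversion` prints the build configuration of the installed MPICH,
# including a "MPICH Device:" line. The sample line here is hypothetical,
# used only to illustrate extracting the device string.
sample_line='MPICH Device:    ch3:nemesis'
echo "$sample_line" | grep -oE 'ch[0-9]+(:[a-z]+)?'
```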

derobins commented 1 year ago

@opoplawski - Is this still a problem with the hdf5_1_14 branch?

opoplawski commented 1 year ago

Looks like it is:

make[4]: Entering directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
============================
Testing: t_filters_parallel 
============================
Test log for t_filters_parallel 
============================
==========================
  Parallel Filters tests
==========================

** Some tests will be skipped due to TestExpress setting.
** Exhaustive tests will only be performed for the first available filter.
** Set the HDF5TestExpress environment variable to 0 to perform exhaustive testing for all available filters.

Test Info:
  MPI size: 6
  Test express level: 3
  Using seed: 1697792661

...

== Running tests in mode 'USE_MULTIPLE_DATASETS_MIXED_FILTERED' with filter 'Deflate' using selection I/O mode 'on', 'Multi-Chunk I/O' and 'Early' allocation time ==

Testing write to one-chunk filtered dataset
Testing write to unshared filtered chunks
Testing partial write to unshared filtered chunks
Testing write to shared filtered chunks
Testing write to unshared filtered chunks w/ single unlimited dimension
Testing write to shared filtered chunks w/ single unlimited dimension
Testing write to unshared filtered chunks w/ two unlimited dimensions
Testing write to shared filtered chunks w/ two unlimited dimensions
Testing write to filtered chunks with a single process having no selection
Testing write to filtered chunks with all processes having no selection
Testing write to filtered chunks with point selection
Testing interleaved write to filtered chunks
Testing write to unshared transformed and filtered chunks
Testing write to unshared filtered chunks on separate pages in 3D dataset
Testing write to unshared filtered chunks on the same pages in 3D dataset
Testing write to shared filtered chunks in 3D dataset
Testing write to unshared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing write to shared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing write to unshared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing write to shared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing read from one-chunk filtered dataset
Testing read from unshared filtered chunks
Testing read from shared filtered chunks
Testing read from filtered chunks with a single process having no selection
Testing read from filtered chunks with all processes having no selection
Testing read from filtered chunks with point selection
Testing interleaved read from filtered chunks
Testing read from unshared transformed and filtered chunks
Testing read from unshared filtered chunks on separate pages in 3D dataset
Testing read from unshared filtered chunks on the same pages in 3D dataset
Testing read from shared filtered chunks in 3D dataset
Testing read from unshared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing read from shared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing read from unshared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing read from shared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing write file serially; read file in parallel

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 456838 RUNNING AT 50e3f0cc344e479fab9e2efaa21c33e7
=   EXIT CODE: 14
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Alarm clock (signal 14)

This is with: gcc-13.2.1-4.fc40.x86_64 mpich-4.1.2-7.fc40.x86_64

jhendersonHDF commented 1 year ago

Hi @opoplawski, hopefully https://github.com/HDFGroup/hdf5/commit/af56339d3bb0ba0076c10f929472f766c9a9a5af fixes this, at least for the t_filters_parallel test. It should be merged back to the hdf5_1_14 branch soon.

opoplawski commented 8 months ago

I don't seem to be seeing it with the latest hdf5_1_14 branch and latest Fedora Rawhide.

opoplawski commented 8 months ago

I take this back as well. It seems to succeed on the Fedora koji builders, which have more cores, but it hangs on the COPR builders, which appear to have only two cores:

make[4]: Entering directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
============================
Testing: t_cache_image 
 !! Copr timeout => sending INT
Copr build error: Build failed