Open opoplawski opened 1 year ago
Looks like t_cache_image test hung.
t_cache_image in one run, t_filters_parallel in another timed out.
@opoplawski Would it be possible to try the same builds using at least MPICH 4.0.3 to see if there were any bugs fixed that may have caused these issues?
We've noticed a bug in mpich 4.0-4.0.3 using the default "ch4:ofi" device. Last we checked it had been fixed in the 4.1 branch. If you need to use 4.0-4.0.3 you can work around it by building mpich with "--with-device=ch4:ucx" or "--with-device=ch3"
Edit: Just confirmed the 4.1 release passes with ch4:ofi
Looks like the Fedora mpich package is built with --with-device=ch3:nemesis
@opoplawski - Is this still a problem with the hdf5_1_14 branch?
Looks like it is:
make[4]: Entering directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
============================
Testing: t_filters_parallel
============================
Test log for t_filters_parallel
============================
==========================
Parallel Filters tests
==========================
** Some tests will be skipped due to TestExpress setting.
** Exhaustive tests will only be performed for the first available filter.
** Set the HDF5TestExpress environment variable to 0 to perform exhaustive testing for all available filters.
Test Info:
MPI size: 6
Test express level: 3
Using seed: 1697792661
...
== Running tests in mode 'USE_MULTIPLE_DATASETS_MIXED_FILTERED' with filter 'Deflate' using selection I/O mode 'on', 'Multi-Chunk I/O' and 'Early' allocation time ==
Testing write to one-chunk filtered dataset
Testing write to unshared filtered chunks
Testing partial write to unshared filtered chunks
Testing write to shared filtered chunks
Testing write to unshared filtered chunks w/ single unlimited dimension
Testing write to shared filtered chunks w/ single unlimited dimension
Testing write to unshared filtered chunks w/ two unlimited dimensions
Testing write to shared filtered chunks w/ two unlimited dimensions
Testing write to filtered chunks with a single process having no selection
Testing write to filtered chunks with all processes having no selection
Testing write to filtered chunks with point selection
Testing interleaved write to filtered chunks
Testing write to unshared transformed and filtered chunks
Testing write to unshared filtered chunks on separate pages in 3D dataset
Testing write to unshared filtered chunks on the same pages in 3D dataset
Testing write to shared filtered chunks in 3D dataset
Testing write to unshared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing write to shared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing write to unshared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing write to shared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing read from one-chunk filtered dataset
Testing read from unshared filtered chunks
Testing read from shared filtered chunks
Testing read from filtered chunks with a single process having no selection
Testing read from filtered chunks with all processes having no selection
Testing read from filtered chunks with point selection
Testing interleaved read from filtered chunks
Testing read from unshared transformed and filtered chunks
Testing read from unshared filtered chunks on separate pages in 3D dataset
Testing read from unshared filtered chunks on the same pages in 3D dataset
Testing read from shared filtered chunks in 3D dataset
Testing read from unshared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing read from shared filtered chunks in Compound Datatype dataset without Datatype conversion
Testing read from unshared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing read from shared filtered chunks in Compound Datatype dataset with Datatype conversion
Testing write file serially; read file in parallel
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 456838 RUNNING AT 50e3f0cc344e479fab9e2efaa21c33e7
= EXIT CODE: 14
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Alarm clock (signal 14)
This is with: gcc-13.2.1-4.fc40.x86_64 mpich-4.1.2-7.fc40.x86_64
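For anyone reading the log above: signal 14 is SIGALRM on Linux, and "Alarm clock" is the text reported for a process killed by it. That is what you would see if a timer armed with alarm() fired while the test was still stuck. A minimal sketch, assuming a POSIX system and Python 3, that reproduces the same kind of termination with a deliberately hung child process. It is only an illustration, not code from the HDF5 test harness, and the names `CHILD` and `run_hung_child` are made up for this example.

```python
import signal
import subprocess
import sys

# Child program: arm a 1-second alarm, then block (a stand-in for a hung test).
# No SIGALRM handler is installed, so the default action terminates the child.
CHILD = "import signal, time; signal.alarm(1); time.sleep(60)"


def run_hung_child():
    """Run the hung child and return its exit status as seen by the parent."""
    proc = subprocess.run([sys.executable, "-c", CHILD])
    return proc.returncode


if __name__ == "__main__":
    rc = run_hung_child()
    # subprocess reports "killed by signal N" as a return code of -N.
    print(rc, signal.Signals(-rc).name)
```

A negative return code from `subprocess` means the child was killed by that signal, so a timeout of this kind can be told apart from an ordinary test failure.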
Hi @opoplawski, hopefully https://github.com/HDFGroup/hdf5/commit/af56339d3bb0ba0076c10f929472f766c9a9a5af should fix this, at least for the t_filters_parallel test. It should be merged back to the hdf5_1_14 branch soon.
I don't seem to be seeing it with the latest hdf5_1_14 branch and latest Fedora Rawhide.
I take this back as well. It seems to succeed in the Fedora koji builders which have more cores. But it hangs in the COPR builders which appear to only have two cores:
make[4]: Entering directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
============================
Testing: t_cache_image
!! Copr timeout => sending INT
Copr build error: Build failed
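One quick thing to check on a builder is whether the tests are started with more MPI ranks than there are cores. The t_filters_parallel log earlier in this thread reports "MPI size: 6", while the COPR builders appear to have two. A minimal sketch of that comparison follows; whether oversubscription is actually what causes the hang is an assumption, not something confirmed here, and `ranks_exceed_cores` is a hypothetical helper name.

```python
import os


def ranks_exceed_cores(mpi_size, cores=None):
    """Return True when more MPI ranks are requested than cores are available."""
    if cores is None:
        # os.cpu_count() may return None if the count cannot be determined.
        cores = os.cpu_count() or 1
    return mpi_size > cores


if __name__ == "__main__":
    # Values taken from this thread: 6 ranks, builders with roughly 2 cores.
    print(ranks_exceed_cores(6, cores=2))
```

Calling it without `cores` uses whatever the current machine reports, which makes it easy to drop into a build script on each builder.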
Describe the bug
Looking at updating the Fedora hdf5 package to 1.14.0. Seeing the following test failure:
Platform (please complete the following information)
Test builds are happening here: https://copr.fedorainfracloud.org/coprs/orion/hdf5-1.14/builds and full logs can be found there.