ecmwf / eckit

A C++ toolkit that supports development of tools and applications at ECMWF.
https://confluence.ecmwf.int/display/eckit
Apache License 2.0

MPI communicator split failures #125

Status: Open. DJDavies2 opened this issue 4 months ago

DJDavies2 commented 4 months ago

What happened?

I am getting failures of this type:

Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.
Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.
Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.
Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16084 RUNNING AT expspicesrv053
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Tests that produce this error include eckit_test_mpi_splitcomm, eckit_test_mpi_group and eckit_test_mpi_internal_access.

What are the steps to reproduce the bug?

Build eckit and run the ctests. The failures seem to occur with MPICH but not with OpenMPI; see the command sketch below.
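
For reference, one way to run just the affected tests from the build directory could be as follows (the build directory path is a placeholder and the test selection regex is an assumption based on the test names above):

cd <eckit-build-dir>
ctest -R 'eckit_test_mpi_(splitcomm|group|internal_access)' --output-on-failure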

Version

develop

Platform (OS and architecture)

Linux

Relevant log output

No response

Accompanying data

No response

Organisation

Met Office

wdeconinck commented 3 months ago

Probably also related to https://github.com/ecmwf/fckit/issues/41. In that issue there is mention of explicit warnings like:

[WARNING] yaksa: 2 leaked handle pool objects

This yaksa is apparently the datatype engine bundled with MPICH, and the warning refers to leaked objects in its internal handle pools. My hunch is that the eckit approach of calling MPI_Finalize during the destruction of static objects (i.e. after main has returned) does not play nicely with MPICH. @tlmquintino do you have any suggestions?
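
To illustrate the pattern I mean, here is a minimal standalone sketch (plain MPI, not eckit's actual code; the names are made up) where MPI_Finalize is deferred to the destructor of a static object and therefore only runs after main has returned:

#include <mpi.h>

namespace {

// Static object whose destructor finalizes MPI. Destructors of objects with
// static storage duration run after main() has returned, which mirrors the
// "MPI_Finalize after main" pattern described above.
struct MpiEnvironment {
    MpiEnvironment() { MPI_Init(nullptr, nullptr); }
    ~MpiEnvironment() {
        int finalized = 0;
        MPI_Finalized(&finalized);
        if (!finalized) {
            MPI_Finalize();  // runs during static destruction, after main()
        }
    }
};

MpiEnvironment environment;  // constructed before main(), destroyed after it

}  // namespace

int main() {
    // Exercise a communicator split, as the failing tests do.
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm split = MPI_COMM_NULL;
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &split);
    MPI_Comm_free(&split);

    return 0;  // MPI_Finalize has not been called yet at this point
}

If something like this reproduces the yaksa warnings or the signal 9 under MPICH but not under OpenMPI, that would support the finalize-after-main explanation.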