Xilinx / ACCL

Alveo Collective Communication Library: MPI-like communication operations for Xilinx Alveo accelerators
https://accl.readthedocs.io/
Apache License 2.0

MPI_ABORT invoked at the end of test run with failures #104

Closed: quetric closed this issue 1 year ago

quetric commented 1 year ago

Observed on a test against the emulator, with 8 ranks. All tests run and some fail, then the run is killed at the very end with:

[1,7]<stdout>:3 tests failed on rank 7 (skipped 1 tests).
[1,6]<stdout>:3 tests failed on rank 6 (skipped 1 tests).
[1,2]<stdout>:3 tests failed on rank 2 (skipped 1 tests).
[1,3]<stdout>:3 tests failed on rank 3 (skipped 1 tests).
[1,4]<stdout>:3 tests failed on rank 4 (skipped 1 tests).
[1,1]<stdout>:3 tests failed on rank 1 (skipped 1 tests).
[1,5]<stdout>:3 tests failed on rank 5 (skipped 1 tests).
[1,0]<stdout>:3 tests failed on rank 0 (skipped 1 tests).
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
Mellich commented 1 year ago

MPI_Abort is explicitly called in our test suite if tests fail. I guess the idea is to make it easy to see that something went wrong. We may consider changing this to return a non-zero exit value instead.

quetric commented 1 year ago

This seems too heavy-handed. We already report which tests failed, so killing MPI only confuses things. I'll remove the abort.

quetric commented 1 year ago

Fixed in 9d833db21da6 by using the number of failed tests as the exit code instead of invoking MPI_Abort.
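
For readers who want to see the idea in code, here is a minimal sketch of turning a failed-test count into a process exit code. It is not the code from the commit above: the helper name `exit_code_from_failures` is hypothetical, and the clamp to 255 is an added assumption, included because POSIX exposes only the low 8 bits of an exit status to the parent process, so a raw count of 256 would otherwise look like success.

```cpp
#include <algorithm>

// Hypothetical helper, not taken from the ACCL sources: map the number of
// failed tests on this rank to a process exit status.
int exit_code_from_failures(int failed_tests) {
    // No failures (or a nonsensical negative count) maps to success.
    if (failed_tests <= 0) {
        return 0;
    }
    // Assumption: clamp to 255 so that large counts cannot wrap around to 0
    // once the status is truncated to 8 bits.
    return std::min(failed_tests, 255);
}
```

In a test driver, the result would be returned from `main` after `MPI_Finalize()` has been called, so each rank shuts down cleanly and still reports failure through its exit status.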