Open Flamefire opened 6 years ago
`SUCCESS_OR_DIE` in combination with a timeout that is not `GASPI_BLOCK`
-> So my conclusion: Close: This is not a bug but the correct behavior.
`TIMEOUT` to `GASPI_BLOCK` (can even be done with the command line: `-DTIMEOUT=GASPI_BLOCK`) it will just result in an endless block at that function as it never returns. If this is the expected behavior I'd agree with Andreas that the state vector is kind of useless to detect error conditions, see #30. But as also mentioned there even the GASPI spec has this example code with `gaspi_barrier`: `if (err == GASPI_TIMEOUT && error vector indicates error) goto ERROR_HANDLING;`
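The pattern quoted from the spec can be sketched as a small decision function. This is only an illustration: the `sketch_*` enums, `first_corrupt_rank`, and `classify_barrier_result` are hypothetical names invented here, and the numeric values are stand-ins rather than the constants from `GASPI.h`. Real code would pass in the return value of `gaspi_barrier` and the vector filled by `gaspi_state_vec_get`.

```c
#include <stddef.h>

/* Stand-in values for this sketch only; real code would use the
 * return codes and state constants declared in GASPI.h. */
enum sketch_ret    { SKETCH_ERROR = -1, SKETCH_SUCCESS = 0, SKETCH_TIMEOUT = 1 };
enum sketch_state  { SKETCH_HEALTHY = 0, SKETCH_CORRUPT = 1 };
enum sketch_action { ACTION_PROCEED, ACTION_RETRY, ACTION_HANDLE_ERROR };

/* Returns the index of the first rank not marked healthy, or -1 if none. */
int first_corrupt_rank(const unsigned char *state_vec, size_t nranks)
{
    for (size_t i = 0; i < nranks; ++i) {
        if (state_vec[i] != SKETCH_HEALTHY) {
            return (int)i;
        }
    }
    return -1;
}

/* Decides what to do after a barrier call made with a finite timeout. */
enum sketch_action classify_barrier_result(int ret,
                                           const unsigned char *state_vec,
                                           size_t nranks)
{
    if (ret == SKETCH_SUCCESS) {
        return ACTION_PROCEED;
    }
    /* A timeout with an all-healthy state vector gives no reason to abort. */
    if (ret == SKETCH_TIMEOUT && first_corrupt_rank(state_vec, nranks) < 0) {
        return ACTION_RETRY;
    }
    /* Either a hard error, or a timeout with a rank flagged in the vector. */
    return ACTION_HANDLE_ERROR;
}
```

With the behavior reported in this issue (timeout while every rank is still reported healthy), a loop built on such a function would keep choosing the retry branch, which is why the state vector needs to be updated for this pattern to work.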
`gaspi_barrier`? I'd argue with all, but would also accept that every process has to communicate with at least one other. So at least one rank has to be able to detect the faulty rank.

`gaspi_barrier`. Each scaling implementation must not contact all other ranks. Each correct implementation must contact at least one other rank.

`SUCCESS_OR_DIE` is only valid in combination with `GASPI_BLOCK`, which in turn might block forever. Handling errors requires a finite timeout, which in turn forbids the usage of `SUCCESS_OR_DIE`. Programs that claim to be fault tolerant can't use `GASPI_BLOCK` or `SUCCESS_OR_DIE`.

`0: reported rank 1 as healthy` but you were right that it got truncated while pasting. I edited the first post, which shows that no corruption is detected.

`SUCCESS_OR_DIE` is defined by the programmer and in this case it was decided to treat a timeout as an error for the cases used, which makes its use valid, but that is not the point here. `SUCCESS_OR_DIE` is not used at the barrier calls in question but only for the setup.

I want to stress that the problem reported here is that the current implementation of `gaspi_barrier` is NOT able to detect a faulty rank, but simply reports a TIMEOUT or blocks forever. The spec says that it updates the state vector if required. This makes one assume that it can detect faulty ranks, although one could argue that the spec does not clearly say that it must. However, this is also true for all other functions, and if that were the case it would mean that the automatic error detection of GASPI (not GPI2 in particular) is useless because it is unreliable: functions can simply time out and never set the error state.
During investigation of #30 I found the function `gaspi_error` not conforming to the specs.

The testcase is a barrier call after one process was killed.

Expected: `gaspi_barrier` returns `GASPI_ERROR`, `gaspi_state_vec_get` returns `GASPI_CORRUPTED` for that rank

Current behavior: `gaspi_barrier` returns `GASPI_TIMEOUT` or hangs indefinitely, `gaspi_state_vec_get` returns `GASPI_HEALTHY` for all ranks.

Testcode: gpi2_barrier.c.txt
Example output of a run with 5 tasks:
Problem is that `pgaspi_dev_post_group_write` at https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L551 does not return an error and then https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L575 returns the timeout without further error checking.
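To illustrate what "further error checking" before returning a timeout could look like, here is a generic sketch of a bounded wait loop. It is not GPI-2 code and makes no claim about how `GPI2_GRP.c` is structured: `wait_with_failure_check`, the `poll_fn` callback type, and the two demo callbacks are all hypothetical names introduced for this example, and the poll budget stands in for a real timeout.

```c
#include <stdbool.h>

enum wait_result { WAIT_ERROR = -1, WAIT_DONE = 0, WAIT_TIMEOUT = 1 };

/* Callback type: inspects the caller-provided context and answers yes/no. */
typedef bool (*poll_fn)(void *ctx);

/* Polls is_done up to max_polls times. When the budget runs out, it asks
 * peer_failed once before deciding between an error and a plain timeout. */
enum wait_result wait_with_failure_check(poll_fn is_done, poll_fn peer_failed,
                                         void *ctx, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; ++i) {
        if (is_done(ctx)) {
            return WAIT_DONE;
        }
    }
    /* Budget exhausted: distinguish a dead peer from a slow one. */
    return peer_failed(ctx) ? WAIT_ERROR : WAIT_TIMEOUT;
}

/* Demo callback: never reports true. */
bool always_false(void *ctx)
{
    (void)ctx;
    return false;
}

/* Demo callback: reports the bool that ctx points to. */
bool flag_set(void *ctx)
{
    return *(bool *)ctx;
}
```

For example, `wait_with_failure_check(always_false, flag_set, &failed, 10)` yields `WAIT_ERROR` when `failed` is true and `WAIT_TIMEOUT` otherwise, so the caller can tell a failed peer apart from one that is merely late.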