Faulty behaviour of `gaspi_barrier` in case of rank errors

Flamefire commented 6 years ago

During the investigation of #30 I found that the function gaspi_barrier does not conform to the spec.

The testcase is a barrier call after one process was killed.

Expected: gaspi_barrier returns GASPI_ERROR, gaspi_state_vec_get returns GASPI_CORRUPTED for that rank.

Current behavior: gaspi_barrier returns GASPI_TIMEOUT or hangs indefinitely, gaspi_state_vec_get returns GASPI_HEALTHY for all ranks.

Testcode: gpi2_barrier.c.txt

Example output of a run with 5 tasks:

4: Started 5 ranks
4: Got timeout on barrier. Retrying...
4: Still timeout on barrier
4: reported rank 0 as healthy
4: reported rank 1 as healthy
4: reported rank 2 as healthy
4: reported rank 3 as healthy
4: reported rank 4 as healthy

Error on rank 4. Did NOT detect faulty rank.

3: Started 5 ranks
3: Got timeout on barrier. Retrying...
3: Still timeout on barrier
3: reported rank 0 as healthy
3: reported rank 1 as healthy
3: reported rank 2 as healthy
3: reported rank 3 as healthy
3: reported rank 4 as healthy

Error on rank 3. Did NOT detect faulty rank.

2: Started 5 ranks
2: Got timeout on barrier. Retrying...
2: Still timeout on barrier
2: reported rank 0 as healthy
2: reported rank 1 as healthy
2: reported rank 2 as healthy
2: reported rank 3 as healthy
2: reported rank 4 as healthy

Error on rank 2. Did NOT detect faulty rank.

0: Started 5 ranks
0: Got timeout on barrier. Retrying...
0: Still timeout on barrier
0: reported rank 0 as healthy
0: reported rank 1 as healthy
0: reported rank 2 as healthy
0: reported rank 3 as healthy
0: reported rank 4 as healthy

Error on rank 0. Did NOT detect faulty rank.

The problem is that pgaspi_dev_post_group_write at https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L551 does not return an error, and https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L575 then returns the timeout without further error checking.

mrahn commented 6 years ago

The test program is suspicious: it uses SUCCESS_OR_DIE in combination with a timeout that is not GASPI_BLOCK.
The spec says "An update is not guaranteed to update all entries in the state vector, but may only update the entries of the direct communication partners." -> So if ranks 2, 3 and 4 are not communicating with rank 1, then this is a valid outcome. The state vector is not meant to be consistent across the set of nodes (and it can't be in case of a split world).

-> So my conclusion: Close: This is not a bug but the correct behavior.

Flamefire commented 6 years ago

Feel free to change the define of TIMEOUT to GASPI_BLOCK (this can even be done on the command line: -DTIMEOUT=GASPI_BLOCK). It will just result in an endless block at that function, as it never returns. If this is the expected behavior, I'd agree with Andreas that the state vector is kind of useless for detecting error conditions, see #30. But as also mentioned there, even the GASPI spec has this example code with gaspi_barrier: if (err == GASPI_TIMEOUT && error vector indicates error) goto ERROR_HANDLING;
With which ranks would you say a process shall communicate for a gaspi_barrier? I'd argue with all, but would also accept that every process has to communicate with at least one other. So at least one rank has to be able to detect the faulty rank.
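The decision the spec example describes (timeout plus a state vector entry that indicates an error leads to error handling) can be sketched in plain C. This is only an illustration under stated assumptions: the enums below are local stand-ins so the snippet compiles without GASPI.h, their numeric values are assumed, and first_corrupt_rank and after_barrier are hypothetical helper names, not part of GASPI or GPI-2. In a real program the vector would be filled by gaspi_state_vec_get and the return code would come from gaspi_barrier.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for GASPI return codes and rank states (assumed values,
 * used here only so the sketch builds without GASPI.h). */
typedef enum { RET_ERROR = -1, RET_SUCCESS = 0, RET_TIMEOUT = 1 } ret_t;
typedef enum { STATE_HEALTHY = 0, STATE_CORRUPT = 1 } state_t;

/* What the caller should do after a barrier attempt (hypothetical). */
typedef enum { ACTION_CONTINUE, ACTION_RETRY, ACTION_RECOVER } action_t;

/* Returns the index of the first rank marked corrupt, or -1 if none is. */
int first_corrupt_rank(const unsigned char *state_vec, size_t nranks)
{
    assert(state_vec != NULL);
    for (size_t r = 0; r < nranks; ++r) {
        if (state_vec[r] == STATE_CORRUPT) {
            return (int)r;
        }
    }
    return -1;
}

/* Maps a barrier return code plus the state vector to the next step:
 * success continues, a hard error recovers, and a timeout recovers only
 * if the state vector actually reports a corrupt rank. */
action_t after_barrier(ret_t ret, const unsigned char *state_vec, size_t nranks)
{
    if (ret == RET_SUCCESS) {
        return ACTION_CONTINUE;
    }
    if (ret == RET_ERROR) {
        return ACTION_RECOVER;
    }
    return first_corrupt_rank(state_vec, nranks) >= 0 ? ACTION_RECOVER
                                                       : ACTION_RETRY;
}
```

With the output reported in the first post (timeout, all ranks healthy), after_barrier yields ACTION_RETRY on every attempt, which is the situation this issue is about: the error-handling branch is never reached.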

mrahn commented 6 years ago

Communication patterns or algorithms used in collective operations are not defined. So it is not known which other ranks are contacted in a gaspi_barrier. Each scaling implementation must not contact all other ranks. Each correct implementation must contact at least one other rank.
The output above shows 3 times "Error on rank X" with X being one of 2,3,4 but not 0. Either the output is incomplete or rank 0 detected rank 1 being corrupted. If it is the former then please provide the complete output.
SUCCESS_OR_DIE is only valid in combination with GASPI_BLOCK, which in turn might block forever. Handling errors requires a finite timeout, which in turn forbids the use of SUCCESS_OR_DIE. Programs that claim to be fault tolerant can't use GASPI_BLOCK or SUCCESS_OR_DIE.
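As a rough illustration of the alternative to SUCCESS_OR_DIE, a call with a finite timeout can be retried a bounded number of times, with the final return code handed back to the caller instead of aborting. The sketch below is not taken from the test program or from GPI-2: op_ret_t, call_with_retries and the timeout_n_times test double are hypothetical names, and the return code values are assumed stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for return codes (assumed values, illustration only). */
typedef enum { OP_ERROR = -1, OP_SUCCESS = 0, OP_TIMEOUT = 1 } op_ret_t;

/* Any operation that is invoked with a finite timeout. */
typedef op_ret_t (*op_fn)(void *ctx);

/* Calls op up to max_attempts times, stopping at the first result that is
 * not a timeout. The last return code is passed back so the caller can
 * decide how to react; nothing aborts here. */
op_ret_t call_with_retries(op_fn op, void *ctx, int max_attempts,
                           int *attempts_used)
{
    assert(op != NULL && max_attempts > 0);
    op_ret_t ret = OP_TIMEOUT;
    int n = 0;
    while (n < max_attempts) {
        ret = op(ctx);
        ++n;
        if (ret != OP_TIMEOUT) {
            break;
        }
    }
    if (attempts_used != NULL) {
        *attempts_used = n;
    }
    return ret;
}

/* Hypothetical test double: times out until the counter reaches zero,
 * then succeeds. */
op_ret_t timeout_n_times(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        --*remaining;
        return OP_TIMEOUT;
    }
    return OP_SUCCESS;
}
```

If every attempt times out, the caller still receives OP_TIMEOUT and has to consult other information, such as the state vector, to tell a slow peer from a dead one.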

Flamefire commented 6 years ago

I realized that, that's why I added the second part: one rank has to contact the faulty rank and hence detect the fault.
The output showed 0: reported rank 1 as healthy, but you were right that it got truncated while pasting. I edited the first post, which now shows that no corruption is detected.
SUCCESS_OR_DIE is defined by the programmer and in this case it was decided to treat a timeout as an error for the cases used which makes its use valid, but that is not the point here. SUCCESS_OR_DIE is not used at the barrier calls in question but only for the setup.

I want to stress that the problem reported here is that the current implementation of gaspi_barrier is NOT able to detect a faulty rank, but simply reports a TIMEOUT or blocks forever. The spec says that it updates the state vector if required. This makes one assume that it can detect faulty ranks, although one could argue that the spec does not clearly say that it must. However, this is also true for all other functions, and if that were the case it would mean that the automatic error detection of GASPI (not GPI2 in particular) is useless because it is unreliable: functions can simply time out and never set the error state.

cc-hpc-itwm / GPI-2

Faulty behaviour of `gaspi_barrier` in case of rank errors #42