cc-hpc-itwm / GPI-2

GPI-2
http://www.gpi-site.com/
GNU General Public License v3.0

Faulty behaviour of `gaspi_barrier` in case of rank errors #42

Open Flamefire opened 6 years ago

Flamefire commented 6 years ago

While investigating #30, I found that the function gaspi_barrier does not conform to the spec.

The testcase is a barrier call after one process was killed.

Expected: gaspi_barrier returns GASPI_ERROR, and gaspi_state_vec_get returns GASPI_CORRUPTED for that rank.

Current behavior: gaspi_barrier returns GASPI_TIMEOUT or hangs indefinitely, and gaspi_state_vec_get returns GASPI_HEALTHY for all ranks.
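The pass/fail condition of such a test can be written down as a small predicate. This is only a sketch: the enums below are local stand-ins and not the GPI-2 definitions, and `fault_detected` and its `killed_rank` parameter are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for illustration only; NOT the GPI-2 types or constants. */
typedef enum { BARRIER_SUCCESS, BARRIER_TIMEOUT, BARRIER_ERROR } barrier_ret_t;
typedef enum { RANK_HEALTHY = 0, RANK_CORRUPT = 1 } rank_state_t;

/*
 * Returns true only if the outcome matches the expected behavior described
 * above: the barrier reported an error AND the state vector marks the killed
 * rank as corrupt. A timeout with an all-healthy vector counts as "not detected".
 */
bool fault_detected(barrier_ret_t ret, const rank_state_t *states,
                    size_t num_ranks, size_t killed_rank)
{
    if (states == NULL || killed_rank >= num_ranks) {
        return false; /* invalid input cannot count as a detection */
    }
    return ret == BARRIER_ERROR && states[killed_rank] == RANK_CORRUPT;
}
```

With five ranks and an all-healthy state vector, a `BARRIER_TIMEOUT` result makes the predicate return false, which corresponds to the "Did NOT detect faulty rank" lines in the output below.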

Testcode: gpi2_barrier.c.txt

Example output of a run with 5 tasks:

4: Started 5 ranks
4: Got timeout on barrier. Retrying...
4: Still timeout on barrier
4: reported rank 0 as healthy
4: reported rank 1 as healthy
4: reported rank 2 as healthy
4: reported rank 3 as healthy
4: reported rank 4 as healthy

Error on rank 4. Did NOT detect faulty rank.

3: Started 5 ranks
3: Got timeout on barrier. Retrying...
3: Still timeout on barrier
3: reported rank 0 as healthy
3: reported rank 1 as healthy
3: reported rank 2 as healthy
3: reported rank 3 as healthy
3: reported rank 4 as healthy

Error on rank 3. Did NOT detect faulty rank.

2: Started 5 ranks
2: Got timeout on barrier. Retrying...
2: Still timeout on barrier
2: reported rank 0 as healthy
2: reported rank 1 as healthy
2: reported rank 2 as healthy
2: reported rank 3 as healthy
2: reported rank 4 as healthy

Error on rank 2. Did NOT detect faulty rank.

0: Started 5 ranks
0: Got timeout on barrier. Retrying...
0: Still timeout on barrier
0: reported rank 0 as healthy
0: reported rank 1 as healthy
0: reported rank 2 as healthy
0: reported rank 3 as healthy
0: reported rank 4 as healthy

Error on rank 0. Did NOT detect faulty rank.

The problem is that pgaspi_dev_post_group_write at https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L551 does not return an error, and then https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L575 returns the timeout without further error checking.
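To make the missing step concrete, here is a rough sketch of the kind of check I would expect before a timeout is returned. It is not GPI-2 code: the types are local stand-ins, `resolve_barrier_wait` is a hypothetical helper, and the `peer_alive` array assumes some liveness probe whose implementation is left open here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for illustration only; NOT the GPI-2 return codes. */
typedef enum { WAIT_SUCCESS, WAIT_TIMEOUT, WAIT_ERROR } wait_ret_t;

/*
 * Decide what to return once the barrier wait has ended.
 *   wait_completed: true if all peers arrived before the timeout expired.
 *   peer_alive[i]:  result of a liveness probe for rank i (assumed to exist).
 *   state_vec[i]:   set to 1 for every rank found dead, left untouched otherwise.
 * A timeout is reported only if every peer still appears to be alive.
 */
wait_ret_t resolve_barrier_wait(bool wait_completed, const bool *peer_alive,
                                unsigned char *state_vec, size_t num_ranks)
{
    if (wait_completed) {
        return WAIT_SUCCESS;
    }

    bool found_dead_peer = false;
    for (size_t i = 0; i < num_ranks; ++i) {
        if (!peer_alive[i]) {
            state_vec[i] = 1; /* record the failure in the state vector */
            found_dead_peer = true;
        }
    }
    return found_dead_peer ? WAIT_ERROR : WAIT_TIMEOUT;
}
```

The point of the sketch is the ordering: the state vector is updated first, and a plain timeout is only returned when no peer failure was found.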

mrahn commented 6 years ago

-> So my conclusion: Close: This is not a bug but the correct behavior.

Flamefire commented 6 years ago

I want to stress that the problem reported here is that the current implementation of gaspi_barrier is NOT able to detect a faulty rank; it simply reports a TIMEOUT or blocks forever. The spec says that it updates the state vector if required. This leads one to assume that it can detect faulty ranks, although one could argue that the spec does not clearly say that it must. However, the same is true for all other functions, and if detection were not required, the automatic error detection of GASPI (not GPI2 in particular) would be useless because it is unreliable: functions could simply time out and never set the error state.