flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

PMI_Abort job hangs #3277

Closed garlick closed 6 months ago

garlick commented 4 years ago

On our coffee call, @dongahn and @stevwonder mentioned a problem with job hangs when PMI_Abort() is called. I did not find an open issue, so I wanted to open this one as a placeholder.

For a quick test, src/common/libpmi/test_pmi_info does have a mode where a selected rank calls PMI_Abort(). Here's an example of that:

$ flux mini run -N8 -n8 ./test_pmi_info -a 7
0.090s: flux-shell[7]: FATAL: MPI_Abort: Test abort error. ok. yeah!
0.095s: job.exception type=exec severity=0 MPI_Abort: Test abort error. ok. yeah!
6: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
5: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
1: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
3: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
4: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
2: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
0: size=8 appnum=0 maxes=64:64:1024 kvsname=ƒDxdEtYf
flux-job: task(s) exited with exit code 1
$

If we can get more info on these hangs (or reference an open issue if I missed it), I'd appreciate it.

dongahn commented 4 years ago

@garlick: I forwarded you one discussion thread in our mailing list. May or may not be the same issue our COVID-19 drug design workflow is having.

dongahn commented 4 years ago

@garlick: I couldn't find all the details for the COVID-19 hang case. As I recall, the problem was this: @XiaohuaZhangLLNL ran a DAT on Quartz, and one 25-node job submitted into the Flux instance hung. He reported that the job was supposed to raise an exception or crash. So we translated this to calling PMI_Abort() which may not have been the case.

@XiaohuaZhangLLNL: you may recall this problem better than I do. Do you know whether your job exits or crashes when this condition happens? I vaguely remember it read a corrupted HDF file?

SteVwonder commented 4 years ago

I vaguely remember it read a corrupted HDF file?

I remember the same. It failed to read a file, printed some errors to stderr which didn't appear due to stderr buffering (https://github.com/flux-framework/flux-core/issues/3041), and then hung.

So we translated this to calling PMI_Abort() which may not have been the case.

In an email thread we mentioned that we thought MPI_Abort was called and then the hang occurred, but I also can't remember why we thought it was MPI_Abort that was at fault. Did we have a stack trace?

dongahn commented 4 years ago

FYI -- @XiaohuaZhangLLNL gave me his reproducer some time back and I will try to triage. The repro is at 75 nodes so it may take some time to get the resources, though.

dongahn commented 4 years ago

I ran his repro for 4 hours and checked the state. One job that should hang didn't hang. But its output is interesting.

[zhang30@quartz1154 x82]$ cat flux-67729620992.out
[0-8]: stdout redirected to job-{{jobid}}.out
[0-8]: stderr redirected to job-{{jobid}}.out
2.574s: flux-shell[0]: FATAL: MPI_Abort: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
2.577s: job.exception type=exec severity=0 MPI_Abort: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
2.584s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.584s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.585s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.585s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.590s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.590s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.590s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.592s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.593s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.590s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.591s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.591s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.591s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.593s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.593s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.597s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.590s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.588s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.589s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.592s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.592s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.595s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.595s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.596s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.597s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.585s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.585s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.587s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.597s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.600s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.601s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.607s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.607s: flux-shell[2]: ERROR: shell_output_write: Function not implemented
2.596s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.597s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.600s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.601s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.606s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.604s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.604s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.606s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.606s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.609s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.611s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.613s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.605s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.606s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.608s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.609s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.613s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.613s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.613s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.614s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.615s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.611s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.612s: flux-shell[5]: ERROR: shell_output_write: Function not implemented
2.615s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.618s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.619s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.620s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.622s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.623s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.623s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.616s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.617s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.618s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.619s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.620s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.628s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.630s: flux-shell[6]: ERROR: shell_output_write: Function not implemented
2.631s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.632s: flux-shell[13]: ERROR: shell_output_write: Function not implemented
2.591s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.591s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.593s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.594s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.592s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.592s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.601s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.601s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.600s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.606s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.594s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.594s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.596s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.598s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.599s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.601s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.602s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.604s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.607s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.609s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.611s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.605s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.607s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.609s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.610s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.613s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.615s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.607s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
2.608s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.609s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.611s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.612s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.614s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.617s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.618s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.624s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.626s: flux-shell[4]: ERROR: shell_output_write: Function not implemented
2.626s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.627s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.634s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.627s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.630s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.634s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.628s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.630s: flux-shell[3]: ERROR: shell_output_write: Function not implemented
2.629s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.631s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.635s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.636s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.637s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.639s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.640s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.646s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.637s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.641s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.644s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.640s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.642s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.647s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.648s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.650s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.651s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.652s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.653s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.654s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.648s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.652s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.658s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.661s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.663s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.664s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.666s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.666s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.669s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.670s: flux-shell[22]: ERROR: shell_output_write: Function not implemented
2.670s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.671s: flux-shell[8]: ERROR: shell_output_write: Function not implemented
2.672s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.673s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.674s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
2.676s: flux-shell[21]: ERROR: shell_output_write: Function not implemented
flux-job: task(s) exited with exit code 143
2020-11-13T19:40:59.546173Z broker.err[0]: rc2.0: /bin/bash -c /var/tmp/zhang30/flux-script-67729620992-PQSdJE Exited (rc=143) 3.7s

dongahn commented 4 years ago

At the top-level Flux, flux jobs -a reports:

    ƒ2nC31f5 zhang30  CDT3Dockin  F     25     25   6.083s [0-24]

So this job only ran 6 seconds.

dongahn commented 4 years ago

Its eventlog seems normal...

[zhang30@quartz1154 flux]$ flux job eventlog --path=guest.exec.eventlog ƒ2nC31f5
1605296454.188744 init
1605296454.204379 starting
1605296454.417976 shell.init leader-rank=0 size=25 service="41228-shell-67729620992"
1605296454.433047 shell.start task-count=25
1605296460.267921 complete status=36608
1605296460.267971 cleanup.start ranks="all"
1605296460.270543 cleanup.finish ranks="all" rc=0
1605296460.271948 done
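
For readers decoding the numbers above: the sketch below assumes that `status=36608` is a POSIX wait status and that the exit codes follow the common shell convention of 128 plus the signal number. Neither assumption is confirmed in this thread, and the helper names are made up for illustration.

```c
#include <signal.h>
#include <sys/wait.h>

/* Exit code carried in a POSIX wait status, or -1 if the process
 * did not exit normally. */
int exit_code_from_status (int status)
{
    return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}

/* Under the 128+N convention (an assumption here), an exit code above 128
 * means "terminated by signal N". Returns 0 if no signal is encoded. */
int signal_from_exit_code (int code)
{
    return code > 128 ? code - 128 : 0;
}
```

Under those assumptions, 36608 is 143 * 256, so it decodes to exit code 143, and 143 - 128 = 15, which is SIGTERM. That would fit a job that was terminated after the exception rather than one that finished on its own.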

dongahn commented 4 years ago

I have a 4 PM meeting with the user, so I will check whether the above output is the same as what he saw during the hang.

grondo commented 4 years ago

Looks like shell_rank 0 exited before the other ranks had sent all their output. It may be that there is some missing logic to at least give a chance for all output to be flushed before exiting the rank 0 job shell.

Maybe a small reproducer would be a 2 node job where one task calls MPI_Abort and other tasks try to write a bunch of error output?
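
As a rough illustration of the "give output a chance to be flushed" logic described above, here is a minimal sketch of bounded-wait bookkeeping for the leader shell. It is not taken from flux-shell; the struct, the function names, and the idea of counting one EOF per shell are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for the leader (rank 0) shell. */
struct output_drain {
    int nshells;      /* shells expected to report end of output */
    int eof_count;    /* shells that have reported end of output so far */
    double deadline;  /* absolute time (seconds) at which waiting stops */
};

/* Start waiting: allow up to `grace` seconds from `now`. */
void drain_begin (struct output_drain *d, int nshells, double now, double grace)
{
    d->nshells = nshells;
    d->eof_count = 0;
    d->deadline = now + grace;
}

/* Record that one more shell has finished sending output. */
void drain_eof (struct output_drain *d)
{
    if (d->eof_count < d->nshells)
        d->eof_count++;
}

/* The leader may exit once every shell reported EOF or the deadline passed. */
bool drain_done (const struct output_drain *d, double now)
{
    return d->eof_count >= d->nshells || now >= d->deadline;
}
```

The deadline keeps the leader from waiting forever on a shell that never reports, which matters here because an unbounded wait would trade lost output for a new way to hang.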

dongahn commented 4 years ago

Yeah, I suspect a race condition like this. I will talk to the user to see if he can describe the failure mode in more detail so I can scale the reproducer down further.

Could a race lead to a hang, though?

grondo commented 4 years ago

Could a race lead to a hang, though?

Yes, I suppose so if some job shells get stuck trying to exit.

dongahn commented 4 years ago

OK. More clarification from @XiaohuaZhangLLNL.

The hang he observed was the hang at the top-level Flux instance. Back then, he didn't check whether this nested flux batch job (submitted with flux mini batch) was hung or not.

So I will circle back to this reproducer (there are two nested flux jobs running and expected to complete before 8PM) and see if the top-level flux instance will be hung or not.

Nevertheless, this race should be further diagnosed and fixed.

dongahn commented 4 years ago

@XiaohuaZhangLLNL and @grondo: the top-level Flux instance ran to completion just fine once the other two nested Flux jobs ran to completion. So my best guess is:

2.584s: flux-shell[1]: ERROR: shell_output_write: Function not implemented

The above error occurred back then which caused a racy hang. Then, this hang-up caused the top-level flux instance to hang in turn as well.

So I will focus on diagnosing and fixing the job-shell race problem first. Once that's fixed, I will bring that version to production and see if other hangs occur.

Thank you for working with me to diagnose this issue.

XiaohuaZhangLLNL commented 4 years ago

@dongahn Thank you! It is great that the job doesn't hang. Hope this hang-up will not happen in the future.

dongahn commented 3 years ago

Hope this hang-up will not happen in the future.

Just to make sure: I believe the hang-up can still occur if the MPI_Abort race bug above manifests itself as a hang (on the nested instance). I am trying to create a smaller reproducer at the moment. Thanks.

dongahn commented 3 years ago

@grondo:

I believe I reproduced this condition at a small scale with the simple reproducer below. When you look into this problem, I'd be curious whether it is possible for this race to lead to a hang. If it is, I would conclude that the hang-up @XiaohuaZhangLLNL experienced was caused by it. Otherwise, there might be some additional bugs.

flux_mpi_abort_bug.c:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

void usage ()
{
    fprintf (stderr, "Usage: flux_mpi_abort_bug RANK ITER\n");
    fprintf (stderr, "  Test code where all of the MPI rank processes\n");
    fprintf (stderr, "  except the RANK process print one-line stdout output\n");
    fprintf (stderr, "  every 100 milliseconds.\n");
    fprintf (stderr, "  RANK prints one line and calls MPI_Abort immediately after.\n");
}

int main (int argc, char *argv[])
{
    int size, rank, iter_size, i, bad_rank;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (argc != 3) {
        if (rank == 0)
            usage ();
        MPI_Finalize ();
        return EXIT_FAILURE;
    }

    bad_rank = atoi (argv[1]);
    iter_size = atoi (argv[2]);
    if (bad_rank < 0 || bad_rank >= size) {
        if (rank == 0)
            usage ();
        MPI_Finalize ();
        return EXIT_FAILURE;
    }

    fprintf (stdout, "[rank=%d] Starting...\n", rank);
    if (rank == bad_rank) {
        MPI_Abort (MPI_COMM_WORLD, 1);
    }

    for (i = 0; i < iter_size; ++i) {
        usleep (100000);
        fprintf (stdout, "[rank=%d] I am at iteration=%d\n", rank, i);
    }

    MPI_Finalize ();
    return EXIT_SUCCESS;
}

Makefile:

MPICC := mpicc
CFLAGS := -g -O0

all: flux_mpi_abort_bug

flux_mpi_abort_bug: flux_mpi_abort_bug.o
	$(MPICC) $(CFLAGS) $^ -o $@

flux_mpi_abort_bug.o: flux_mpi_abort_bug.c
	$(MPICC) $(CFLAGS) $^ -c -o $@

clean:
	rm *~ *.o

quartz764{dahn}29: /usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux start -s 2 flux mini run -N 2 -n 6 flux_mpi_abort_bug 0 10
0.401s: flux-shell[0]: FATAL: MPI_Abort: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
0.404s: job.exception type=exec severity=0 MPI_Abort: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
0.409s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
0.409s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
0.412s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
0.412s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
0.412s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
0.413s: flux-shell[1]: ERROR: shell_output_write: Function not implemented
flux-job: task(s) exited with exit code 143
2020-11-16T21:14:42.466845Z broker.err[0]: rc2.0: flux mini run -N 2 -n 6 flux_mpi_abort_bug 0 10 Exited (rc=143) 1.0s
flux-start: 0 (pid 67236) exited with rc=143

dongahn commented 3 years ago

OK. I remembered another user experienced a hang when his code crashed so I created another reproducer by slightly modifying the above test code. With this, I was able to create a hang (one rank raises SIGSEGV). This seems to hang only with SLURM.

@XiaohuaZhangLLNL's case might have been that.

diff --git a/flux_mpi_abort_bug.c b/flux_crash_bug.c
index d0438a1..fee8f40 100644
--- a/flux_mpi_abort_bug.c
+++ b/flux_crash_bug.c
@@ -2,14 +2,15 @@
 #include <stdlib.h>
 #include <stdio.h>
 #include <mpi.h>
+#include <signal.h>

 void usage ()
 {
-    fprintf (stderr, "Usage: flux_mpi_abort_bug RANK ITER\n");
+    fprintf (stderr, "Usage: flux_crash_bug RANK ITER\n");
     fprintf (stderr, "  Test code where all of the MPI rank processes\n");
     fprintf (stderr, "  except the RANK process print one-line stdout output\n");
     fprintf (stderr, "  every 100 milliseconds.\n");
-    fprintf (stderr, "  RANK prints one line and calls MPI_Abort immediately after.\n");
+    fprintf (stderr, "  RANK prints one line and raise SIGSEGV immediately after.\n");
 }

 int main (int argc, char *argv[])
@@ -38,7 +39,7 @@ int main (int argc, char *argv[])

     fprintf (stdout, "[rank=%d] Starting...\n", rank);
     if (rank == bad_rank) {
-        MPI_Abort (MPI_COMM_WORLD, 1);
+        raise (SIGSEGV);
     }

     for (i = 0; i < iter_size; ++i) {

Makefile:

MPICC := mpicc
CFLAGS := -g -O0

all: flux_mpi_abort_bug flux_crash_bug

flux_mpi_abort_bug: flux_mpi_abort_bug.o
	$(MPICC) $(CFLAGS) $^ -o $@

flux_mpi_abort_bug.o: flux_mpi_abort_bug.c
	$(MPICC) $(CFLAGS) $^ -c -o $@

flux_crash_bug: flux_crash_bug.o
	$(MPICC) $(CFLAGS) $^ -o $@

flux_crash_bug.o: flux_crash_bug.c
	$(MPICC) $(CFLAGS) $^ -c -o $@

clean:
	rm *~ *.o

quartz764{dahn}60: srun -N 2 -n 2 --mpi=none --mpibind=off -ppdebug /usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux start flux mini run -N 2 -n 2 ./flux_crash_bug 0 10
srun: job 6053575 queued and waiting for resources
srun: job 6053575 has been allocated resources
[quartz9:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)

<HANG>

grondo commented 3 years ago

For a severity=0 job exception, the job-exec module should send SIGTERM to all job shells, then after a timeout, SIGKILL. Therefore, in this simple case I can't think how the job would hang, unless there is some race where the job exception is missed by the job-exec module or even the MPI Abort isn't received by the shell, so no job exception is generated.
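
To make the SIGTERM-then-SIGKILL sequence concrete, here is a minimal POSIX sketch of that escalation against a single child process. It is not the job-exec module's code; `spawn_sleeper` and `terminate_with_escalation` are hypothetical names, the 10 ms polling interval is arbitrary, and the helper assumes the caller's SIGTERM disposition is the default.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical test helper: fork a child that sleeps forever.
 * If ignore_term is nonzero, the child ignores SIGTERM. */
pid_t spawn_sleeper (int ignore_term)
{
    pid_t pid;

    if (ignore_term)
        signal (SIGTERM, SIG_IGN);  /* set before fork so the child inherits it */
    pid = fork ();
    if (pid == 0) {
        for (;;)
            pause ();
    }
    if (ignore_term)
        signal (SIGTERM, SIG_DFL);  /* restore the default in the parent */
    return pid;
}

/* Send SIGTERM, poll for exit for up to grace_ms, then send SIGKILL.
 * Returns the signal that terminated the child, 0 if it exited normally,
 * or -1 on error. */
int terminate_with_escalation (pid_t pid, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int status = 0;
    int waited_ms;
    pid_t rc = 0;

    if (pid <= 0 || kill (pid, SIGTERM) < 0)
        return -1;
    for (waited_ms = 0; waited_ms < grace_ms; waited_ms += 10) {
        rc = waitpid (pid, &status, WNOHANG);
        if (rc != 0)
            break;                  /* child reaped, or waitpid failed */
        nanosleep (&tick, NULL);
    }
    if (rc == 0) {                  /* still running after the grace period */
        kill (pid, SIGKILL);
        rc = waitpid (pid, &status, 0);
    }
    if (rc < 0)
        return -1;
    return WIFSIGNALED (status) ? WTERMSIG (status) : 0;
}
```

A process that merely ignores SIGTERM is still cleaned up by this scheme, which is why a hang would more plausibly come from the exception never being raised or seen than from the kill sequence itself.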

I do think we need to look into the rank 0 shell's handling of job-exceptions and see if it could wait, at least for some period, to accept any remaining output from other job shells.

grondo commented 3 years ago

We also have not yet implemented an early task exit notification mechanism for the job shell, which would let us generate an exception when a task unexpectedly exits (after a configurable timeout, equivalent to srun -W, --wait=sec option). This could make a job appear to hang if a single task dies and other tasks are blocked waiting for something.
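
A sketch of the decision such a mechanism would have to make is below. It only models the policy (one task gone, others still running, timeout elapsed); the struct and function names are hypothetical, and treating a negative timeout as "disabled" is an assumption of this example rather than anything specified in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-task state tracked by a hypothetical launcher. */
struct task_state {
    bool exited;       /* true once the task has been reaped */
    double exit_time;  /* seconds; meaningful only if exited is true */
};

/* Return true if some task exited at least `timeout` seconds before `now`
 * while at least one other task is still running. A negative timeout
 * disables the check. */
bool early_exit_timeout_expired (const struct task_state *tasks,
                                 size_t ntasks,
                                 double now,
                                 double timeout)
{
    bool any_running = false;
    bool any_exited = false;
    double first_exit = 0.0;
    size_t i;

    if (timeout < 0.0)
        return false;
    for (i = 0; i < ntasks; i++) {
        if (!tasks[i].exited) {
            any_running = true;
        }
        else if (!any_exited || tasks[i].exit_time < first_exit) {
            any_exited = true;
            first_exit = tasks[i].exit_time;  /* track the earliest exit */
        }
    }
    return any_exited && any_running && (now - first_exit) >= timeout;
}
```

When this returns true, the launcher would raise a job exception so that the remaining tasks are cleaned up instead of blocking indefinitely on the task that is gone.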

dongahn commented 3 years ago

[quartz16:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)

Is this output from MPI? Would it be possible this MPI has some logic provided by SLURM or something in its SEGV signal handler?

dongahn commented 3 years ago

We also have not yet implemented an early task exit notification mechanism for the job shell, which would let us generate an exception when a task unexpectedly exits (after a configurable timeout, equivalent to srun -W, --wait=sec option). This could make a job appear to hang if a single task dies and other tasks are blocked waiting for something.

https://github.com/flux-framework/flux-core/issues/3277#issuecomment-728344066:

As a reference, the hang doesn't occur when you run the above reproducer with /usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux start -s 4:

quartz188{dahn}21: flux mini run -N 4 -n 4 ./flux_crash_bug 0 10
[quartz188:mpi_rank_2][mv2_psm_err_handler] PSM error handler: Endpoint could not be reached : Some shared memory endpoints could not be connected because there is no shared memory PSM2 device (shm) in the currently enabled PSM2_DEVICES (self,hfi,self): quartz188:3.0., quartz188:3.0., quartz188:3.0.
psm_ep_connect failed with error Endpoint could not be reached
[quartz188:mpi_rank_2][psm_connect_alltoall] psm_connect_alltoall failed
[quartz188:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)
[quartz188:mpi_rank_0][mv2_psm_err_handler] PSM error handler: Endpoint could not be reached : Some shared memory endpoints could not be connected because there is no shared memory PSM2 device (shm) in the currently enabled PSM2_DEVICES (self,hfi,self): quartz188:4.0., quartz188:4.0., quartz188:4.0.
psm_ep_connect failed with error Endpoint could not be reached
[quartz188:mpi_rank_3][mv2_psm_err_handler] PSM error handler: Endpoint could not be reached : Some shared memory endpoints could not be connected because there is no shared memory PSM2 device (shm) in the currently enabled PSM2_DEVICES (self,hfi,self): quartz188:5.0., quartz188:5.0., quartz188:5.0.
psm_ep_connect failed with error Endpoint could not be reached
[quartz188:mpi_rank_3][psm_connect_alltoall] psm_connect_alltoall failed
[quartz188:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
[quartz188:mpi_rank_0][psm_connect_alltoall] psm_connect_alltoall failed
[quartz188:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[quartz188:mpi_rank_1][mv2_psm_err_handler] PSM error handler: Endpoint could not be reached : Some shared memory endpoints could not be connected because there is no shared memory PSM2 device (shm) in the currently enabled PSM2_DEVICES (self,hfi,self): quartz188:6.0., quartz188:6.0., quartz188:6.0.
psm_ep_connect failed with error Endpoint could not be reached
[quartz188:mpi_rank_1][psm_connect_alltoall] psm_connect_alltoall failed
[quartz188:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
flux-job: task(s) exited with exit code 139

grondo commented 3 years ago

I don't think that error is from Slurm, but from MPI, or some other library that provides a segv handler, possibly to dump a backtrace instead of corefile? If the program segfaults but doesn't exit, no resource manager is going to be able to detect that.

Edit: Strange. I wonder why the tasks exit under Flux but not under Slurm?

dongahn commented 3 years ago

It did dump a core file. I kind of suspect there is something in the signal handler that interacts with SLURM... (local mod?)

dongahn commented 3 years ago

We also have not yet implemented an early task exit notification mechanism for the job shell, which would let us generate an exception when a task unexpectedly exits (after a configurable timeout, equivalent to srun -W, --wait=sec option). This could make a job appear to hang if a single task dies and other tasks are blocked waiting for something.

OK. I am concluding that the hang is due to this missing feature.

It looks like in some cases the PSM layer itself can detect that some endpoints have exited, at which point all other MPI processes exit. But it doesn't look like we can rely on the underlying communication layer to do this in the general case.

Do we have an open ticket for the early exit notification? It would come in handy since we can point users at the ticket if they report such hangs in the future.

garlick commented 3 years ago

FWIW, in #3678 I added an MPI_Abort() test similar to @dongahn's above and ran into hangs in CI with mpich-3.0 on centos 7.

On systems that successfully run this test (such as ubuntu focal with mpich-3.3), I observe that MPI_Abort() causes a PMI-1 wire protocol abort RPC, which the shell pmi plugin handles by calling shell_die(). shell_die() tries to send SIGKILL to other tasks on that shell, posts a fatal exception, and exits. I often see the shell_output_write errors mentioned above when shell 0 dies in this way. The job seems to reliably conclude shortly thereafter, I assume because the exception sets cleanup in motion in the job-exec module (though I didn't follow that code through).

On the hanging system, MPI_Abort() does not cause a PMI-1 wire protocol abort. It simply prints a message from the task and exits the task. I guess then it hangs because we don't yet handle early exit (#2238).

For the record, the mpich commit where the pmi-1 client behavior changed is https://github.com/pmodels/mpich/commit/b1e89abf9102a9690a4ce394442212e600332094.

OpenMPI works OK because the flux plugin translates PMI_Abort() to a dlopened call to the flux libpmi.so.

Looking at source for mvapich2 v2.3 here, it would appear that it should work properly. Since they don't publish a git tree with history, it's laborious to identify past versions that don't work, so I left that as an exercise for the reader.

garlick commented 6 months ago

I think this can be closed as we now have tests for PMI_Abort() as well as early exit detection.

grondo commented 6 months ago

Just adding a note here that if we make the default -o exit-timeout=none as proposed in #5820, then we should re-open this issue (or perhaps we don't fix #5820 because of it).