flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

Flux 0.68.0: broken pipe for MPI subprocesses #6449

Open ardangelo opened 2 hours ago

ardangelo commented 2 hours ago

Testing the new release of Flux 0.68.0, I'm seeing a failure with MPI subprocesses that we didn't run into on Flux 0.67.0. The test is a simple MPI wrapper utility that calls MPI_Init, forks a subprocess, then waits for it to finish before calling MPI_Finalize. On multi-node jobs, some ranks fail with a "broken pipe" error, and the launched subprocesses on those ranks appear to exit immediately because their stdin is closed. Running the subprocess directly, or running the wrapper without MPI enabled, does not exhibit the issue.

Output:

$ cc -Wall -g -O0 mpi_wrapper.c -o mpi_wrapper
$ flux run -n2 -N2 -t 2m ./mpi_wrapper /usr/bin/cat
291 passed MPI_Init forked pid 292
249 passed MPI_Init forked pid 250
291 calling MPI_Finalize
hello
12.005s: flux-shell[0]:  WARN: exception: shell rank 1 (on node2): Broken pipe
hello
249 calling MPI_Finalize
flux-job: job shell Broken pipe

Eventlog:

$ flux job eventlog ƒ6tjnXB3d
1731685214.089539 submit userid=0 urgency=16 flags=0 version=1
1731685214.104842 validate
1731685214.116735 depend
1731685214.116760 priority priority=16
1731685214.124820 alloc
1731685214.124862 prolog-start description="cray-pals-port-distributor"
1731685214.125588 cray_port_distribution ports=[11999,11998] random_integer=7401511950000771050
1731685214.125619 prolog-finish description="cray-pals-port-distributor" status=0
1731685214.144850 start
1731685220.763595 finish status=13
1731685220.768427 release ranks="all" final=true
1731685220.768468 free
1731685220.768484 clean

Wrapper source:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include "mpi.h"

extern char ** environ;
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    pid_t pid = fork();
    if (pid == 0) {
        /* child: exec the wrapped command with inherited stdio */
        errno = 0;
        execve(argv[1], &argv[1], environ);
        return 1;   /* reached only if execve() failed */

    } else {
        /* parent: report the forked pid, then wait for the child to exit */
        fprintf(stderr, "%d passed MPI_Init forked pid %d\n", getpid(), pid);
        wait(0);
    }

    fprintf(stderr, "%d calling MPI_Finalize\n", getpid());
    MPI_Finalize();

    return 0;
}
garlick commented 2 hours ago

Thanks! I played around with this a bit.

Note that we fixed a bug in 0.68 where two copies of the stdio file descriptors were being passed to spawned user processes.

Maybe that explains why this wasn't seen in prior releases?
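
To illustrate the mechanics (a generic sketch of pipe fd semantics, not flux-core code): as long as any duplicate of a pipe's write end is still open the reader never sees EOF, and once the read end is gone a writer gets EPIPE/SIGPIPE, so a lingering extra copy of the stdio descriptors could mask this kind of broken-pipe behavior:

/* Generic illustration (not flux-core code): duplicate pipe fds change
 * when EOF and EPIPE are observed. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int p[2];
    char buf[8];

    signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of dying on write */

    /* Case 1: an extra dup of the write end delays EOF for the reader. */
    pipe(p);
    int extra = dup(p[1]);      /* second copy of the write end */
    close(p[1]);                /* no EOF yet: 'extra' still holds it open */
    close(extra);               /* last writer gone, reader now sees EOF */
    printf("read returns %zd (0 == EOF)\n", read(p[0], buf, sizeof(buf)));

    /* Case 2: with the read end closed, a write fails with EPIPE. */
    pipe(p);
    close(p[0]);
    if (write(p[1], "x", 1) < 0)
        printf("write fails: %s\n", strerror(errno));

    return 0;
}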

garlick commented 28 minutes ago

This is pretty strange:

$ flux run -l -n2 ./testexec /usr/bin/cat /proc/self/fdinfo/0|dshbak -c
1: wait status=0
0: wait status=0
----------------
0
----------------
waiting for 3390300
pos:    0
flags:  02
mnt_id: 10
----------------
1
----------------
waiting for 3390301
pos:    0
flags:  0100000
mnt_id: 24

On rank 0 (the "good" stdin), flags are 02 (O_RDWR) and mnt_id is 10. On rank 1 (the "bad" stdin), flags are 0100000 (O_LARGEFILE) and mnt_id is 24.

When I run a non-failing case, all the ranks look like rank 0.

If I grab the flags with fcntl(0, F_GETFL) before MPI_Init(), they read 02 for both ranks.
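
Roughly, that check is just this (a standalone sketch; structure and names are mine):

/* Sketch: print fd 0's status flags before and after MPI_Init(). */
#include <fcntl.h>
#include <stdio.h>
#include "mpi.h"

static void show_stdin_flags(const char *when)
{
    printf("%s: fd 0 flags = 0%o\n", when, fcntl(0, F_GETFL));
}

int main(int argc, char *argv[])
{
    show_stdin_flags("before MPI_Init");
    MPI_Init(&argc, &argv);
    show_stdin_flags("after MPI_Init");
    MPI_Finalize();
    return 0;
}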

So something inside MPI_Init() appears to be replacing or otherwise altering that file descriptor.

Could it be in Cray PMI?