LLNL / LaunchMON

LaunchMON is a software infrastructure that enables HPC run-time tools to co-locate tool daemons with a parallel job. Its API allows a tool to identify all the remote processes of a job and to scalably launch daemons into the relevant nodes.

LaunchMON hang #36

Open jeffreybquinn opened 6 years ago

jeffreybquinn commented 6 years ago

We have recently upgraded our internal development cluster (Xeon Skylake Gold 6140/38 with SLES12 SP2). In rebuilding the STAT debug tool and its dependencies such as LaunchMON, we've encountered hang failures for LaunchMON smoketests test.attach_1 & test.launch_1.

Versions in use: LaunchMON 1.0.2, gcc 5.4.0, openmpi 1.10.7, slurm 16.05.10-2

(These are the versions specified by our BKC build recipe. Our plan is to stage updating to newest versions after the baseline has been re-established.)

To debug this, we were hoping to get access to logs/traces of successful runs of these two smoke tests on a similar configuration. We believe a differential analysis of this sort can help point us toward the configuration and build settings we need to adjust. We are also collecting strace logs to narrow down the hang point, but are having trouble interpreting them because we lack in-depth familiarity with how the tests and libraries operate.
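As a rough illustration of the differential analysis mentioned above, here is a minimal Python sketch that masks values expected to differ between runs (PIDs, addresses) and then diffs a passing log against a failing one. The function names `normalize` and `diff_logs` are made up for this example, and the "bad" log lines are hypothetical; only the "good" lines follow the format of LaunchMON smoke-test output.

```python
import difflib
import re

def normalize(line):
    """Mask values that legitimately differ between runs."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<ADDR>", line)  # pointers, load addresses
    line = re.sub(r"\b\d{3,}\b", "<NUM>", line)       # PIDs, ports, long counters
    return line.rstrip()

def diff_logs(good_lines, bad_lines):
    """Return a unified diff of the two logs after normalization."""
    good = [normalize(l) for l in good_lines]
    bad = [normalize(l) for l in bad_lines]
    return list(difflib.unified_diff(good, bad, "good", "bad", lineterm=""))

# Lines in the format printed by a passing run.
good_run = [
    "[LMON BE(0)] Target process: 203047, MPI RANK: 0",
    "[LMON FE] RM launcher's pid is 203029",
    "[LMON FE] PASS: run through the end",
]
# Hypothetical failing run, invented for illustration: same output
# with different PIDs, but it stops before the PASS line.
bad_run = [
    "[LMON BE(0)] Target process: 333301, MPI RANK: 0",
    "[LMON FE] RM launcher's pid is 333219",
]

for line in diff_logs(good_run, bad_run):
    print(line)
```

Because the PIDs are masked, the only difference reported is the missing PASS line, which is the kind of signal we are hoping to extract from a real pair of logs.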

lee218llnl commented 6 years ago

FWIW, below is a successful run of test.attach_1 (ignore those first 4 srun errors). Can you attach to the hung process and give a stack trace of where the test hangs? Any more output info from the test runs would be helpful too.

bash-4.2$ ./test.attach_1
srun: error: DisableRootJobs specified more than once, latest value used
srun: error: SwitchType specified more than once, latest value used
srun: error: DisableRootJobs specified more than once, latest value used
srun: error: SwitchType specified more than once, latest value used
[LMON BE] signum: 10, argv[1]: 10 argc(2)
[LMON BE] signum: 10, argv[1]: 10 argc(2)
[LMON BE(0)] Target process: 203047, MPI RANK: 0
[LMON BE(0)] Target process: 203048, MPI RANK: 1
[LMON BE(0)] Target process: 203049, MPI RANK: 2
[LMON BE(0)] Target process: 203050, MPI RANK: 3
[LMON BE(0)] Target process: 203051, MPI RANK: 4
[LMON BE(0)] Target process: 203052, MPI RANK: 5
[LMON BE(0)] Target process: 203053, MPI RANK: 6
[LMON BE(0)] Target process: 203054, MPI RANK: 7
[LMON BE(0)] Target process: 203055, MPI RANK: 8
[LMON BE(0)] Target process: 203056, MPI RANK: 9
[LMON BE(0)] Target process: 203057, MPI RANK: 10
[LMON BE(0)] Target process: 203058, MPI RANK: 11
[LMON BE(0)] Target process: 203059, MPI RANK: 12
[LMON BE(0)] Target process: 203060, MPI RANK: 13
[LMON BE(0)] Target process: 203061, MPI RANK: 14
[LMON BE(0)] Target process: 203062, MPI RANK: 15
[LMON BE(1)] Target process: 203657, MPI RANK: 16
[LMON BE(1)] Target process: 203658, MPI RANK: 17
[LMON BE(1)] Target process: 203659, MPI RANK: 18
[LMON BE(1)] Target process: 203660, MPI RANK: 19
[LMON BE(1)] Target process: 203661, MPI RANK: 20
[LMON BE(1)] Target process: 203662, MPI RANK: 21
[LMON BE(1)] Target process: 203663, MPI RANK: 22
[LMON BE(1)] Target process: 203664, MPI RANK: 23
[LMON BE(1)] Target process: 203665, MPI RANK: 24
[LMON BE(1)] Target process: 203666, MPI RANK: 25
[LMON BE(1)] Target process: 203667, MPI RANK: 26
[LMON BE(1)] Target process: 203668, MPI RANK: 27
[LMON BE(1)] Target process: 203669, MPI RANK: 28
[LMON BE(1)] Target process: 203670, MPI RANK: 29
[LMON BE(1)] Target process: 203671, MPI RANK: 30
[LMON BE(1)] Target process: 203672, MPI RANK: 31
[LMON FE] Please check the correctness of the following proctable
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203047(rank 0)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203048(rank 1)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203049(rank 2)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203050(rank 3)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203051(rank 4)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203052(rank 5)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203053(rank 6)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203054(rank 7)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203055(rank 8)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203056(rank 9)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203057(rank 10)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203058(rank 11)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203059(rank 12)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203060(rank 13)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203061(rank 14)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203062(rank 15)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203657(rank 16)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203658(rank 17)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203659(rank 18)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203660(rank 19)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203661(rank 20)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203662(rank 21)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203663(rank 22)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203664(rank 23)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203665(rank 24)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203666(rank 25)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203667(rank 26)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203668(rank 27)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203669(rank 28)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203670(rank 29)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203671(rank 30)
[LMON FE]
[LMON FE] host_name: rzoz2
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203672(rank 31)
[LMON FE]

[LMON FE] Please check the correctness of the following resource handle

[LMON FE] [LMON FE] RM type is 1

[LMON FE] RM launcher's pid is 203029

[LMON FE] PASS: run through the end
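The test asks you to check the proctable by eye. If it helps, that check can be scripted. Below is a minimal Python sketch, not part of LaunchMON, that parses the `[LMON FE]` proctable lines from captured output and verifies that the ranks run from 0 to N-1 with no gaps. The helper names `parse_proctable` and `ranks_are_contiguous` are hypothetical, and the sample is a two-entry excerpt of the output above.

```python
import re
from collections import Counter

# One proctable entry spans three "[LMON FE]" lines: host, executable, pid(rank).
ENTRY = re.compile(
    r"host_name: (?P<host>\S+).*?"
    r"executable_name: (?P<exe>\S+).*?"
    r"pid: (?P<pid>\d+)\(rank (?P<rank>\d+)\)",
    re.S,
)

def parse_proctable(text):
    """Return a list of (host, executable, pid, rank) tuples."""
    return [
        (m.group("host"), m.group("exe"), int(m.group("pid")), int(m.group("rank")))
        for m in ENTRY.finditer(text)
    ]

def ranks_are_contiguous(entries):
    """True if the ranks are exactly 0..N-1 with no gaps or duplicates."""
    return sorted(e[3] for e in entries) == list(range(len(entries)))

SAMPLE = """\
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203047(rank 0)
[LMON FE]
[LMON FE] host_name: rzoz1
[LMON FE] executable_name: /g/g0/lee218/src/launchmon/test/src/hang_on_SIGUSR1
[LMON FE] pid: 203048(rank 1)
[LMON FE]
"""

entries = parse_proctable(SAMPLE)
print(len(entries), "entries; contiguous ranks:", ranks_are_contiguous(entries))
print("processes per host:", dict(Counter(e[0] for e in entries)))
```

Run against the full output of a passing test, the per-host counts should match the job layout (16 processes on each of the two nodes in the run above).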

jeffreybquinn commented 6 years ago

I believe my failure occurs before attachment to the remote job, so I collected a backtrace from the core dump file instead:

(gdb) bt
#0  0x00007f8009c1f8d7 in raise () from /lib64/libc.so.6
#1  0x00007f8009c20caa in abort () from /lib64/libc.so.6
#2  0x00007f800b32ef87 in cobo_connect_hostname ()
    from /home/jbquinn/work_dir/stat-2.2.0/launchmon-v1.0.2/build/tools/cobo/src/.libs/libcobo.so.1
#3  0x00007f800b33114f in cobo_server_open ()
    from /home/jbquinn/work_dir/stat-2.2.0/launchmon-v1.0.2/build/tools/cobo/src/.libs/libcobo.so.1
#4  0x00007f800b763228 in LMON_assist_ICCL_BE_init (mydesc=0x7f800b98b900 )
    at ../../../../../launchmon/src/linux/lmon_api/lmon_fe.cxx:1570
#5  LMON_fe_beHandshakeSequence (sessionHandle=sessionHandle@entry=0, is_launch=is_launch@entry=false,
    febe_data=febe_data@entry=0x0, befe_data=befe_data@entry=0x0)
    at ../../../../../launchmon/src/linux/lmon_api/lmon_fe.cxx:1861
#6  0x00007f800b76c001 in LMON_fe_attachAndSpawnDaemons (sessionHandle=0, hostname=hostname@entry=0x0,
    launcherPid=launcherPid@entry=333219, toolDaemon=<optimized out>, d_argv=d_argv@entry=0x7fff4dbb54b0,
    febe_data=febe_data@entry=0x0, befe_data=0x0) at ../../../../../launchmon/src/linux/lmon_api/lmon_fe.cxx:5453
#7  0x00000000004012c5 in main (argc=, argv=0x7fff4dbb5498)
    at ../../../test/src/fe_attach_smoketest.cxx:239

LaunchMON[334492]: LaunchMON security error in handshake: COBO/PMGR Handshake Security Error. My uid = xxx. Server at xxx:20101 took my connection from yyy:52020, but failed with error: Bad credential provided: Rewound credential


In an older run, I had enabled some tracing within the shell script called "fe_attach_smoketest". Several of the vars resolve to empty strings, which I speculate may be the result of build malfunctions.

...
Line 203: test '' '!=' '%%%MAGIC variable%%%'
Line 205: func_exec_program 333219 /home/jbquinn/work_dir/stat-2.2.0/launchmon-v1.0.2/build/test/src/be_kicker 10
Line 132: case " $* " in
Line 143: func_exec_program_core 333219 /home/jbquinn/work_dir/stat-2.2.0/launchmon-v1.0.2/build/test/src/be_kicker 10
Line 117: test -n ''
Line 121: exec /home/jbquinn/work_dir/stat-2.2.0/launchmon-v1.0.2/build/test/src/.libs/fe_attach_smoketest 333219 /home/jbquinn/work_dir/stat-2.2.0/launchmon-v1.0.2/build/test/src/be_kicker 10
./test.attach_1: line 92: 334492 Aborted (core dumped) fe_attach_smoketest $PID pwd/be_kicker $SIGNUM
srun: error: yyy: task 0: Aborted (core dumped)

dongahn commented 6 years ago

@jeffreybquinn: It seems the LaunchMON front end fails to connect to the back-end daemons. Could you configure with --enable-sec-none, which disables secure handshaking, and run the basic test again? https://github.com/LLNL/LaunchMON/blob/master/config/x_ac_handshake.m4#L42

jeffreybquinn commented 6 years ago

With the U.S. holidays now out of the way, there's lots to summarize here. With this change plus a few other changes from Ralph Castain's 12/12 email on the OpenMPI mailing list, we're able to build LaunchMON (gnu compilers, OpenMPI 1.10.7) and pass the LaunchMON smoke tests. :)

During execution of the STAT script tests, we encounter new issues:

1.) During the "basic samples" stage of any EXE (after the "launch and sample" stage), we get this error from LaunchMON: <Jan 09 16:25:21> (INFO): Open a FIFO () failed, errno(2). After a configurable period, the timeout occurs. I will begin peppering the code with calls to printf, LMON_say_msg, and self_trace_t::trace() to gain additional visibility. I'm not yet sure how to configure/build for maximum debug message verbosity, though.

2.) Intermittently during the "launch and sample" stage, the EXE output appears instead of output from [handshake.c:\d\d\d], which again results in a timeout. It smells like a software race condition, presumably in the try block of STATapp::launch().
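For reference on the errno(2) in issue 1: on Linux, errno 2 is ENOENT ("No such file or directory"). The message also prints empty parentheses where the FIFO name would presumably go, so one guess (an assumption, not something confirmed from the LaunchMON source) is that the FIFO path is empty at that point. The short Python sketch below decodes the errno and shows that opening an empty path fails with that same value; `try_open_fifo` is a hypothetical helper written for this illustration.

```python
import errno
import os

# Decode the numeric errno reported in the log message.
print(errno.errorcode[2], "-", os.strerror(2))

def try_open_fifo(path):
    """Try to open `path` for writing; return 0 on success, else the errno."""
    try:
        # O_NONBLOCK keeps the open from blocking when no reader is present.
        fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    return 0

# An empty path name cannot be resolved, so the open fails with ENOENT.
print(try_open_fifo("") == errno.ENOENT)
```

A FIFO path that is non-empty but no longer exists would produce the same errno, so this does not distinguish between the two cases.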

lee218llnl commented 6 years ago

Issue 1) is a reattach issue. One should be able to attach, detach, and reattach a debugger and expect orterun to handle this properly. You are using quite an old version of OpenMPI, so I don't know if there is much hope of getting that version to work. I haven't tested it recently, but I seem to recall that this works in more recent versions of OpenMPI. For issue 2), that doesn't give me a lot to go on, so I'm not sure how to help you there without more specific information.

dongahn commented 6 years ago

@jeffreybquinn: for OpenMPI/orterun, LaunchMON uses MPIR_attach_fifo support (http://mpi-forum.org/docs/mpir-specification-10-11-2010.pdf). It sends a small message to the FIFO opened by orterun asking this launcher to launch tool daemons on the nodes where the MPI processes are running. I believe what @lee218llnl characterizes above is accurate.
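To make the FIFO mechanism concrete, here is a minimal, self-contained Python sketch of the pattern described above: one side (standing in for the launcher) opens a named FIFO and waits, and the other side (standing in for the tool) writes a small message to it. This is only an illustration of the pattern on a POSIX system; the FIFO name, the one-byte payload, and the function names are assumptions, and it does not reproduce what orterun or LaunchMON actually do.

```python
import os
import tempfile
import threading

def launcher_side(fifo_path, received):
    # Stand-in for the launcher: block until a tool writes to the FIFO.
    with open(fifo_path, "rb") as fifo:
        received.append(fifo.read(1))

def tool_side(fifo_path):
    # Stand-in for the tool: send a small wake-up message.
    with open(fifo_path, "wb") as fifo:
        fifo.write(b"\x01")

def demo():
    received = []
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "attach_fifo")  # hypothetical name
        os.mkfifo(fifo_path)
        reader = threading.Thread(target=launcher_side, args=(fifo_path, received))
        reader.start()
        tool_side(fifo_path)  # opening for write unblocks the reader's open
        reader.join(timeout=5)
    return received

print(demo())
```

If the FIFO does not exist when the tool side tries to open it, the open fails with ENOENT, which matches the errno(2) reported in issue 1.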