flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

debugging running flux with spindle #1514

Closed trws closed 6 years ago

trws commented 6 years ago

As a first cut to see if we could at least get started, Frank and I just went through a little testing on quartz, using the installed versions of both spindle and flux, to see what they would do if we put them together. The results were interesting, if not terribly successful, and I’m not quite sure where to even file this.

Just running flux directly with spindle and slurm apparently makes flux think it’s not installed, and triggers it to use the build-tree paths for everything, which triggers some odd errors:

$ spindle srun -N 1 flux broker
flux: `broker' is not a flux command.  See 'flux --help'
srun: error: quartz1400: task 0: Exited with exit code 1
$ spindle srun -N 1 flux env
export FLUX_PMI_LIBRARY_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/common/.libs/libpmi.so"
export LUA_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/bindings/lua/?.lua;;;"
export FLUX_EXEC_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/broker:/builddir/build/BUILD/flux-core-0.8.0/src/cmd"
export FLUX_RC3_PATH="/builddir/build/BUILD/flux-core-0.8.0/etc/rc3"
export FLUX_SEC_DIRECTORY="/builddir/build/BUILD/flux-core-0.8.0/etc/flux"
export PYTHONPATH="/builddir/build/BUILD/flux-core-0.8.0/src/bindings/python/pycotap:/builddir/build/BUILD/flux-core-0.8.0/src/bindings/python"
export LUA_CPATH="/builddir/build/BUILD/flux-core-0.8.0/src/bindings/lua/?.so;;;"
export FLUX_CONNECTOR_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/connectors"
export MANPATH="/builddir/build/BUILD/flux-core-0.8.0/doc:/usr/tce/packages/dotkit/dotkit/man:/usr/man:/usr/share/man:/usr/local/man:/usr/X11R6/man:/usr/lib64/mvapich/default/man"
export FLUX_WREXECD_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/modules/wreck/wrexecd"
export FLUX_WRECK_LUA_PATTERN="/builddir/build/BUILD/flux-core-0.8.0/src/modules/wreck/lua.d/*.lua"
export FLUX_MODULE_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/modules"
export FLUX_RC1_PATH="/builddir/build/BUILD/flux-core-0.8.0/etc/rc1"
$ srun -N 1 flux env
export FLUX_PMI_LIBRARY_PATH="/usr/lib64/flux/libpmi.so"
export LUA_PATH="/usr/share/lua/5.1/?.lua;;;"
export FLUX_EXEC_PATH="/usr/libexec/flux/cmd"
export FLUX_RC3_PATH="/etc/flux/rc3"
export FLUX_SEC_DIRECTORY="/g/g12/scogland/.flux"
export PYTHONPATH="/usr/lib64/python2.7/site-packages"
export LUA_CPATH="/usr/lib64/lua/5.1/?.so;;;"
export PATH="/g/g12/scogland/.local/bin:/g/g12/scogland/scripts:/g/g12/scogland/programs/toss_3_x86_64_ib/cargo/bin:/g/g12/scogland/programs/toss_3_x86_64_ib/go/bin:/g/g12/scogland/spack/bin:/usr/local/bin:/opt/local/libexec/gnubin:/usr/local/sbin:/opt/local/sbin:/usr/tce/packages/texlive/texlive-2016/2016/bin/x86_64-linux:/usr/tce/packages/mvapich2/mvapich2-2.2-intel-18.0.1/bin:/usr/tce/packages/intel/intel-18.0.1/bin:/usr/tce/bin:/usr/lib64/qt-3.3/bin:/g/g12/scogland/programs/default/cargo/bin:/g/g12/scogland/programs/default/go/bin:/g/g12/scogland/programs/default/bin:/usr/bin:/g/g12/scogland/bin:/usr/sbin:/sbin"
export FLUX_CONNECTOR_PATH="/usr/lib64/flux/connectors"
export MANPATH="/usr/tce/packages/dotkit/dotkit/man:/usr/man:/usr/share/man:/usr/local/man:/usr/X11R6/man:/usr/lib64/mvapich/default/man"
export FLUX_WREXECD_PATH="/usr/libexec/flux/wrexecd"
export FLUX_WRECK_LUA_PATTERN="/etc/wreck/lua.d/*.lua"
export FLUX_MODULE_PATH="/usr/lib64/flux/modules"
export FLUX_RC1_PATH="/etc/flux/rc1"
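
To see at a glance which variables changed between the two runs, the `flux env` output can be diffed mechanically. The following is a minimal sketch, not part of flux or spindle: `parse_exports` and `changed_vars` are hypothetical helper names, and the sample values are copied from the two transcripts above.

```python
import re

# Matches lines of the form: export NAME="value"
EXPORT_RE = re.compile(r'^export (\w+)="(.*)"$')


def parse_exports(text):
    """Parse `export NAME="value"` lines, as printed by `flux env`, into a dict."""
    env = {}
    for line in text.splitlines():
        match = EXPORT_RE.match(line.strip())
        if match:
            env[match.group(1)] = match.group(2)
    return env


def changed_vars(baseline, candidate):
    """Return {name: (baseline_value, candidate_value)} for variables that differ."""
    names = sorted(set(baseline) | set(candidate))
    return {
        name: (baseline.get(name), candidate.get(name))
        for name in names
        if baseline.get(name) != candidate.get(name)
    }


# Two variables from each transcript; in practice, feed in the full output.
plain = parse_exports('''
export FLUX_EXEC_PATH="/usr/libexec/flux/cmd"
export FLUX_MODULE_PATH="/usr/lib64/flux/modules"
''')
under_spindle = parse_exports('''
export FLUX_EXEC_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/broker:/builddir/build/BUILD/flux-core-0.8.0/src/cmd"
export FLUX_MODULE_PATH="/builddir/build/BUILD/flux-core-0.8.0/src/modules"
''')

diff = changed_vars(plain, under_spindle)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Capturing the output of `srun -N 1 flux env` and `spindle srun -N 1 flux env` into two strings and passing them through `parse_exports` would list every variable that flipped to a build-tree path.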

Since flux decides whether to do that by checking whether it’s being launched from the install directory, I also tried adding -a no to tell spindle not to do anything with the base binary. That changes the outcome, in that we don’t see wrong paths, but now nothing runs at all. Even builtin commands like env won't run:

$ spindle -a no srun -N 2 flux env
srun: error: quartz1400: task 1: Exited with exit code 255
srun: error: quartz1396: task 0: Exited with exit code 255

Here is the output of SPINDLE_DEBUG=3 from attempting to actually have spindle run slurm run flux run wreckrun run hostname:

[FE.155085@spindle_fe_main.cc:108] - Spindle Command Line: /usr/tce/packages/spindle/spindle/bin/spindle srun -N 2 --exclusive -ppdebug flux broker flux wreckrun -N 2 hostname
[FE.155085@parseargs.cc:650] - Spindle options bitmask: 4010
[FE.155085@spindle_fe_main.cc:66] - Daemon CmdLine: /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_be --spindle_lmon 0 155085
[FE.155106@spindle_fe_main.cc:108] - Spindle Command Line: /usr/tce/packages/spindle/spindle/bin/spindle -slurm srun -N 2 --exclusive -ppdebug flux broker flux wreckrun -N 2 hostname
[FE.155127@spindle_fe_main.cc:108] - Spindle Command Line: /usr/tce/packages/spindle/spindle/bin/spindle --slurm srun -N 2 --exclusive -ppdebug flux broker flux wreckrun -N 2 hostname
[FE.155127@parseargs.cc:650] - Spindle options bitmask: 4010
[FE.155127@spindle_fe_main.cc:66] - Daemon CmdLine: /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_be --spindle_lmon 0 155127
[FE.155127@parse_launcher.cc:107] - Launcher Parsing: Using slurm to parse command line:
        srun -N 2 --exclusive -ppdebug flux broker flux wreckrun -N 2 hostname
[FE.155127@parse_launcher.cc:137] - Launcher Parsing: -N is a launcher argument with values
[FE.155127@parse_launcher.cc:139] - Launcher Parsing: 2 is a argument value to -N
[FE.155127@parse_launcher.cc:135] - Launcher Parsing: --exclusive is a launcher argument
[FE.155127@parse_launcher.cc:158] - Launcher Parsing: Warning: -ppdebug is an unrecognized option.  Assuming it's a launcher option
[FE.155127@parse_launcher.cc:125] - Launcher Parsing: flux is the application executable
[FE.155127@parse_launcher.cc:369] - Launcher Parsing: New command line is:
        srun -N 2 --exclusive -ppdebug /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_bootstrap $TMPDIR/spindle.155127 155127 4010 0 flux broker flux wreckrun -N 2 hostname
[FE.155127@spindle_fe_main.cc:75] - Application CmdLine: srun -N 2 --exclusive -ppdebug /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_bootstrap $TMPDIR/spindle.155127 155127 4010 0 flux broker flux wreckrun -N 2 hostname
[FE.155127@spindle_fe_main.cc:86] - Starting application with launchmon
[FE.155127@spindle_fe_lmon.cc:226] - ERROR: [LMON FE] LMON_fe_launchAndSpawnDaemons FAILED
0x4132f0 - /usr/tce/packages/spindle/spindle/bin/spindle() [0x4132f0]
0x421d57 - /usr/tce/packages/spindle/spindle/bin/spindle() [0x421d57]
0x4049fb - /usr/tce/packages/spindle/spindle/bin/spindle() [0x4049fb]
0x2aaaad09cc05 - /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad09cc05]
0x404cd7 - /usr/tce/packages/spindle/spindle/bin/spindle() [0x404cd7]

grondo commented 6 years ago

Thanks for the testing!

As you deduced, the spindle environment is probably confusing flux's flux_is_installed() function, which basically compares /proc/self/exe with the compiled-in bindir (the one set by ./configure, e.g. $prefix/bin). I kind of wonder what is going on there, but would a cmdline option to force flux to use installed paths help here?
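
For readers unfamiliar with this kind of check, here is a rough Python sketch of the idea described above. It is not flux-core's actual implementation (which is in C); `running_exe`, `looks_installed`, and the `/opt/example` paths are hypothetical and only illustrate why a relocated copy of the binary would fail the comparison.

```python
import os


def running_exe():
    """Resolved path of the current executable (Linux only, via /proc/self/exe)."""
    return os.readlink("/proc/self/exe")


def looks_installed(exe_path, bindir):
    """Return True when exe_path sits directly inside the configured bindir."""
    exe_dir = os.path.dirname(os.path.realpath(exe_path))
    return exe_dir == os.path.realpath(bindir)


# Hypothetical paths: an installed binary, and a copy relocated to a cache directory.
print(looks_installed("/opt/example/bin/flux", "/opt/example/bin"))
print(looks_installed("/opt/example-cache/flux", "/opt/example/bin"))
```

If a tool copies the executable elsewhere before running it, the resolved executable path no longer falls under bindir, so a check of this shape would fall back to build-tree paths.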

I have no idea about the spindle -a case, but the debug output seems to indicate that the launchmon startup failed. Does spindle -a no srun -N2 mpi-hello work fine?

dongahn commented 6 years ago

Adding @mplegendre.

> As a first cut to see if we could at least get started, Frank and I just went through a little testing on quartz, using the installed versions of both spindle and flux, to see what they would do if we put them together. The results were interesting, if not terribly successful, and I’m not quite sure where to even file this.

In this mode, it seems spindle relocated the flux base executable and flux doesn't like it when it's relocated.

> Here is the output of SPINDLE_DEBUG=3 from attempting to actually have spindle run slurm run flux run wreckrun run hostname:

I don't know why it failed. I need to see what gets passed to LMON_fe_launchAndSpawnDaemons.

trws commented 6 years ago

Annoyingly enough, I hadn't thought to test that @grondo. Apparently the answer is "no." Something is wrong with -a on quartz, it can't even run hostname. So we're in an odd spot where we can't use -a but we have to...

Tests:

scogland at quartz34 in ~  (SLURM:710193) !130!
$ env SPINDLE_DEBUG=3 spindle --slurm srun -n 2   hostname
quartz34
quartz35

scogland at quartz34 in ~  (SLURM:710193)
$ env SPINDLE_DEBUG=3 spindle -a no --slurm srun -n 2   hostname
srun: error: quartz35: task 1: Exited with exit code 255
srun: error: quartz34: task 0: Exited with exit code 255

Output from failed second run:

[FE.188491@spindle_fe_main.cc:108] - Spindle Command Line: /usr/tce/packages/spindle/spindle/bin/spindle -a no --slurm srun -n 2 hostname
[FE.188491@parseargs.cc:650] - Spindle options bitmask: 3882
[FE.188491@spindle_fe_main.cc:66] - Daemon CmdLine: /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_be --spindle_lmon 0 188491
[FE.188491@parse_launcher.cc:107] - Launcher Parsing: Using slurm to parse command line:
        srun -n 2 hostname
[FE.188491@parse_launcher.cc:137] - Launcher Parsing: -n is a launcher argument with values
[FE.188491@parse_launcher.cc:139] - Launcher Parsing: 2 is a argument value to -n
[FE.188491@parse_launcher.cc:125] - Launcher Parsing: hostname is the application executable
[FE.188491@parse_launcher.cc:369] - Launcher Parsing: New command line is:
        srun -n 2 /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_bootstrap $TMPDIR/spindle.188491 188491 3882 0 hostname
[FE.188491@spindle_fe_main.cc:75] - Application CmdLine: srun -n 2 /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_bootstrap $TMPDIR/spindle.188491 188491 3882 0 hostname
[FE.188491@spindle_fe_main.cc:86] - Starting application with launchmon
[Server.188531@spindle_be_main.cc:52] - Spindle Server Cmdline: /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_be --spindle_lmon 0 188491 --lmonsharedsec=1349120343 --lmonsecchk=564495131
[Server.188531@spindle_be_lmon.cc:167] - Launchmon rank 0/2
[FE.188491@spindle_fe.cc:223] - Called spindleInitFE
[FE.188491@spindle_fe.cc:147] - Initializing FE with munge-based security
[FE.188491@spindle_fe.cc:246] - Starting FE servers with hostlist of size 2 on port 21940
[FE.188491@cobo_fe_comm.c:41] - Opening with port 21940 - 21964
[Server.188531@spindle_be.cc:74] - Initializing BE with munge-based security
[Server.188531@spindle_be.cc:120] - spindleRunBE setting up network and receiving setup data
[Server.188531@ldcs_audit_server_process.c:64] - Setting up server data structure
[Server.188531@ldcs_audit_server_md_cobo.c:95] - Opening cobo with port 21940 - 21964
[Server.188531@cobo.c:1438] - In cobo_init():
COBO_CONNECT_TIMEOUT: 10, COBO_CONNECT_BACKOFF: 2, COBO_CONNECT_SLEEP: 10, COBO_CONNECT_TIMELIMIT: 600
[Server.188531@cobo.c:790] - Opened socket on port 21940
[FE.188491@cobo.c:487] - Trying rank 0 port 21940 on quartz34
[FE.188491@cobo.c:182] - _cobo_opt_socket (sockfd=10) flag=1
[FE.188491@cobo.c:492] - Connected to rank 0 port 21940 on quartz34
[Server.188531@cobo.c:182] - _cobo_opt_socket (sockfd=7) flag=1
[handshake.c:169] - Starting handshake from client
[handshake.c:1108] - Looking up server and client addresses for socket 10
[handshake.c:1139] - Sending sig 845d96c1 on network
[handshake.c:1146] - Receiving sig from network
[handshake.c:291] - Creating outgoing packet for handshake
[handshake.c:302] - Encoded packet: server_port = 46165, client_port = 26347, uid = 37084, gid = 37084, session_id = 10506943495924867147, signature = 9b1cc028
[handshake.c:307] - Encrypting outgoing packet
[handshake.c:444] - Server encrypting packet with munge
[handshake.c:531] - Munge encoded packet successfully
[handshake.c:314] - Encrypted packet to buffer of size 228
[handshake.c:1165] - Sending packet size on network
[handshake.c:1173] - Sending packet on network
[handshake.c:1188] - Receiving packet size from network
[handshake.c:1194] - Received packet size 228
[handshake.c:1207] - Received packet from network
[handshake.c:341] - Creating an expected packet
[handshake.c:354] - Decrypting and checking packet
[handshake.c:808] - Decrypting and checking packet with munge
[handshake.c:1054] - Packets compared equal.
[handshake.c:362] - Successfully completed initial handshake
[handshake.c:1077] - Sharing handshake result 0 with peer
[handshake.c:1085] - Reading peer result
[handshake.c:1091] - Peer reported result of 0
[handshake.c:260] - Completed server handshake.  Result = 0
[FE.188491@cobo.c:600] - Sending hostlist to rank 0 on quartz34
[handshake.c:163] - Starting handshake from server
[handshake.c:1108] - Looking up server and client addresses for socket 8
[handshake.c:1139] - Sending sig 845d96c1 on network
[handshake.c:1146] - Receiving sig from network
[handshake.c:291] - Creating outgoing packet for handshake
[handshake.c:302] - Encoded packet: server_port = 46165, client_port = 26347, uid = 37084, gid = 37084, session_id = 10506943495924867147, signature = 67ad047e
[handshake.c:307] - Encrypting outgoing packet
[handshake.c:444] - Server encrypting packet with munge
[handshake.c:531] - Munge encoded packet successfully
[handshake.c:314] - Encrypted packet to buffer of size 228
[handshake.c:1165] - Sending packet size on network
[handshake.c:1173] - Sending packet on network
[handshake.c:1188] - Receiving packet size from network
[handshake.c:1194] - Received packet size 228
[handshake.c:1207] - Received packet from network
[handshake.c:341] - Creating an expected packet
[handshake.c:354] - Decrypting and checking packet
[handshake.c:808] - Decrypting and checking packet with munge
[handshake.c:1054] - Packets compared equal.
[handshake.c:362] - Successfully completed initial handshake
[handshake.c:1077] - Sharing handshake result 0 with peer
[handshake.c:1085] - Reading peer result
[handshake.c:1091] - Peer reported result of 0
[handshake.c:260] - Completed server handshake.  Result = 0
[Server.188531@cobo.c:933] - 0: on COBO00: connect to child #01 (quartz35)
[Server.188531@cobo.c:487] - Trying rank 1 port 21940 on quartz35
[Server.188531@cobo.c:182] - _cobo_opt_socket (sockfd=7) flag=1
[Server.188531@cobo.c:492] - Connected to rank 1 port 21940 on quartz35
[handshake.c:169] - Starting handshake from client
[handshake.c:1108] - Looking up server and client addresses for socket 7
[handshake.c:1139] - Sending sig 845d96c1 on network
[handshake.c:1146] - Receiving sig from network
[handshake.c:291] - Creating outgoing packet for handshake
[handshake.c:302] - Encoded packet: server_port = 46165, client_port = 667, uid = 37084, gid = 37084, session_id = 10506943495924867147, signature = 9b1cc028
[handshake.c:307] - Encrypting outgoing packet
[handshake.c:444] - Server encrypting packet with munge
[handshake.c:531] - Munge encoded packet successfully
[handshake.c:314] - Encrypted packet to buffer of size 228
[handshake.c:1165] - Sending packet size on network
[handshake.c:1173] - Sending packet on network
[handshake.c:1188] - Receiving packet size from network
[handshake.c:1194] - Received packet size 228
[handshake.c:1207] - Received packet from network
[handshake.c:341] - Creating an expected packet
[handshake.c:354] - Decrypting and checking packet
[handshake.c:808] - Decrypting and checking packet with munge
[handshake.c:1054] - Packets compared equal.
[handshake.c:362] - Successfully completed initial handshake
[handshake.c:1077] - Sharing handshake result 0 with peer
[handshake.c:1085] - Reading peer result
[handshake.c:1091] - Peer reported result of 0
[handshake.c:260] - Completed server handshake.  Result = 0
[Server.188531@cobo.c:600] - Sending hostlist to rank 1 on quartz35
[Server.188531@cobo.c:1193] - Starting cobo_barrier()
[Server.188531@cobo.c:1201] - Exiting cobo_barrier(), took 0.000109 seconds for 2 procs
[Server.188531@cobo.c:1459] - Exiting cobo_close(), took 0.083390 seconds for 2 procs
[Server.188531@cobo.c:1467] - Exiting cobo_init(), took 0.165283 seconds for 2 procs
[Server.188531@ldcs_audit_server_md_cobo.c:100] - cobo_open complete. Cobo rank 0/2
[Server.188531@cobo.c:1193] - Starting cobo_barrier()
[Server.188531@cobo.c:1201] - Exiting cobo_barrier(), took 0.000330 seconds for 2 procs
[Server.188531@ldcs_audit_server_md_cobo.c:120] - sent FE client signal that server are ready 13
[Server.188531@ldcs_audit_server_process.c:76] - Reading setup message from parent
[FE.188491@spindle_fe.cc:252] - Sending parameters to servers
[FE.188491@cobo_fe_comm.c:71] - Broadcasting message to daemons
[FE.188491@cobo_comm.c:186] - Wrote 16 bytes to network: 38 0 0...
[Server.188531@cobo_comm.c:166] - Read 16 bytes from network: 38 0 0...
[FE.188491@cobo_comm.c:186] - Wrote 167 bytes to network: 75 -32 2...
[Server.188531@cobo_comm.c:166] - Read 167 bytes from network: 75 -32 2...
[Server.188531@cobo_comm.c:186] - Wrote 16 bytes to network: 38 0 0...
[Server.188531@cobo_comm.c:186] - Wrote 167 bytes to network: 75 -32 2...
[Server.188531@spindle_be.cc:140] - Translated location from $TMPDIR/spindle.188491 to /var/tmp/scogland/spindle.188491
[Server.188531@ldcs_audit_server_process.c:99] - Initializing server data structures
[Server.188531@ldcs_audit_server_process.c:115] - Using PUSH model
[Server.188531@ldcs_audit_server_process.c:132] - Initializing file cache location /var/tmp/scogland/spindle.188491
[Server.188531@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.188491
[Server.188531@ldcs_audit_server_process.c:136] - Initializing connections for clients at /var/tmp/scogland/spindle.188491 and 188491
[Server.188531@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.188491/spindle_comm
[Server.188531@ldcs_api_pipe_notify.c:90] - add watch: ntfd=9 dir=/var/tmp/scogland/spindle.188491/spindle_comm
[Server.188531@ldcs_api_pipe_notify.c:119] - return ntfd=9
[Server.188531@ldcs_audit_server_md_cobo.c:136] - Registering fd 8 for cobo parent connection
[Server.188531@ldcs_api_listen.c:100] - registered fd 8 id=0  c=0
[Server.188531@ldcs_api_listen.c:100] - registered fd 7 id=0  c=1
[Server.188531@ldcs_api_listen.c:100] - registered fd 9 id=0  c=2
[Server.188531@ldcs_audit_server_process.c:152] - Initializing cache
[Server.188531@spindle_be_lmon.cc:76] - Sending SIGCONTs to each process to release debugger stops
[Server.188531@spindle_be_lmon.cc:101] - [LMON BE] kill 188515, 18
[Server.188531@spindle_be.cc:158] - Setup done.  Running server.
[Server.188531@ldcs_audit_server_process.c:161] - Entering server loop
[Server.188531@ldcs_api_listen.c:140] - Listening for data
[Client.188515@spindle_bootstrap.c:341] - Launched Spindle Bootstrapper
[Client.188515@spindle_bootstrap.c:331] - Realized /var/tmp/scogland/spindle.188491 to /tmp/scogland/spindle.188491
[Client.188515@spindle_bootstrap.c:373] - Spindle bootstrap launching: hostname.  Args:  hostname
[Client.188515@spindle_bootstrap.c:395] - ERROR: Error execing app: No such file or directory
[FE.188491@cobo_fe_comm.c:57] - Sending exit message to daemons
[FE.188491@cobo_comm.c:186] - Wrote 16 bytes to network: 41 0 0...
[Server.188531@ldcs_api_listen.c:174] - Select returned data.  Calling callback for fd 8 id=0
[Server.188531@cobo_comm.c:166] - Read 16 bytes from network: 41 0 0...
[Server.188531@ldcs_audit_server_handlers.c:795] - Setting up Exiting after receiving exit bcast message
[Server.188531@cobo_comm.c:186] - Wrote 16 bytes to network: 41 0 0...
[Server.188531@ldcs_audit_server_process.c:229] - SERVER[00] STAT: #conn= 0 md_size= 0 md_fan_out= 0 listen_time=1525909055.2707 select_time=1525909055.2707 ts_first_connect=       -1.000000 hostname=quartz34
[Server.188531@ldcs_audit_server_process.c:237] - SERVER[00] STAT:  libread   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:243] - SERVER[00] STAT:  libstore  , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:249] - SERVER[00] STAT:  libdist   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:255] - SERVER[00] STAT:  procdir   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:261] - SERVER[00] STAT:  distdir   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:267] - SERVER[00] STAT:  client_cb , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:273] - SERVER[00] STAT:  server_cb , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:279] - SERVER[00] STAT:  md_cb     , #cnt=    1, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:285] - SERVER[00] STAT:  cl_msg_avg, #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:291] - SERVER[00] STAT:  bcast     , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:297] - SERVER[00] STAT:  preload_cb, #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.188531@ldcs_audit_server_process.c:174] - destroy server (/var/tmp/scogland/spindle.188491,188491)
[Server.188531@ldcs_audit_server_filemngt.c:185] - Cleaning tmpdir /var/tmp/scogland/spindle.188491
[Server.188531@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/scogland/spindle.188491/.
[Server.188531@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/scogland/spindle.188491/..
[Server.182570@spindle_be_main.cc:52] - Spindle Server Cmdline: /usr/tce/packages/spindle/spindle/libexec/spindle/spindle_be --spindle_lmon 0 188491 --lmonsharedsec=1349120343 --lmonsecchk=564495131
[Server.182570@spindle_be_lmon.cc:167] - Launchmon rank 1/2
[Server.182570@spindle_be.cc:74] - Initializing BE with munge-based security
[Server.182570@spindle_be.cc:120] - spindleRunBE setting up network and receiving setup data
[Server.182570@ldcs_audit_server_process.c:64] - Setting up server data structure
[Server.182570@ldcs_audit_server_md_cobo.c:95] - Opening cobo with port 21940 - 21964
[Server.182570@cobo.c:1438] - In cobo_init():
COBO_CONNECT_TIMEOUT: 10, COBO_CONNECT_BACKOFF: 2, COBO_CONNECT_SLEEP: 10, COBO_CONNECT_TIMELIMIT: 600
[Server.182570@cobo.c:790] - Opened socket on port 21940
[Server.182570@cobo.c:182] - _cobo_opt_socket (sockfd=4) flag=1
[handshake.c:163] - Starting handshake from server
[handshake.c:1108] - Looking up server and client addresses for socket 6
[handshake.c:1139] - Sending sig 845d96c1 on network
[handshake.c:1146] - Receiving sig from network
[handshake.c:291] - Creating outgoing packet for handshake
[handshake.c:302] - Encoded packet: server_port = 46165, client_port = 667, uid = 37084, gid = 37084, session_id = 10506943495924867147, signature = 67ad047e
[handshake.c:307] - Encrypting outgoing packet
[handshake.c:444] - Server encrypting packet with munge
[handshake.c:531] - Munge encoded packet successfully
[handshake.c:314] - Encrypted packet to buffer of size 228
[handshake.c:1165] - Sending packet size on network
[handshake.c:1173] - Sending packet on network
[handshake.c:1188] - Receiving packet size from network
[handshake.c:1194] - Received packet size 228
[handshake.c:1207] - Received packet from network
[handshake.c:341] - Creating an expected packet
[handshake.c:354] - Decrypting and checking packet
[handshake.c:808] - Decrypting and checking packet with munge
[handshake.c:1054] - Packets compared equal.
[handshake.c:362] - Successfully completed initial handshake
[handshake.c:1077] - Sharing handshake result 0 with peer
[handshake.c:1085] - Reading peer result
[handshake.c:1091] - Peer reported result of 0
[handshake.c:260] - Completed server handshake.  Result = 0
[Server.182570@cobo.c:1193] - Starting cobo_barrier()
[Server.182570@cobo.c:1201] - Exiting cobo_barrier(), took 0.000235 seconds for 2 procs
[Server.182570@cobo.c:1467] - Exiting cobo_init(), took 0.164691 seconds for 2 procs
[Server.182570@ldcs_audit_server_md_cobo.c:100] - cobo_open complete. Cobo rank 1/2
[Server.182570@cobo.c:1193] - Starting cobo_barrier()
[Server.182570@cobo.c:1201] - Exiting cobo_barrier(), took 0.000268 seconds for 2 procs
[Server.182570@ldcs_audit_server_process.c:76] - Reading setup message from parent
[Server.182570@cobo_comm.c:166] - Read 16 bytes from network: 38 0 0...
[Server.182570@cobo_comm.c:166] - Read 167 bytes from network: 75 -32 2...
[Server.182570@spindle_be.cc:140] - Translated location from $TMPDIR/spindle.188491 to /var/tmp/scogland/spindle.188491
[Server.182570@ldcs_audit_server_process.c:99] - Initializing server data structures
[Server.182570@ldcs_audit_server_process.c:115] - Using PUSH model
[Server.182570@ldcs_audit_server_process.c:132] - Initializing file cache location /var/tmp/scogland/spindle.188491
[Server.182570@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.188491
[Server.182570@ldcs_audit_server_process.c:136] - Initializing connections for clients at /var/tmp/scogland/spindle.188491 and 188491
[Server.182570@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.188491/spindle_comm
[Server.182570@ldcs_api_pipe_notify.c:90] - add watch: ntfd=4 dir=/var/tmp/scogland/spindle.188491/spindle_comm
[Server.182570@ldcs_api_pipe_notify.c:119] - return ntfd=4
[Server.182570@ldcs_audit_server_md_cobo.c:136] - Registering fd 6 for cobo parent connection
[Server.182570@ldcs_api_listen.c:100] - registered fd 6 id=0  c=0
[Server.182570@ldcs_api_listen.c:100] - registered fd 4 id=0  c=1
[Server.182570@ldcs_audit_server_process.c:152] - Initializing cache
[Server.182570@spindle_be_lmon.cc:76] - Sending SIGCONTs to each process to release debugger stops
[Server.182570@spindle_be_lmon.cc:101] - [LMON BE] kill 182562, 18
[Server.182570@spindle_be.cc:158] - Setup done.  Running server.
[Server.182570@ldcs_audit_server_process.c:161] - Entering server loop
[Server.182570@ldcs_api_listen.c:140] - Listening for data
[Client.182562@spindle_bootstrap.c:341] - Launched Spindle Bootstrapper
[Client.182562@spindle_bootstrap.c:331] - Realized /var/tmp/scogland/spindle.188491 to /tmp/scogland/spindle.188491
[Client.182562@spindle_bootstrap.c:373] - Spindle bootstrap launching: hostname.  Args:  hostname
[Client.182562@spindle_bootstrap.c:395] - ERROR: Error execing app: No such file or directory
[Server.182570@ldcs_api_listen.c:174] - Select returned data.  Calling callback for fd 6 id=0
[Server.182570@cobo_comm.c:166] - Read 16 bytes from network: 41 0 0...
[Server.182570@ldcs_audit_server_handlers.c:795] - Setting up Exiting after receiving exit bcast message
[Server.182570@ldcs_audit_server_process.c:229] - SERVER[00] STAT: #conn= 0 md_size= 0 md_fan_out= 0 listen_time=1525909055.2707 select_time=1525909055.2707 ts_first_connect=       -1.000000 hostname=quartz35
[Server.182570@ldcs_audit_server_process.c:237] - SERVER[00] STAT:  libread   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:243] - SERVER[00] STAT:  libstore  , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:249] - SERVER[00] STAT:  libdist   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:255] - SERVER[00] STAT:  procdir   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:261] - SERVER[00] STAT:  distdir   , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:267] - SERVER[00] STAT:  client_cb , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:273] - SERVER[00] STAT:  server_cb , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:279] - SERVER[00] STAT:  md_cb     , #cnt=    1, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:285] - SERVER[00] STAT:  cl_msg_avg, #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:291] - SERVER[00] STAT:  bcast     , #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:297] - SERVER[00] STAT:  preload_cb, #cnt=    0, bytes=    0.00 MB, time=  0.0000 sec
[Server.182570@ldcs_audit_server_process.c:174] - destroy server (/var/tmp/scogland/spindle.188491,188491)
[Server.182570@ldcs_audit_server_filemngt.c:185] - Cleaning tmpdir /var/tmp/scogland/spindle.188491
[Server.182570@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/scogland/spindle.188491/.
[Server.182570@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/scogland/spindle.188491/..
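
The one line that matters in a dump like this is easy to miss. As a minimal sketch, the log can be filtered for ERROR messages. The line format is inferred from the output above and is not a documented spindle format; `parse_line` and `find_errors` are hypothetical helper names.

```python
import re

# Inferred format: [Role.pid@file:line] - message, or [file:line] - message
LOG_RE = re.compile(
    r'^\[(?:(?P<role>\w+)\.(?P<pid>\d+)@)?(?P<file>[\w.]+):(?P<line>\d+)\] - (?P<msg>.*)$'
)


def parse_line(line):
    """Return a dict of log fields, or None for lines that do not match the format."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def find_errors(log_text):
    """Return the parsed records whose message starts with ERROR."""
    records = (parse_line(line) for line in log_text.splitlines())
    return [rec for rec in records if rec and rec["msg"].startswith("ERROR")]


# Three lines taken from the log above.
sample = """\
[Client.188515@spindle_bootstrap.c:373] - Spindle bootstrap launching: hostname.  Args:  hostname
[Client.188515@spindle_bootstrap.c:395] - ERROR: Error execing app: No such file or directory
[handshake.c:260] - Completed server handshake.  Result = 0
"""

found = find_errors(sample)
for rec in found:
    print(f'{rec["role"]}.{rec["pid"]} {rec["file"]}:{rec["line"]} {rec["msg"]}')
```

Run over the full output, a filter like this would surface the two `Error execing app` lines from spindle_bootstrap.c, one per client.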

mplegendre commented 6 years ago

I see you answered my "test another app" question in this email. I can reproduce the "-a no" problem too, and it looks like a PATH search problem: "spindle -a no -n 2 /bin/hostname" works while "spindle -a no -n 2 hostname" fails. I'm pretty sure we need to change an execv to an execvp on the code path that runs with "-a no".
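
The difference between the two calls can be shown in isolation. The sketch below uses Python's `os.execv` and `os.execvp` wrappers rather than spindle's C code, and assumes a `true` command is available on PATH. `exec_exit_code` and `compare_lookups` are hypothetical helpers, and the 255 exit code is chosen only to mirror the srun errors above.

```python
import os
import shutil
import tempfile


def exec_exit_code(exec_fn, argv, cwd):
    """Fork a child, chdir to cwd, exec argv with exec_fn; return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: reaching the finally clause means the exec call failed.
        try:
            os.chdir(cwd)
            exec_fn(argv[0], argv)
        finally:
            os._exit(255)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


def compare_lookups(name="true"):
    """Run `name` three ways from an empty directory and collect the exit codes."""
    full_path = shutil.which(name)
    with tempfile.TemporaryDirectory() as empty_dir:
        return {
            # execv takes the name as a path relative to cwd: no PATH search.
            "execv, bare name": exec_exit_code(os.execv, [name], empty_dir),
            # execvp searches the directories listed in PATH.
            "execvp, bare name": exec_exit_code(os.execvp, [name], empty_dir),
            # execv works once the full path is given.
            "execv, full path": exec_exit_code(os.execv, [full_path], empty_dir),
        }


results = compare_lookups()
for label, code in results.items():
    print(f"{label}: exit code {code}")
```

This matches the pattern reported above: a bare `hostname` fails when no PATH search is done, while `/bin/hostname` works.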

I can similarly get different results by putting the full path to /usr/bin/flux on a command line that includes "--debug=yes" vs a relative path of just flux. I can't immediately explain that, but does it get you farther?

-Matt

On Wed, 9 May 2018, Tom Scogland wrote:

Annoyingly enough, I hadn't thought to test that @grondo. Apparently the answer is "no." Something is wrong with -a on quartz, it can't even run hostname. So we're in an odd spot where we can't use -a but we have to...

Result = 0 [Server.188531@cobo.c:600] - Sending hostlist to rank 1 on quartz35 [Server.188531@cobo.c:1193] - Starting cobo_barrier() [Server.188531@cobo.c:1201] - Exiting cobo_barrier(), took 0.000109 seconds for 2 procs [Server.188531@cobo.c:1459] - Exiting cobo_close(), took 0.083390 seconds for 2 p rocs [Server.188531@cobo.c:1467] - Exiting cobo_init(), took 0.165283 seconds for 2 pr ocs [Server.188531@ldcs_audit_server_md_cobo.c:100] - cobo_open complete. Cobo rank 0 /2 [Server.188531@cobo.c:1193] - Starting cobo_barrier() [Server.188531@cobo.c:1201] - Exiting cobo_barrier(), took 0.000330 seconds for 2 procs [Server.188531@ldcs_audit_server_md_cobo.c:120] - sent FE client signal that serv er are ready 13 [Server.188531@ldcs_audit_server_process.c:76] - Reading setup message from paren t [FE.188491@spindle_fe.cc:252] - Sending parameters to servers [FE.188491@cobo_fe_comm.c:71] - Broadcasting message to daemons [FE.188491@cobo_comm.c:186] - Wrote 16 bytes to network: 38 0 0... [Server.188531@cobo_comm.c:166] - Read 16 bytes from network: 38 0 0... [FE.188491@cobo_comm.c:186] - Wrote 167 bytes to network: 75 -32 2... [Server.188531@cobo_comm.c:166] - Read 167 bytes from network: 75 -32 2... [Server.188531@cobo_comm.c:186] - Wrote 16 bytes to network: 38 0 0... [Server.188531@cobo_comm.c:186] - Wrote 167 bytes to network: 75 -32 2... 
[Server.188531@spindle_be.cc:140] - Translated location from $TMPDIR/spindle.1884 91 to /var/tmp/scogland/spindle.188491 [Server.188531@ldcs_audit_server_process.c:99] - Initializing server data structu res [Server.188531@ldcs_audit_server_process.c:115] - Using PUSH model [Server.188531@ldcs_audit_server_process.c:132] - Initializing file cache locatio n /var/tmp/scogland/spindle.188491 [Server.188531@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.1 88491 [Server.188531@ldcs_audit_server_process.c:136] - Initializing connections for cl ients at /var/tmp/scogland/spindle.188491 and 188491 [Server.188531@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.1 88491/spindle_comm [Server.188531@ldcs_api_pipe_notify.c:90] - add watch: ntfd=9 dir=/var/tmp/scogla nd/spindle.188491/spindle_comm [Server.188531@ldcs_api_pipe_notify.c:119] - return ntfd=9 [Server.188531@ldcs_audit_server_md_cobo.c:136] - Registering fd 8 for cobo paren t connection [Server.188531@ldcs_api_listen.c:100] - registered fd 8 id=0 c=0 [Server.188531@ldcs_api_listen.c:100] - registered fd 7 id=0 c=1 [Server.188531@ldcs_api_listen.c:100] - registered fd 9 id=0 c=2 [Server.188531@ldcs_audit_server_process.c:152] - Initializing cache [Server.188531@spindle_be_lmon.cc:76] - Sending SIGCONTs to each process to relea se debugger stops [Server.188531@spindle_be_lmon.cc:101] - [LMON BE] kill 188515, 18 [Server.188531@spindle_be.cc:158] - Setup done. Running server. [Server.188531@ldcs_audit_server_process.c:161] - Entering server loop [Server.188531@ldcs_api_listen.c:140] - Listening for data [Client.188515@spindle_bootstrap.c:341] - Launched Spindle Bootstrapper [Client.188515@spindle_bootstrap.c:331] - Realized /var/tmp/scogland/spindle.1884 91 to /tmp/scogland/spindle.188491 [Client.188515@spindle_bootstrap.c:373] - Spindle bootstrap launching: hostname. 
Args: hostname [Client.188515@spindle_bootstrap.c:395] - ERROR: Error execing app: No such file or directory [FE.188491@cobo_fe_comm.c:57] - Sending exit message to daemons [FE.188491@cobo_comm.c:186] - Wrote 16 bytes to network: 41 0 0... [Server.188531@ldcs_api_listen.c:174] - Select returned data. Calling callback f or fd 8 id=0 [Server.188531@cobo_comm.c:166] - Read 16 bytes from network: 41 0 0... [Server.188531@ldcs_audit_server_handlers.c:795] - Setting up Exiting after recei ving exit bcast message [Server.188531@cobo_comm.c:186] - Wrote 16 bytes to network: 41 0 0... [Server.188531@ldcs_audit_server_process.c:229] - SERVER[00] STAT: #conn= 0 md_si ze= 0 md_fan_out= 0 listen_time=1525909055.2707 select_time=1525909055.2707 ts_fi rst_connect= -1.000000 hostname=quartz34 [Server.188531@ldcs_audit_server_process.c:237] - SERVER[00] STAT: libread , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:243] - SERVER[00] STAT: libstore , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:249] - SERVER[00] STAT: libdist , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:255] - SERVER[00] STAT: procdir , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:261] - SERVER[00] STAT: distdir , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:267] - SERVER[00] STAT: client_cb , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:273] - SERVER[00] STAT: server_cb , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:279] - SERVER[00] STAT: md_cb , # cnt= 1, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:285] - SERVER[00] STAT: cl_msg_avg, # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:291] - SERVER[00] STAT: bcast , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec 
[Server.188531@ldcs_audit_server_process.c:297] - SERVER[00] STAT: preload_cb, # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.188531@ldcs_audit_server_process.c:174] - destroy server (/var/tmp/scogla nd/spindle.188491,188491) [Server.188531@ldcs_audit_server_filemngt.c:185] - Cleaning tmpdir /var/tmp/scogl and/spindle.188491 [Server.188531@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/sco gland/spindle.188491/. [Server.188531@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/sco gland/spindle.188491/.. [Server.182570@spindle_be_main.cc:52] - Spindle Server Cmdline: /usr/tce/packages /spindle/spindle/libexec/spindle/spindle_be --spindle_lmon 0 188491 --lmonshareds ec=1349120343 --lmonsecchk=564495131 [Server.182570@spindle_be_lmon.cc:167] - Launchmon rank 1/2 [Server.182570@spindle_be.cc:74] - Initializing BE with munge-based security [Server.182570@spindle_be.cc:120] - spindleRunBE setting up network and receiving setup data [Server.182570@ldcs_audit_server_process.c:64] - Setting up server data structure [Server.182570@ldcs_audit_server_md_cobo.c:95] - Opening cobo with port 21940 - 2 1964 [Server.182570@cobo.c:1438] - In cobo_init(): COBO_CONNECT_TIMEOUT: 10, COBO_CONNECT_BACKOFF: 2, COBO_CONNECT_SLEEP: 10, COBO_C ONNECT_TIMELIMIT: 600 [Server.182570@cobo.c:790] - Opened socket on port 21940 [Server.182570@cobo.c:182] - _cobo_opt_socket (sockfd=4) flag=1 [handshake.c:163] - Starting handshake from server [handshake.c:1108] - Looking up server and client addresses for socket 6 [handshake.c:1139] - Sending sig 845d96c1 on network [handshake.c:1146] - Receiving sig from network [handshake.c:291] - Creating outgoing packet for handshake [handshake.c:302] - Encoded packet: server_port = 46165, client_port = 667, uid = 37084, gid = 37084, session_id = 10506943495924867147, signature = 67ad047e [handshake.c:307] - Encrypting outgoing packet [handshake.c:444] - Server encrypting packet with munge [handshake.c:531] - Munge encoded packet 
successfully [handshake.c:314] - Encrypted packet to buffer of size 228 [handshake.c:1165] - Sending packet size on network [handshake.c:1173] - Sending packet on network [handshake.c:1188] - Receiving packet size from network [handshake.c:1194] - Received packet size 228 [handshake.c:1207] - Received packet from network [handshake.c:341] - Creating an expected packet [handshake.c:354] - Decrypting and checking packet [handshake.c:808] - Decrypting and checking packet with munge [handshake.c:1054] - Packets compared equal. [handshake.c:362] - Successfully completed initial handshake [handshake.c:1077] - Sharing handshake result 0 with peer [handshake.c:1085] - Reading peer result [handshake.c:1091] - Peer reported result of 0 [handshake.c:260] - Completed server handshake. Result = 0 [Server.182570@cobo.c:1193] - Starting cobo_barrier() [Server.182570@cobo.c:1201] - Exiting cobo_barrier(), took 0.000235 seconds for 2 procs [Server.182570@cobo.c:1467] - Exiting cobo_init(), took 0.164691 seconds for 2 pr ocs [Server.182570@ldcs_audit_server_md_cobo.c:100] - cobo_open complete. Cobo rank 1 /2 [Server.182570@cobo.c:1193] - Starting cobo_barrier() [Server.182570@cobo.c:1201] - Exiting cobo_barrier(), took 0.000268 seconds for 2 procs [Server.182570@ldcs_audit_server_process.c:76] - Reading setup message from paren t [Server.182570@cobo_comm.c:166] - Read 16 bytes from network: 38 0 0... [Server.182570@cobo_comm.c:166] - Read 167 bytes from network: 75 -32 2... 
[Server.182570@spindle_be.cc:140] - Translated location from $TMPDIR/spindle.1884 91 to /var/tmp/scogland/spindle.188491 [Server.182570@ldcs_audit_server_process.c:99] - Initializing server data structu res [Server.182570@ldcs_audit_server_process.c:115] - Using PUSH model [Server.182570@ldcs_audit_server_process.c:132] - Initializing file cache locatio n /var/tmp/scogland/spindle.188491 [Server.182570@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.1 88491 [Server.182570@ldcs_audit_server_process.c:136] - Initializing connections for cl ients at /var/tmp/scogland/spindle.188491 and 188491 [Server.182570@spindle_mkdir.c:71] - spindle_mkdir on /var/tmp/scogland/spindle.1 88491/spindle_comm [Server.182570@ldcs_api_pipe_notify.c:90] - add watch: ntfd=4 dir=/var/tmp/scogla nd/spindle.188491/spindle_comm [Server.182570@ldcs_api_pipe_notify.c:119] - return ntfd=4 [Server.182570@ldcs_audit_server_md_cobo.c:136] - Registering fd 6 for cobo paren t connection [Server.182570@ldcs_api_listen.c:100] - registered fd 6 id=0 c=0 [Server.182570@ldcs_api_listen.c:100] - registered fd 4 id=0 c=1 [Server.182570@ldcs_audit_server_process.c:152] - Initializing cache [Server.182570@spindle_be_lmon.cc:76] - Sending SIGCONTs to each process to relea se debugger stops [Server.182570@spindle_be_lmon.cc:101] - [LMON BE] kill 182562, 18 [Server.182570@spindle_be.cc:158] - Setup done. Running server. [Server.182570@ldcs_audit_server_process.c:161] - Entering server loop [Server.182570@ldcs_api_listen.c:140] - Listening for data [Client.182562@spindle_bootstrap.c:341] - Launched Spindle Bootstrapper [Client.182562@spindle_bootstrap.c:331] - Realized /var/tmp/scogland/spindle.1884 91 to /tmp/scogland/spindle.188491 [Client.182562@spindle_bootstrap.c:373] - Spindle bootstrap launching: hostname. Args: hostname [Client.182562@spindle_bootstrap.c:395] - ERROR: Error execing app: No such file or directory [Server.182570@ldcs_api_listen.c:174] - Select returned data. 
Calling callback f or fd 6 id=0 [Server.182570@cobo_comm.c:166] - Read 16 bytes from network: 41 0 0... [Server.182570@ldcs_audit_server_handlers.c:795] - Setting up Exiting after recei ving exit bcast message [Server.182570@ldcs_audit_server_process.c:229] - SERVER[00] STAT: #conn= 0 md_si ze= 0 md_fan_out= 0 listen_time=1525909055.2707 select_time=1525909055.2707 ts_fi rst_connect= -1.000000 hostname=quartz35 [Server.182570@ldcs_audit_server_process.c:237] - SERVER[00] STAT: libread , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:243] - SERVER[00] STAT: libstore , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:249] - SERVER[00] STAT: libdist , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:255] - SERVER[00] STAT: procdir , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:261] - SERVER[00] STAT: distdir , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:267] - SERVER[00] STAT: client_cb , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:273] - SERVER[00] STAT: server_cb , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:279] - SERVER[00] STAT: md_cb , # cnt= 1, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:285] - SERVER[00] STAT: cl_msg_avg, # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:291] - SERVER[00] STAT: bcast , # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:297] - SERVER[00] STAT: preload_cb, # cnt= 0, bytes= 0.00 MB, time= 0.0000 sec [Server.182570@ldcs_audit_server_process.c:174] - destroy server (/var/tmp/scogla nd/spindle.188491,188491) [Server.182570@ldcs_audit_server_filemngt.c:185] - Cleaning tmpdir /var/tmp/scogl and/spindle.188491 [Server.182570@ldcs_audit_server_filemngt.c:202] - 
Not cleaning file /var/tmp/sco gland/spindle.188491/. [Server.182570@ldcs_audit_server_filemngt.c:202] - Not cleaning file /var/tmp/sco gland/spindle.188491/..


mplegendre commented 6 years ago

I ran a few more tests, and I think the launchmon failure is a red herring. It may all be about relative vs. absolute PATHs when running with "--debug=yes" or "-a no". Both of these options go through a lesser-used code path in spindle that should call execvp instead of execv. I'll fix this.
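For anyone following along, the execv/execvp difference can be sketched roughly like this in Python. This is only an illustration of the lookup rules, not Spindle's code: the helper names `resolve_like_execvp` and `resolve_like_execv` are made up, and the executable check is simplified to "a regular file with an execute bit set".

```python
import os


def _has_exec_bit(path):
    # Simplification: only checks for a regular file with some execute bit set,
    # not whether the current user is actually allowed to execute it.
    return os.path.isfile(path) and bool(os.stat(path).st_mode & 0o111)


def resolve_like_execvp(name, path=None):
    """Return the file an execvp-style lookup would pick for `name`, or None."""
    if "/" in name:
        # A name containing a slash is used as given; PATH is not consulted.
        return name if _has_exec_bit(name) else None
    if path is None:
        path = os.environ.get("PATH", os.defpath)
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory.
        candidate = os.path.join(directory or ".", name)
        if _has_exec_bit(candidate):
            return candidate
    return None


def resolve_like_execv(name):
    """Return `name` if an execv-style call could run it as given, else None.

    execv does no PATH search, so a bare name such as "hostname" is looked up
    relative to the current directory and usually fails with ENOENT.
    """
    return name if _has_exec_bit(name) else None
```

With a bare command name such as `hostname`, the execv-style helper returns None unless the file happens to sit in the current directory, which would match the "Error execing app: No such file or directory" line in the debug log above.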

Dong, in response to your email: 1) Yes, when Spindle runs LaunchMON the spindle servers are the daemons and the app is flux. And 2) "-a no" is an alias for "--reloc-aout=no", which tells Spindle not to scalably broadcast the application executable to local storage. Not doing this was leaving /proc/PID/exe pointed at the correct location for flux, which achieves a similar result to "--debug=yes".
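To make the /proc/PID/exe point concrete, here is a small Python sketch of the kind of check a program could do to decide whether it is running from its install prefix. It is an assumption on my part that this resembles what a self-locating program does; the helper names `exe_path` and `exe_is_under` and the `proc_root` parameter are hypothetical and exist only to make the example easy to try.

```python
import os


def exe_path(pid="self", proc_root="/proc"):
    """Return the target of <proc_root>/<pid>/exe, i.e. the path recorded for the running binary."""
    return os.readlink(os.path.join(proc_root, str(pid), "exe"))


def exe_is_under(prefix, pid="self", proc_root="/proc"):
    """Return True if the running executable lives under the directory `prefix`."""
    exe = exe_path(pid, proc_root)
    prefix = prefix.rstrip("/")
    # Compare whole path components so "/usr/b" does not match "/usr/bin/flux".
    return exe == prefix or exe.startswith(prefix + "/")
```

If the executable has been copied to node-local storage, the link points at the copy, so a check like `exe_is_under(install_prefix)` would come back False even though the program is installed.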

dongahn commented 6 years ago

@mplegendre: Ah, makes sense. Thanks. I think working the kinks out for this case should be useful as we are also taking the same path for the MLSI project. At some point, I do want to revisit Spindle/Flux based on some of my early prototype efforts -- towards a more seamless integration. But I have bigger fish to fry at this point.

grondo commented 6 years ago

FYI, @garlick has a pending PR #1515 that will fix the initial part of the problem described here. I tried his branch with spindle and we get farther, but seem to still have some trouble with executing interpreted bits within the flux session. Copied from that PR:

FWIW, I tried this version with spindle, and we get a bit further, but then hit a strange error in rc1:

 grondo@ipa1:~$ spindle srun --pty -N2 /g/g0/grondo/flux/bin/flux start flux getattr size
2018-05-10T20:37:56.019185Z broker.err[0]: rc1: /g/g0/grondo/flux/etc/flux/rc1: line 26: /g/g0/grondo/flux/etc/flux/rc1.d/01-enclosing-instance: Bad file descriptor
2018-05-10T20:37:56.019531Z broker.err[0]: Run level 1 Exited with non-zero status (rc=1) 0.2s
srun: error: ipa1: task 0: Exited with exit code 1
 grondo@ipa1:~$ srun --pty -N2 /g/g0/grondo/flux/bin/flux start flux getattr size
2

If I comment out the exec of $rcfile in the rc script, a flux session actually runs under spindle:

 grondo@ipa1:~$ spindle srun --pty -N2 /g/g0/grondo/flux/bin/flux start
(flux-FA86Tx) grondo@ipa1:~$ 

Normal commands do work:

(flux-FA86Tx) grondo@ipa1:~$ flux getattr size
2
(flux-FA86Tx) grondo@ipa1:~$ flux kvs ls      
resource

However, commands that are Lua scripts don't:

(flux-FA86Tx) grondo@ipa1:~$ flux exec hostname
lua: cannot open exec: No such file or directory
(flux-FA86Tx) grondo@ipa1:~$ flux wreck ls
lua: cannot open wreck: No such file or directory
trws commented 6 years ago

I was just poking at this and finding the same thing as you did @grondo. If we turn off shared library relocation, something I thought might be causing issues with lua, it actually segfaults srun:

scogland at quartz34 in ~  (SLURM:714468)
$ spindle -a no -l no --slurm srun -n 2 -N 2  $(which flux) broker flux wreckrun -n 2 hostname
srun: error: quartz34: task 0: Segmentation fault (core dumped)
srun: First task exited 30s ago
srun: task 1: running
srun: task 0: exited abnormally
srun: Terminating job step 714468.5
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: quartz35: task 1: Killed
trws commented 6 years ago

On the more promising side, the spindle environment setup does propagate into the shell set up by the broker, so if we can get through the first few hurdles we may have a relatively easy path to at least a for-now solution.

mcfadden8 commented 6 years ago

@trws, there is a features/readlink branch of hpc/Spindle that should address the absolute versus relative paths, the file permission issues, and the /proc/PID/exe issues. You should be able to run this without the "--debug=yes" and "-a no" flags. Could you try running again with this updated version of spindle?

dongahn commented 6 years ago

OK. I am adding this to a potential target for our next release.

dongahn commented 6 years ago

@mplegendre is testing an alternative mode in spindle so that we may not have to depend on IBM to provide jsrun --spindle. He will also install the latest spindle on x86_64 so we can test flux's compatibility with spindle there as well.

We will meet Monday with the customer to strategize this further.

dongahn commented 6 years ago

The release candidate of Spindle is installed: /collab/usr/global/tools/spindle/toss_3_x86_64_ib/0.11rc1.

garlick commented 6 years ago

I'm getting a bit lost reading through this discussion. What are the next steps here?

dongahn commented 6 years ago

Just running some flux tests under this new version of Spindle will do. @mplegendre and @mcfadden8 said flux will likely run okay with this version. @SteVwonder plans to test this.

mplegendre commented 6 years ago

There is a short-term and a long-term way to move forward on this. Short term, and the current goal, is to run flux under Spindle. You should (I think) be able to treat flux as an arbitrary program that Spindle manages. Spindle can follow forks/execs and spawns from srun into other processes, so by managing flux through spindle we should also manage the applications flux spawns.

This should just look like running flux tests via: "spindle srun -n X flux-test-stuff ...". You can check whether spindle worked on the test application by looking at the /proc/PID/maps of that app and seeing whether the executable/libraries are mmap'd from a spindle-managed area. We think we fixed the stuff that prevented this from working last time.

The long-term solution is to integrate Spindle's job launch API and libraries directly into flux. You can find documentation on that here: https://github.com/hpc/Spindle/blob/devel/doc/spindle_launch_README.md. This would allow users to run Spindle via flux/wreckrun options, and be much more portable.

SteVwonder commented 6 years ago

Ok. I'm testing out running the latest version of Flux under the release candidate of Spindle.

Sanity check: /collab/usr/global/tools/spindle/toss_3_x86_64_ib/0.11rc1/bin/spindle -f no srun --pty -N1 -n1 flux start (AKA no relocation of fork'd child processes) works just fine, but obviously isn't that helpful for our purposes.

For the real test (EDIT: I was originally using the wrong flux binary. I've edited the output here with the correct test):

# herbein1 at hype204 in /nfs/tmp2/herbein1/spindle-test [12:59:23]
→ SPINDLE_DEBUG=2 /collab/usr/global/tools/spindle/toss_3_x86_64_ib/0.11rc1/bin/spindle srun -N1 -n1 flux start
2018-07-17T19:59:31.575092Z broker.err[0]: rc1: /g/g0/herbein1/opt/packages/flux-core/master/etc/flux/rc1: line 26: /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance: Bad file descriptor
2018-07-17T19:59:31.575474Z broker.err[0]: Run level 1 Exited with non-zero status (rc=1) 0.4s
srun: error: hype204: task 0: Exited with exit code 1

# herbein1 at hype204 in /nfs/tmp2/herbein1/spindle-test [12:59:32]
→ grep "01-enclosing-instance" ./spindle_output.hype204.13318 
[Client.2.13391@intercept_exec.c:270] execve_wrapper - Intercepted execve on /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance
[Client.2.13391@intercept_exec.c:134] find_exec - Requesting stat on exec of /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance to validate file
[Server.13352@ldcs_audit_server_handlers.c:292] handle_client_file_request - Server recvd query stat for */usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance.  Dir = /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d, File = 01-enclosing-instance
[Server.13352@ldcs_audit_server_handlers.c:1716] handle_metadata_and_broadcast_file - Stating and broadcasting file */usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance
[Server.13352@ldcs_audit_server_handlers.c:1881] handle_broadcast_metadata - Broadcasting metadata result for */usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance to network (exists)
[Client.2.13391@intercept_exec.c:147] find_exec - Exec operation requesting file: /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance
[Client.2.13391@client.c:510] get_relocated_file - Send file request to server: /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance
[Server.13352@ldcs_audit_server_handlers.c:292] handle_client_file_request - Server recvd query exact path for /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance.  Dir = /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d, File = 01-enclosing-instance
[Server.13352@ldcs_audit_server_handlers.c:342] handle_howto_file - Looked for file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance in cache... file not found
[Server.13352@ldcs_audit_server_handlers.c:342] handle_howto_file - Looked for file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance in cache... file found
[Server.13352@ldcs_audit_server_handlers.c:656] handle_read_and_broadcast_file - Reading and broadcasting file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance
[Server.13352@ldcs_audit_server_handlers.c:552] handle_setup_file_buffer - Allocating buffer space for file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance
[Server.13352@ldcs_audit_server_handlers.c:598] handle_setup_file_buffer - Allocated space for file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance with local file /var/tmp/herbein1/spindle.13316/4a-_usr_workspace_wsb_herbein1_packages_toss3_flux-core_master_etc_flux_rc1.d_01-enclosing-instance and mmap'd at 0x2aaaaaccb000
[Server.13352@ldcs_audit_server_filemngt.c:124] filemngt_read_file - Reading file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance from disk
[Server.13352@ldcs_audit_server_handlers.c:342] handle_howto_file - Looked for file /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance in cache... file found
[Client.2.13391@client.c:512] get_relocated_file - Recv file from server: /var/tmp/herbein1/spindle.13316/4a-_usr_workspace_wsb_herbein1_packages_toss3_flux-core_master_etc_flux_rc1.d_01-enclosing-instance
[Client.2.13391@intercept_exec.c:150] find_exec - Exec file request returned /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance -> /var/tmp/herbein1/spindle.13316/4a-_usr_workspace_wsb_herbein1_packages_toss3_flux-core_master_etc_flux_rc1.d_01-enclosing-instance with errcode 0
[Client.2.13391@intercept_exec.c:78] prep_exec - exec'ing original path /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance because we're running in remap mode
[Client.2.13391@intercept_exec.c:81] prep_exec - test_log(/var/tmp/herbein1/spindle.13316/4a-_usr_workspace_wsb_herbein1_packages_toss3_flux-core_master_etc_flux_rc1.d_01-enclosing-instance)
[Client.2.13391@intercept_exec.c:276] execve_wrapper - execve redirection of /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance to /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance

The above happens whether or not I use -a no. So I think we have made progress on that front.

I'm a little confused by the lines:

[Client.2.13391@intercept_exec.c:78] prep_exec - exec'ing original path /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance because we're running in remap mode
...
[Client.2.13391@intercept_exec.c:276] execve_wrapper - execve redirection of /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance to /usr/workspace/wsb/herbein1/packages/toss3/flux-core/master/etc/flux/rc1.d/01-enclosing-instance

Is that what is supposed to happen?

dongahn commented 6 years ago

Hmmm. Seems there is still an issue with Spindle? Given that MSLI won't need Spindle at this point for their current Sierra push, I am inclined not to make this a blocker for our next release.

mplegendre commented 6 years ago

I fixed the Spindle issues that were preventing flux from running. You can find an install in /collab/usr/global/tools/spindle/toss_3_x86_64_ib/0.11rc2, or the fixed source on the 'devel' branch of hpc/spindle on github. Here's a basic test I could run that shows Spindle intercepting simple apps after starting alongside flux (checking /proc/self/maps is the easiest way to see if Spindle ran):

rzgenie10:~% spindle srun --pty -N 2 -n 2 flux start
rzgenie10:~% flux wreckrun -n 2 hostname
rzgenie10
rzgenie11
rzgenie10:~% flux wreckrun -n 2 cat /proc/self/maps | grep cat
00400000-0040b000 r-xp 00000000 00:28 11557255  /tmp/legendre/spindle.128426/69-_usr_bin_cat
0060b000-0060c000 r--p 0000b000 00:28 11557255  /tmp/legendre/spindle.128426/69-_usr_bin_cat
00400000-0040b000 r-xp 00000000 00:28 11361012  /tmp/legendre/spindle.128426/69-_usr_bin_cat
0060c000-0060d000 rw-p 0000c000 00:28 11557255  /tmp/legendre/spindle.128426/69-_usr_bin_cat
0060b000-0060c000 r--p 0000b000 00:28 11361012  /tmp/legendre/spindle.128426/69-_usr_bin_cat
0060c000-0060d000 rw-p 0000c000 00:28 11361012  /tmp/legendre/spindle.128426/69-_usr_bin_cat
rzgenie10:~%
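
The grep above can also be scripted. A minimal Python sketch, assuming the usual /proc/PID/maps layout (address, perms, offset, dev, inode, optional pathname); the function name `mapped_files_under` is made up for illustration:

```python
def mapped_files_under(maps_text, prefix):
    """Return the set of file paths in /proc/PID/maps text that live under `prefix`."""
    prefix = prefix.rstrip("/") + "/"
    found = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            pathname = fields[5].strip()
            if pathname.startswith(prefix):
                found.add(pathname)
    return found


# Two of the lines from the session above.
sample = """\
00400000-0040b000 r-xp 00000000 00:28 11557255  /tmp/legendre/spindle.128426/69-_usr_bin_cat
0060b000-0060c000 r--p 0000b000 00:28 11557255  /tmp/legendre/spindle.128426/69-_usr_bin_cat
"""
print(mapped_files_under(sample, "/tmp/legendre/spindle.128426"))
```

A non-empty result for the Spindle cache directory means the process mapped at least one relocated file; an empty set means everything was mapped from its original location.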

There were two Spindle problems, both related to corner-cases of script handling. I don't know if flux's corner-case behavior is intended:

SteVwonder commented 6 years ago

Thanks @mplegendre! I'll take another shot this week with your fixes.

grondo commented 6 years ago

Flux is exec'ing the /etc/flux/rc1.d/01-enclosing-instance and /etc/flux/rc1.d/02-hostlist on start as-if they were scripts, but these have malformed interpreter lines that are actually comments. These exec syscalls fail with a bad interpreter return, but Spindle was first going bad trying to understand their interpreter line.

Those should probably have the interpreter shebang line added, good catch! The initial underlying exec fails, but the scripts are eventually run because bash sees they are text files and executes them as bash scripts.
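
A quick way to spot rc scripts in this state is to look at the first line. This is a rough Python sketch, not anything flux or Spindle ships; `interpreter_line` is a hypothetical helper:

```python
def interpreter_line(path):
    """Return the interpreter command from a `#!` first line, or None if the file has none."""
    with open(path, "rb") as f:
        first = f.readline()
    if not first.startswith(b"#!"):
        # For example a plain "# comment" first line: a direct exec of such a
        # file fails, and it only runs if a shell falls back to interpreting it.
        return None
    # Everything after "#!" up to the end of the line, e.g. "/bin/bash".
    return first[2:].strip().decode(errors="replace") or None
```

Any file for which this returns None depends on the shell fallback described above.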

garlick commented 6 years ago

Yeah thanks!

Flux is exec'ing the /usr/libexec/flux/cmd/* scripts with argv[0] set to the first argument (what is typically argv[1]). Spindle was improperly handling this case when the exec target was a script.

Hmm, the flux command driver calls execv() with argv[0] set to the basename of the executable subcommand (script or otherwise). What led you to think argv[0] is set to the first argument? Does spindle expect it to be set to the full path of the executable?
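
To illustrate that the exec path and argv[0] are independent, here is a small Python sketch. It is not the flux command driver: it uses `subprocess.run` with the `executable` argument, `/bin/sh` as a stand-in target, and `flux-exec` purely as an example name.

```python
import subprocess


def argv0_seen_by_child(program, argv0):
    """Exec `program` with argv[0] forced to `argv0` and return the $0 the child reports."""
    result = subprocess.run(
        # The first list element becomes argv[0]; `executable` is the file actually exec'd.
        [argv0, "-c", 'printf %s "$0"'],
        executable=program,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode()


print(argv0_seen_by_child("/bin/sh", "flux-exec"))
```

The child reports whatever the caller put in argv[0], so any tool that reconstructs the exec target from argv[0] rather than from the path passed to exec can end up with a bare basename.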