Logging/Monitoring PaSh spawned processes

angelhof commented 2 years ago

Currently, it is pretty hard to have any visibility on the dataflow processes that are spawned by PaSh. It would be great if we could have some form of observability of their status/progress.

One way to do it would be to log events related to each of these processes (running, progress (based on INPUT/OUTPUT processed), exit status, etc). Then we could develop a separate extension to analyze and show these logs.

For more info see discussion in #491

asharma0703 commented 2 years ago

Do you have any ideas for how these events could be logged? Not super familiar with prior art here but from some cursory searches I think adding an strace option in debug mode might make sense.

For example if you prepend strace -Y -f -e execve to the args for execution of the compiled script in compiler/pash.py, and run the input from #491, you get some output like this (in between all the spawns/exits during setup/teardown):

[pid 124519<bash>] +++ exited with 0 +++
[pid 124507<auto-split.sh>] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=124519<bash>, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 124522 attached
[pid 124522<bash>] execve("/home/arnav/pash/runtime/split", ["/home/arnav/pash/runtime/split", "/tmp/pash_1Zz3NvgG6J", "107860", "/tmp/pash_EzzpX5c/1c722115196647"..., "/tmp/pash_EzzpX5c/1c722115196647"...], 0x56012cd5f380 /* 93 vars */) = 0
[pid 124509<bash>] execve("/usr/bin/tr", ["tr", "-s", " ", ""], 0x55ad1e5942a0 /* 93 vars */ <unfinished ...>
[pid 124508<bash>] execve("/usr/bin/tr", ["tr", "-s", " ", ""], 0x55ad1e5941c0 /* 93 vars */ <unfinished ...>
[pid 124509<tr>] <... execve resumed>)  = 0
[pid 124508<tr>] <... execve resumed>)  = 0
tr: tr: when not truncating set1, string2 must be non-emptywhen not truncating set1, string2 must be non-empty

[pid 124509<tr>] +++ exited with 1 +++
[pid 124508<tr>] +++ exited with 1 +++
[pid 124517<eager>] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=124517<eager>, si_uid=1000} ---
[pid 124517<eager>] +++ killed by SIGPIPE +++
[pid 124522<split>] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=124522<split>, si_uid=1000} ---
[pid 124518<eager>] +++ exited with 0 +++
[pid 124522<split>] +++ killed by SIGPIPE +++
[pid 124507<auto-split.sh>] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=124522, si_uid=1000, si_status=SIGPIPE, si_utime=0, si_stime=0} ---
[pid 124511<eager.sh>] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=124518, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 124510<eager.sh>] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=124517, si_uid=1000, si_status=SIGPIPE, si_utime=0, si_stime=0} ---
strace: Process 124523 attached
[pid 124515<eager>] +++ exited with 0 +++
[pid 124513<cat>] +++ exited with 0 +++
[pid 124512<eager.sh>] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=124515, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 124473<env>] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=124506, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 124523<bash>] execve("/usr/bin/rm", ["rm", "-f", "/tmp/pash_1Zz3NvgG6J"], 0x56012cd583c0 /* 93 vars */strace: Process 124524 attached
) = 0

It's verbose / not the easiest to parse, and probably something to include only in debug mode due to strace performance impact. It also doesn't really address the issue of progress on input data. Might help debugging issues with exit codes like the above though.

angelhof commented 2 years ago

Thanks a lot for this suggestion! That would be a great way to debug these issues, and we could only enable it when doing heavyweight fine-grained debugging.

Do you have any ideas for how these events could be logged? Not super familiar with prior art here but from some cursory searches I think adding an strace option in debug mode might make sense.

Something that would also be possible for a more lightweight instrumentation would be to do the modifications at the script level, i.e., modifying the produced parallel script to do the following:

Monitor process status
Add some form of monitoring command (like pv) between commands in the dataflow to monitor throughput etc.

For reference, the parallel script produced by PaSh can be seen if PaSh is called using the -d 1 option. For example:

$ cat test.sh
cat README.md | tr A-Z a-z | sort
$ ./pa.sh -d 1 test.sh > /dev/null
...
rm_pash_fifos() {
{ rm -f "/tmp/pash_CODhJWu/63f16760c5ba4ff5ad1b8707209a2dcc/#fifo2" ; }
  ...
}
mkfifo_pash_fifos() {
{ mkfifo "/tmp/pash_CODhJWu/63f16760c5ba4ff5ad1b8707209a2dcc/#fifo2" ; }
  ...
}
rm_pash_fifos
mkfifo_pash_fifos
pids_to_kill=""
{ cat README.md >"/tmp/pash_CODhJWu/63f16760c5ba4ff5ad1b8707209a2dcc/#fifo2" & }
pids_to_kill="${!} ${pids_to_kill}"
{ auto-split.sh "/tmp/pash_CODhJWu/63f16760c5ba4ff5ad1b8707209a2dcc/#fifo2" "/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo7" "/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo8" & }
pids_to_kill="${!} ${pids_to_kill}"
{ tr A-Z a-z <"/tmp/pash_CODhJWu/4b003a3b930540ab95504f2823895894/#fifo15" >"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo10" & }
pids_to_kill="${!} ${pids_to_kill}"
{ tr A-Z a-z <"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo8" >"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo11" & }
pids_to_kill="${!} ${pids_to_kill}"
{ sort <"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo10" >"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo12" & }
pids_to_kill="${!} ${pids_to_kill}"
{ sort <"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo11" >"/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo13" & }
pids_to_kill="${!} ${pids_to_kill}"
{ eager.sh "/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo7" "/tmp/pash_CODhJWu/4b003a3b930540ab95504f2823895894/#fifo15" /tmp/pash_CODhJWu/469d17904a62453eb710dbb9e7e13554/pash_eager_intermediate_1 & }
pids_to_kill="${!} ${pids_to_kill}"
{ eager.sh "/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo12" "/tmp/pash_CODhJWu/4b003a3b930540ab95504f2823895894/#fifo16" /tmp/pash_CODhJWu/469d17904a62453eb710dbb9e7e13554/pash_eager_intermediate_2 & }
pids_to_kill="${!} ${pids_to_kill}"
{ runtime/eager.sh "/tmp/pash_CODhJWu/ba84f90ccccd40a88a9431e7ee2cb1a2/#fifo13" "/tmp/pash_CODhJWu/4b003a3b930540ab95504f2823895894/#fifo17" /tmp/pash_CODhJWu/469d17904a62453eb710dbb9e7e13554/pash_eager_intermediate_3 & }
pids_to_kill="${!} ${pids_to_kill}"
{ sort -m "/tmp/pash_CODhJWu/4b003a3b930540ab95504f2823895894/#fifo16" "/tmp/pash_CODhJWu/4b003a3b930540ab95504f2823895894/#fifo17" & }
pids_to_kill="${!} ${pids_to_kill}"
source wait_for_output_and_sigpipe_rest.sh ${!}
rm_pash_fifos
( exit "${internal_exec_status}" )
...

In this parallel script we could add pv nodes in the place of every fifo, and monitor the spawned commands (since we already have their pids). I am not sure if that introduce any other issues though.

binpash / pash

Logging/Monitoring PaSh spawned processes #500