binpash / pash

PaSh: Light-touch Data-Parallel Shell Processing
MIT License

PaSh overly simplifies scheduling constraints on scripts (waits for all commands to finish instead of one) #77

Open angelhof opened 3 years ago

angelhof commented 3 years ago

There is a bug in how our compiler handles scheduling constraints.

At the moment we compile the following:

cmd1 &
cmd2

to a dataflow with two nodes (cmd1, cmd2). We then compile that dataflow to a parallel script that waits for all of its outputs to be done:

cmd1 &
cmd2 &
wait

However, this adds additional dependencies and constraints to the final parallel script, because we wait for both cmd1 and cmd2 to finish, whereas the original script only waits for cmd2. One way to solve this would be to know exactly which nodes in the final graph we need to wait for, by keeping this information from the original script.
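A sketch of that idea, assuming the compiler records which node ran in the foreground in the original script (here cmd2): wait only on that node's PID instead of issuing a blanket `wait`. The `sleep` and the subshell below are stand-ins for cmd1 and cmd2, not PaSh's actual emitted code.

```shell
# Hypothetical fix sketch: wait only on the node the original script
# waited for (cmd2), leaving cmd1 truly backgrounded.
sleep 5 & bg_pid=$!                                # cmd1: backgrounded in the original
out_file=$(mktemp)
( printf 'cmd2 done' > "$out_file" ) & fg_pid=$!   # cmd2: foreground in the original
wait "$fg_pid"                                     # wait only for cmd2's node
result=$(cat "$out_file"); rm -f "$out_file"
kill "$bg_pid" 2>/dev/null                         # demo-only cleanup
```

The script can then exit as soon as cmd2's node is done, matching the original semantics.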

This is not a terrible issue for the system (and can probably be solved in a hacky way), but it is a problem with the formal development related to PaSh.

First, we need to write some tests that expose this issue. Then we can solve it by keeping information about which nodes in the graph we do and do not wait for. Finally, we need to figure out how to parallelize the nodes that we must wait for.
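As a starting point, a timing-based test along these lines (hypothetical, not taken from the PaSh suite) would fail under the current compilation, since the blanket `wait` blocks on the long-running background command:

```shell
#!/bin/sh
# Under the original script's semantics this finishes right after the
# echo; an over-constrained compilation that waits on both nodes would
# instead block for the full 10 seconds of the sleep.
start=$(date +%s)
sleep 10 &              # cmd1: long-running, backgrounded
slow_pid=$!
echo 'hello'            # cmd2: the only command we should wait for
elapsed=$(( $(date +%s) - start ))
kill "$slow_pid" 2>/dev/null   # cleanup so the demo exits promptly
```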

angelhof commented 3 years ago

Copying older note from #54 for completeness.

## This way of fixing the problem suffers from some issues.
##
## - First of all, gathering the children after the end of the graph
##   seems to gather more than just the alive nodes. This could lead
##   to killing some random pid in the system. This could potentially
##   be solved by gathering all pids incrementally.
##
## - In addition, this way of getting the last pid does not work if
##   there is more than one output. (This is never the case in our
##   tests, but it could be.)
##
## - Finally, it is not local, since all of the monitoring happens
##   globally. Ideally, it should be done by a wrapper in each
##   node. The wrapper should monitor if the node dies, and if so it
##   should send SIGPIPE to all its producers.
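The per-node wrapper described in the last point might look roughly like this. This is an illustrative sketch only; the function name and argument convention are assumptions, not PaSh code.

```shell
# Hypothetical per-node wrapper: run the node's command, and once it
# exits (or dies), send SIGPIPE to each producer PID passed as a
# remaining argument, so producers stop writing to a dead consumer.
node_wrapper() {
    node_cmd=$1; shift        # first arg: the node's command
    $node_cmd                 # run the node to completion or death
    for producer in "$@"; do  # remaining args: producer PIDs
        kill -s PIPE "$producer" 2>/dev/null
    done
}
```

A node would then be launched as, e.g., `node_wrapper "sort" "$producer_pid" &`, keeping the monitoring local to each node instead of gathering pids globally.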