binpash / pash

PaSh: Light-touch Data-Parallel Shell Processing
MIT License

Questions about pash. #691

Closed nkh closed 1 year ago

nkh commented 1 year ago

Hi, thank you for the efforts put into pash.

I've read the paper and looked at presentations but my knowledge is still superficial.

I can't find anything (though I may have missed it) about running different processes in parallel; an example would be two unrelated pipelines that could run concurrently.
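For concreteness, here is a minimal sketch of that situation written by hand in plain shell. The scratch directory and file names are hypothetical and are not taken from any real script; the point is only that the two pipelines share no files, so they can be started in the background and joined with `wait`.

```shell
#!/bin/sh
# Hypothetical scratch directory and inputs, for illustration only.
dir=$(mktemp -d)
printf 'b\na\nb\n' > "$dir/words.txt"
printf '3\n1\n2\n' > "$dir/nums.txt"

# Two pipelines with no files in common, each started in the background.
sort "$dir/words.txt" | uniq -c > "$dir/word_counts.txt" &
sort -n "$dir/nums.txt" | tail -n 1 > "$dir/max.txt" &

# Block until both background pipelines have finished.
wait

cat "$dir/word_counts.txt" "$dir/max.txt"
```

The question is whether a tool could discover that these two pipelines are independent without the author writing `&` and `wait` by hand.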

I also wonder if pash could take a script that has already been parallelized manually and propose changes.

Here's a "simple" example of a script that generates multiple files, some depending on previously generated files.

https://github.com/nkh/tdiff/blob/main/tree_synch

I could generate the files through an inference engine (make), which would remove the burden of manually parallelizing the process (something I may do as an exercise), but that still relies on dependency knowledge that has to be written by hand. My question is: is it possible to analyze the script, list the dependencies, and possibly present a better workflow?
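To illustrate what "dependency knowledge written by hand" looks like, here is a small sketch in plain shell. The steps A to D and their file names are hypothetical and are not the steps of tree_synch; each `wait` encodes one dependency that the author has to know in advance.

```shell
#!/bin/sh
# Hypothetical scratch directory, for illustration only.
dir=$(mktemp -d)

# Steps A and B are independent, so they start together.
printf 'x\ny\n' > "$dir/a.txt" &
pid_a=$!
printf 'z\n' > "$dir/b.txt" &
pid_b=$!

# Step C reads only a.txt, so it waits for A alone and may overlap with B.
wait "$pid_a"
wc -l < "$dir/a.txt" > "$dir/c.txt" &
pid_c=$!

# Step D reads both b.txt and c.txt, so it waits for B and C.
wait "$pid_b" "$pid_c"
cat "$dir/b.txt" "$dir/c.txt" > "$dir/d.txt"
```

With make, the same knowledge would move into the prerequisite lists of the rules, and `make -j` would take care of the scheduling, but the prerequisites would still have to be written manually.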

angelhof commented 1 year ago

Hello @nkh! PaSh currently parallelizes scripts if it is given annotations for some commands in these scripts. We have written multiple annotations for common commands in POSIX and GNU Coreutils (e.g., tr, sort), but that doesn't work for arbitrary programs (like the perl scripts in your example).
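As a rough illustration of why command knowledge matters, the sketch below data-parallelizes a `sort` by hand: split the input, sort the chunks concurrently, then merge the sorted runs with `sort -m`. This is a hand-written example under assumed file names, not output produced by PaSh; the rewrite is only safe because the behavior of `sort` is known, which is what cannot be assumed for an arbitrary program.

```shell
#!/bin/sh
# Hypothetical scratch directory and input, for illustration only.
dir=$(mktemp -d)
printf '%s\n' pear apple fig kiwi date lime > "$dir/in.txt"

# Sequential baseline.
sort "$dir/in.txt" > "$dir/seq.txt"

# Split the input into chunks of three lines: chunk.aa and chunk.ab.
split -l 3 "$dir/in.txt" "$dir/chunk."

# Sort each chunk concurrently.
sort "$dir/chunk.aa" > "$dir/sorted.aa" &
sort "$dir/chunk.ab" > "$dir/sorted.ab" &
wait

# Merge the already-sorted runs without re-sorting them.
sort -m "$dir/sorted.aa" "$dir/sorted.ab" > "$dir/par.txt"

# cmp exits with status 0 when the two outputs are identical.
cmp "$dir/seq.txt" "$dir/par.txt"
```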

However, we are currently designing an extension of PaSh that can determine dependencies at runtime (without any prior command knowledge) and parallelize based on that. This work could also be later extended to list dependencies and present a better workflow after the script is done executing for the first time. This extension is not yet ready but we have written a short paper on it that might interest you (https://sigops.org/s/conferences/hotos/2023/papers/liargkovas.pdf).

nkh commented 1 year ago

Hi @angelhof, thank you for the answer.

It's a bash script :) but I must be doing something right if I can make it look like perl.

I'll have a look at the paper, as it's exactly what I'm talking about: some commands in random order, but with dependencies, that could be parallelized; or commands that are already parallelized, also with dependencies like in the script, but which could be parallelized better. I don't think the script above could be parallelized better, but being able to verify that, or to see it sequenced differently, is interesting.

I started rewriting it in make, but refactoring it is much more complex than starting with make in the first place.

I will gladly read your paper, and maybe you can consider generating a makefile, as make works very well for parallelization.

angelhof commented 1 year ago

I meant that your bash script invokes perl scripts during its execution!

> I will gladly read your paper, and maybe you can consider generating a makefile, as make works very well for parallelization.

That is something that we have considered and might do if we find the time :)

Closing the issue for now; please reopen if you have additional questions.