dspinellis / dgsh

Shell supporting pipelines to and from multiple processes
http://www.spinellis.gr/sw/dgsh/

[RFE] Parallelize processes receiving different input streams #102

Closed trantor closed 6 years ago

trantor commented 6 years ago

Hello. I've tagged the issue as a Request For Enhancement, but I am not sure whether it's something that can actually be achieved right now, whether it falls completely outside the objectives of the project, or whether another piece of software is a better fit for this use case.

What I would like to see is not the ability to distribute the same input stream to several different processes, process each copy differently, and then gather the different outputs, but rather the ability to split the output of a program across several output channels, each of them becoming the input of a process executing the same actions.

To give an example, let's say I pipe some text to a program which sends different portions of the input text to different output channels; each of these portions would then become the input of the same command, which would obviously produce different output for each.

You can do something similar with current multipipe blocks, for instance something like

cat file |
tee |
{{
    grep pattern1 |
    processing_command &

    grep pattern2 |
    processing_command &
}}

but the multipipe block has a finite and predetermined number of pipelines. If, for instance, I wanted to filter the input stream using a number of patterns that is not known in advance, rather than a fixed 2, 3, 4, etc. (such as the 2 in the example), I couldn't do that.
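In plain shell (outside dgsh), a run-time pattern list can at least be handled sequentially. This is only an illustrative sketch: the file, patterns, and the use of `wc -l` as a stand-in for `processing_command` are all made up for the example.

```shell
# Plain-shell sketch (not dgsh): the pattern list can be arbitrary at run
# time, at the cost of running the per-pattern pipelines sequentially
# rather than in parallel. wc -l stands in for processing_command.
printf 'apple\nbanana\navocado\ncherry\n' > /tmp/fruit.txt  # sample input
patterns='^a ^b'                 # built at run time; any number of entries
for pat in $patterns; do
    grep "$pat" /tmp/fruit.txt | wc -l
done
```

This loses the parallelism of a multipipe block, which is exactly the gap this issue is about.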

What I'd like to do, although I suppose it's no longer a graph like the ones you've envisioned, would be something like

cat file |
splitter_program | #splits the input into a variable number of chunks across N output channels
{{{{
    processing_command &
}}}}

I hope I've been able to explain myself in a reasonably clear manner. Let me know what you think, gentlemen.

mfragkoulis commented 6 years ago

The ability to scatter input is already available through dgsh-tee -s (thanks to @dspinellis :-). It will create N chunks of input for the N available output channels. You can take a look at parallel-word-count.sh. Does this feature cover the use cases you have in mind?
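The effect of scattering can be approximated in plain shell for illustration. This sketch is not dgsh: split(1) and background jobs stand in for dgsh-tee -s feeding the channels of a multipipe block, and all file names are invented for the example.

```shell
# Plain-shell approximation of scattering input (illustration only; in
# dgsh, dgsh-tee -s would feed the output channels of a multipipe block).
printf 'a\nb\nc\nd\n' > /tmp/scatter-in.txt
rm -f /tmp/scatter-chunk.*
split -l 2 /tmp/scatter-in.txt /tmp/scatter-chunk.  # two chunks of two lines
for f in /tmp/scatter-chunk.*; do
    wc -l < "$f" &               # each chunk processed by its own process
done
wait                             # gather: wait for all chunk processors
```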

dspinellis commented 6 years ago

Interesting! I don't know what you want the splitter program and the downstream program to do. You might want to look at the implementation of dgsh-parallel to get some ideas.

trantor commented 6 years ago

Thanks gentlemen, I'll take a look at what you suggested.

@dspinellis, basically I'm trying to execute a programme in parallel, providing each instance with different contents over stdin: contents generated by splitting an input stream programmatically according to arbitrary rules. Most programmes handling parallel tasks do not handle stdin the way I want unless you use some sort of parametrized input filenames (e.g. file1, file2, file3, etc.), which of course I don't want to do. Although, now that I think about it, running each instance sequentially with a different input wouldn't be bad either; in fact it would be better for my case. Something à la xargs -P1, just to give an idea.
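The sequential variant described here can be sketched in plain shell: a splitter writes a run-time-determined set of chunks, and the same command then consumes each chunk in turn. The odd/even splitting rule, file names, and the use of `wc -l` as the processing command are all placeholders.

```shell
# Plain-shell sketch of "split programmatically, then feed each chunk to
# the same command, one instance at a time" (the xargs -P1 idea).
printf '1\n2\n3\n4\n5\n6\n' > /tmp/seq-in.txt
rm -f /tmp/seq-part.*
# Splitter: an arbitrary rule decides the chunks at run time
# (here, odd-numbered lines to one chunk, even-numbered to another).
awk '{ print > ("/tmp/seq-part." (NR % 2)) }' /tmp/seq-in.txt
# Run the same command sequentially, once per chunk, over stdin.
for f in /tmp/seq-part.*; do
    wc -l < "$f"                 # stands in for the real command
done
```

The number of chunks, and hence of command invocations, is only known once the splitter has run, which is what a statically allocated multipipe block cannot express.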

trantor commented 6 years ago

A question: if I wanted to wrap, for instance, a Perl script with dgsh-wrap so that it uses several "output" file descriptors, how would I go about doing that? I saw a few examples with multiple inputs but not one with multiple outputs ...

dspinellis commented 6 years ago

If you pass one or more >| arguments to the Perl script, dgsh-wrap will replace each of them with an allocated pipe descriptor. Does this help?

trantor commented 6 years ago

Uhmm ... yes and no. Because that would mean a fixed predetermined number of file descriptors. What if I wanted to allocate a dynamic number of them depending on the input I am actually processing?

dspinellis commented 6 years ago

I am sorry Dave. I cannot do that. Pipes are allocated and connected before data starts flowing, so you currently can't get this type of behavior. Maybe it can be somehow simulated by creating the highest number of connections and only using some of them.
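The suggested workaround can be illustrated in plain shell. Here plain files stand in for dgsh's pre-allocated pipes, and the channel count and contents are invented for the example: the maximum number of channels is created up front, the splitter fills only as many as it needs, and consumers of the unused channels simply see nothing.

```shell
# Sketch of "allocate the maximum, use a subset" (plain files stand in
# for dgsh pipes; a reader of an unused channel would just see EOF).
MAX=4
i=1
while [ "$i" -le "$MAX" ]; do : > /tmp/chan.$i; i=$((i + 1)); done
# The splitter decides at run time that only 2 channels are needed.
printf 'one\n' > /tmp/chan.1
printf 'two\ntwo\n' > /tmp/chan.2
# Downstream: process only the channels that received data.
i=1
while [ "$i" -le "$MAX" ]; do
    [ -s /tmp/chan.$i ] && wc -l < /tmp/chan.$i
    i=$((i + 1))
done
```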

trantor commented 6 years ago

Thanks gentlemen. I think I managed to do it in a different fashion.