ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Add config option for job parallelisation #42

Closed: ewels closed this issue 9 years ago

ewels commented 9 years ago

On some very busy clusters, Cluster Flow can be slow because of its style of submitting a separate queue job for every module. In theory this is the fastest way to process jobs, because multiple steps in the same pipeline run can execute side by side. In practice it can mean that the pipeline waits ten hours in the queue just to run a 10-second module that sends the completion e-mail.

So, to get around this, I could add a config option to choose how to parallelise, with three settings: per_module (default), per_run and per_pipeline.

I have recently added functionality for Cluster Flow to run locally by running bash scripts. This behaviour could be extended to run the entire pipeline in series (per_pipeline) by submitting this bash script to the queue as a single job. For per_run, CF could write one bash script per run in this style and submit each as its own queue job, with one final job containing the summary modules, dependent on all the others.
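To make the three modes concrete, here is a minimal sketch of how modules might be grouped into queue jobs under each setting. This is not Cluster Flow code: the function name `plan_jobs`, the job dictionary layout, and the module names in the example are all assumptions made for illustration.

```python
from typing import Dict, List

MODES = ("per_module", "per_run", "per_pipeline")


def plan_jobs(runs: Dict[str, List[str]],
              summary_modules: List[str],
              mode: str = "per_module") -> List[dict]:
    """Group modules into queue jobs (hypothetical helper, not CF's API).

    Each job is a dict with an id, the steps it runs in series,
    and the ids of the jobs it must wait for.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    summary_steps = [f"summary:{m}" for m in summary_modules]

    if mode == "per_pipeline":
        # Everything runs in series inside one queue job.
        steps = [f"{run}:{mod}" for run, mods in runs.items() for mod in mods]
        return [{"id": "job_1", "steps": steps + summary_steps, "depends_on": []}]

    jobs: List[dict] = []
    final_ids: List[str] = []  # last job of each run; summary waits on these

    for run, mods in runs.items():
        if mode == "per_run":
            # One queue job per run, modules executed in series.
            job_id = f"job_{len(jobs) + 1}"
            jobs.append({"id": job_id,
                         "steps": [f"{run}:{mod}" for mod in mods],
                         "depends_on": []})
            final_ids.append(job_id)
        else:
            # per_module: one queue job per module, chained within the run.
            previous: List[str] = []
            for mod in mods:
                job_id = f"job_{len(jobs) + 1}"
                jobs.append({"id": job_id,
                             "steps": [f"{run}:{mod}"],
                             "depends_on": previous})
                previous = [job_id]
            final_ids.extend(previous)

    if summary_steps:
        if mode == "per_run":
            # A single final job holds all summary modules.
            jobs.append({"id": f"job_{len(jobs) + 1}",
                         "steps": summary_steps,
                         "depends_on": final_ids})
        else:
            # per_module: each summary module is its own chained job.
            waits_on = final_ids
            for step in summary_steps:
                job_id = f"job_{len(jobs) + 1}"
                jobs.append({"id": job_id, "steps": [step], "depends_on": waits_on})
                waits_on = [job_id]

    return jobs


# Illustrative input: two runs, two modules each, one summary module.
example_runs = {"sample_1": ["fastqc", "bowtie"], "sample_2": ["fastqc", "bowtie"]}
example_summary = ["send_email"]  # hypothetical module name

for m in MODES:
    print(m, len(plan_jobs(example_runs, example_summary, m)), "queue job(s)")
```

The trade-off is visible in the job counts: fewer queue jobs mean fewer waits in a busy queue, at the cost of running steps in series that could otherwise run side by side.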

ewels commented 9 years ago

If the job time specification works well (see #45), then this might not be necessary.

ewels commented 9 years ago

Seems to have been working well lately, so probably unnecessary. Closing this for now.