ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Summary modules #8

Closed ewels closed 9 years ago

ewels commented 10 years ago

Cluster Flow currently has a single hard-coded summary module which runs when all jobs in a pipeline are finished: cf_all_runs_finished.cfmod. This module effectively knits together all of the parallelised dependencies so that it can send a single report e-mail.

It would be nice if users could write their own modules with similar post-parallelisation functionality. Initially this could be for custom report generation within specific pipelines, though it could be extended to more complicated tasks in the future (e.g. merging files and then running downstream modules on the result).

I think that the best way to handle this is by using a second special character. # is currently used to denote a module, so perhaps > could denote a module with a collecting function. For example:

#trim_galore
    #bowtie
        >bowtie_report

...which, when supplied with 3 files, would run as:

- trim_galore - bowtie \
- trim_galore - bowtie \
- trim_galore - bowtie \
                        - bowtie_report

This would partially future-proof the syntax for subsequent modules. For example, the following would work:

#trim_galore
    #bowtie
        >merge_alignments
            >further_processing
                >report

...which, when supplied with 3 files, would run as:

- trim_galore - bowtie \
- trim_galore - bowtie \
- trim_galore - bowtie \
                        - merge_alignments - further_processing - report
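
The dependency structure drawn above could be sketched roughly as follows. This is only an illustration in Python, not how Cluster Flow itself builds its jobs: the function name build_jobs, the "module:file" job IDs and the file names are all made up for the example. Each # module gets one job per input file, while each > module gets a single job that waits for everything before it.

```python
from typing import Dict, List


def build_jobs(per_file: List[str], summary: List[str],
               files: List[str]) -> Dict[str, List[str]]:
    """Map each (hypothetical) job ID to the job IDs it must wait for."""
    deps: Dict[str, List[str]] = {}
    last_per_file: List[str] = []

    # '#' modules: one independent chain of jobs per input file.
    for f in files:
        previous: List[str] = []
        for module in per_file:
            job_id = f"{module}:{f}"
            deps[job_id] = previous
            previous = [job_id]
        last_per_file.extend(previous)

    # '>' modules: a single job each, processed in series. The first one
    # waits for the end of every per-file chain (all or nothing).
    previous = last_per_file
    for module in summary:
        deps[module] = previous
        previous = [module]
    return deps


jobs = build_jobs(
    per_file=["trim_galore", "bowtie"],
    summary=["merge_alignments", "further_processing", "report"],
    files=["a.fq", "b.fq", "c.fq"],  # assumed example file names
)
for job, waits_for in jobs.items():
    print(job, "<-", waits_for)
```

Here merge_alignments ends up depending on all three bowtie jobs, and every later summary job depends only on the one before it, which matches the "in series" behaviour described for > modules.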

Note that this extension will not support splitting or partial merging. It's all or nothing.

Once a > module is used in a pipeline, no further # modules can be specified (doing so will raise an error). All subsequent summary modules will be processed in series (indentation will be ignored). In other words, the following would not work:

#trim_galore
    #bowtie
        >merge_alignments
            #further_processing_1
            #further_processing_2
            #further_processing_3
                >report

If written like this, summary modules will have to be supplied with multiple run files. I'm not sure how to handle this yet whilst maintaining compatibility with the way that modules currently work.
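
The rule that no # module may follow a > module could be checked when the pipeline is read. The sketch below is a guess at what such a check might look like, written in Python for illustration only: parse_pipeline and PipelineSyntaxError are hypothetical names, and it assumes that only lines starting with # or > matter, skipping everything else.

```python
from typing import List, Tuple


class PipelineSyntaxError(ValueError):
    """Hypothetical error type for an invalid pipeline definition."""


def parse_pipeline(text: str) -> Tuple[List[Tuple[int, str]], List[str]]:
    """Split a pipeline into (indent, name) '#' modules and '>' modules."""
    per_file: List[Tuple[int, str]] = []
    summary: List[str] = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped[0] not in "#>":
            continue  # assumption: other lines are not module lines
        name = stripped[1:].strip()
        if stripped[0] == ">":
            # Indentation is ignored: summary modules run in series.
            summary.append(name)
        elif summary:
            raise PipelineSyntaxError(
                f"line {line_no}: '#{name}' cannot follow a '>' module"
            )
        else:
            indent = len(raw) - len(raw.lstrip())
            per_file.append((indent, name))
    return per_file, summary


valid = "#trim_galore\n    #bowtie\n        >merge_alignments\n            >report\n"
invalid = "#trim_galore\n    #bowtie\n        >merge_alignments\n            #further_processing_1\n"

print(parse_pipeline(valid))
try:
    parse_pipeline(invalid)
except PipelineSyntaxError as err:
    print("rejected:", err)
```

Keeping the indentation for # modules but discarding it for > modules mirrors the proposal: the former still need their tree structure, the latter are simply an ordered list.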

mspivakov commented 10 years ago

Looks like a neat syntax for what would clearly be a useful option :)