ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Support multiple outputs #63

Open ewels opened 9 years ago

ewels commented 9 years ago

One current limitation of modules is that they typically output only one file type. Sample grouping information is also lost. This is fine when, for example, an alignment module has five inputs and creates five outputs. But if an analysis module creates three different output files for each input (and different downstream modules could make use of different file types or combinations), it becomes hard to tell which files belong together.

At the cost of some extra complexity, output files could be assigned a group based on the original starting file. Modules could then filter their input by file type, and output as many files as they need.

Currently:

start_000    file1.fq.gz
start_000    file2.fq.gz
align_838    file1.bam
align_838    file2.bam
analyse_239    file1_stats.csv
analyse_239    file1_filtered.bam
analyse_239    file2_stats.csv
analyse_239    file2_filtered.bam

Suggested:

start_000    file1    file1.fq.gz
start_000    file2    file2.fq.gz
align_838    file1    file1.bam
align_838    file2    file2.bam
analyse_239    file1    file1_stats.csv
analyse_239    file1    file1_filtered.bam
analyse_239    file2    file2_stats.csv
analyse_239    file2    file2_filtered.bam

This grouping means that modules downstream of analyse_239 can use both the alignments and the stats file in combination safely, knowing that they came from the same sample.
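As a rough illustration of how a downstream module might read the suggested three-column layout, here is a minimal Python sketch. It is not Cluster Flow code: the function name `group_outputs` is made up for this example, and it assumes columns are whitespace-separated and that filenames contain no whitespace.

```python
from collections import defaultdict

# The "Suggested" listing from above: job ID, sample group, filename.
RUN_FILE_LINES = """\
start_000    file1    file1.fq.gz
start_000    file2    file2.fq.gz
align_838    file1    file1.bam
align_838    file2    file2.bam
analyse_239    file1    file1_stats.csv
analyse_239    file1    file1_filtered.bam
analyse_239    file2    file2_stats.csv
analyse_239    file2    file2_filtered.bam
"""


def group_outputs(text, job_id):
    """Return {sample_group: [filenames]} for the outputs of one job ID.

    Hypothetical helper: assumes three whitespace-separated columns.
    """
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        job, sample, filename = parts
        if job == job_id:
            groups[sample].append(filename)
    return dict(groups)


grouped = group_outputs(RUN_FILE_LINES, "analyse_239")
for sample, filenames in grouped.items():
    print(sample, filenames)
```

Each sample group keeps its own list of files, so a module that needs both the filtered alignment and the stats file for `file1` never picks up a file from `file2` by accident.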

We could also add a 'type' field to describe the kind of output being generated. Modules could then list the input types they need and the output types they generate at the --request stage, enabling a pipeline to be checked for compatibility. For example:

start_000    file1    fastq    file1.fq.gz
start_000    file2    fastq    file2.fq.gz
align_838    file1    bam    file1.bam
align_838    file2    bam    file2.bam
analyse_239    file1    stats    file1_stats.csv
analyse_239    file1    counts    file1_counts.csv
analyse_239    file1    bam    file1_filtered.bam
analyse_239    file2    stats    file2_stats.csv
analyse_239    file2    counts    file2_counts.csv
analyse_239    file2    bam    file2_filtered.bam
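To make the compatibility check concrete, here is a hedged Python sketch of how declared input and output types could be validated before a pipeline is launched. Everything in it is an assumption for illustration: the `MODULES` table, the `report` module and its `html` output type, and the `check_pipeline` function are hypothetical and do not reflect how Cluster Flow modules actually respond to --request. It also assumes that any type produced by an earlier step stays available to all later steps.

```python
# Hypothetical declarations: module name -> (required input types, produced output types).
MODULES = {
    "align": ({"fastq"}, {"bam"}),
    "analyse": ({"bam"}, {"stats", "counts", "bam"}),
    "report": ({"stats", "counts"}, {"html"}),  # invented module, for illustration only
}


def check_pipeline(steps, start_types, modules=MODULES):
    """Return a list of (module, missing_types) problems; empty means compatible.

    Assumes every type produced upstream remains available downstream.
    """
    available = set(start_types)
    problems = []
    for name in steps:
        needs, makes = modules[name]
        missing = needs - available
        if missing:
            problems.append((name, sorted(missing)))
        available |= makes
    return problems


good = check_pipeline(["align", "analyse", "report"], {"fastq"})
bad = check_pipeline(["analyse", "align"], {"fastq"})
print("good pipeline problems:", good)
print("bad pipeline problems:", bad)
```

The second call reports that `analyse` is missing a `bam` input, because it runs before anything has produced one.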

Using a named tag would make it possible to distinguish between different kinds of file. For instance, if a module generates a CSV with a specific format, that format could be included in the tag name.
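A module could then pick its inputs by tag rather than by guessing from filenames. The Python sketch below shows one way this might look for the four-column layout; `select_inputs` is a hypothetical name, the columns are assumed to be whitespace-separated, and samples that lack any of the requested types are simply dropped.

```python
from collections import defaultdict

# Part of the four-column listing from above: job ID, sample group, type, filename.
TYPED_LINES = """\
align_838    file1    bam    file1.bam
align_838    file2    bam    file2.bam
analyse_239    file1    stats    file1_stats.csv
analyse_239    file1    counts    file1_counts.csv
analyse_239    file1    bam    file1_filtered.bam
analyse_239    file2    stats    file2_stats.csv
analyse_239    file2    counts    file2_counts.csv
analyse_239    file2    bam    file2_filtered.bam
"""


def select_inputs(text, job_id, wanted_types):
    """Return {sample: {type: [filenames]}} for one job, limited to wanted types.

    Hypothetical helper: samples missing any wanted type are left out.
    """
    wanted = set(wanted_types)
    by_sample = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        job, sample, file_type, filename = parts
        if job == job_id and file_type in wanted:
            by_sample[sample][file_type].append(filename)
    return {s: dict(t) for s, t in by_sample.items() if wanted <= set(t)}


selected = select_inputs(TYPED_LINES, "analyse_239", ["bam", "stats"])
for sample, files in selected.items():
    print(sample, files)
```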

This is a fairly major change in behaviour and not a top priority. It is just a thought at this point, but it could be useful as Cluster Flow gains more modules and becomes able to handle more complex pipelines.