ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Restructure dependencies #93

Closed (s-andrews closed this 8 years ago)

s-andrews commented 8 years ago

Probably something to tackle after v0.4, and I know we've mentioned this before, but I thought I'd write it down.

The way that dependencies are structured in clusterflow doesn't work as well as it should. At the moment dependencies are written into pipelines and are checked when the pipeline is run.

Dependencies should be part of the module API, and that's the right place to check them. Pipelines are just a collection of modules, so you shouldn't need to put anything into the pipelines. This would then make things work whether you're running a pipeline or a single module, and it would mean you don't have to update the pipeline if something changes in a module.

The mechanism for dependencies could also be simplified. At the moment you're putting explicit fixed dependencies down, which I guess makes life easier if you want to read through what's needed, but a simpler solution would be to do a sanity-check run of each module and give it a chance to object. This would be an API call where you pass it the config for the run and let it generate errors; you then present these back to the user before running the pipeline. The module could then generate errors in context, without being limited in the combinations and conditions it can check for.
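
A rough sketch of that sanity-check idea, in Python purely for illustration. None of these names exist in Cluster Flow: `check_fastqc`, `check_star`, `CHECKS`, `preflight`, and the `genome` / `indexes` config keys are all made up, and a real version would presumably live in the module API itself.

```python
# Hypothetical sketch: every name and config key here is an assumption,
# not part of Cluster Flow's actual module API.

def check_fastqc(config):
    """FastQC needs nothing beyond its input files, so it never objects."""
    return []


def check_star(config):
    """Object if the run config lacks a genome or a STAR index for it."""
    errors = []
    if not config.get("genome"):
        errors.append("no genome reference specified")
    elif "star" not in config.get("indexes", {}):
        errors.append(
            f"genome '{config['genome']}' has no STAR index in the configuration"
        )
    return errors


# Each module registers its own check; pipelines carry no dependency info.
CHECKS = {"fastqc": check_fastqc, "star": check_star}


def preflight(module_names, config):
    """Run every module's check and collect the errors before submitting."""
    problems = {}
    for name in module_names:
        errors = CHECKS[name](config)
        if errors:
            problems[name] = errors
    return problems


if __name__ == "__main__":
    # A pipeline is just a list of modules; a single module is a list of one.
    for name, errors in preflight(["fastqc", "star"], {}).items():
        for message in errors:
            print(f"{name}: {message}")
```

Because the check belongs to the module, the same `preflight` call covers a whole pipeline or a single module, and a change to what a module needs never touches the pipeline files.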

What do you think?

ewels commented 8 years ago

When you say dependencies, do you mean reference genomes? I think that this is already implemented in the new code that I merged back to you, e.g. see commit a2f78c23c1bb30071208ec7595c575cbc3f74e96 (dependency listed in the hisat2 module here).

s-andrews commented 8 years ago

Haven't looked at the code, but the pipelines in the latest version still have dependencies listed at the top - these shouldn't be needed if they're all handled in the modules. Is that just leftover syntax?

ewels commented 8 years ago

Do you have an example of such a pipeline? The files I'm looking at just have the help text and pipeline steps. e.g. the fastq_star pipeline looks like this:

/*
------------------------
FastQ to STAR Pipeline
------------------------
This pipeline takes FastQ files as input, runs FastQC, Trim Galore,
then aligns with STAR. The module requires a genome reference with a
corresponding STAR index base in the configuration.
*/

#fastqc
#trim_galore
    #star

s-andrews commented 8 years ago

Yeah - about that.

I looked at the first one (bam_preseq) and saw:

@require_python_package ngi_visualizations

..which again is something which applies to a module rather than a pipeline, but now I've looked properly I can see all the rest has gone.

I'm a bit of a muppet some days.

ewels commented 8 years ago

Ah yeah, I can see why you would be confused! That's a bit of a relic from when I was keen on getting Cluster Flow to run Python. I've gone off the idea a bit since then, mostly as preseq_plot.cfmod.py never really worked properly and required a SciLifeLab plotting library which isn't very portable.

This module is especially redundant with the arrival of MultiQC. I'll add an issue to remove this code.

ewels commented 8 years ago

See #94