ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

Clusterflow with MultiQC #70

Closed darogan closed 8 years ago

darogan commented 8 years ago

Would it be possible to add support for running MultiQC after clusterflow has processed a full set of samples? As I understand it, clusterflow already has some awareness of when a pipeline has processed all samples, since it sends an email at that point.

ewels commented 8 years ago

Great question, and thankfully an easy answer - yes! One of the new features in v0.4 is "summary modules", which are modelled on the e-mail setup you mention. There aren't really any docs about them yet, but they're described in #8 - you can see an example in the fastq_bismark pipeline which runs the bismark summary report module (a predecessor to MultiQC).

So, what we'd need to do is to create a new module for MultiQC and then add it to the end of relevant pipelines, prefixed with a > character.
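As a rough sketch of that last step, appending a summary module to a pipeline definition could look something like the following. Only the `>` prefix comes from the description above; the `#` prefix for ordinary modules, the tab-indented layout of the example pipeline, the module name `multiqc` and the helper `add_summary_module` are all assumptions made for illustration.

```python
def add_summary_module(pipeline_text: str, module: str = "multiqc") -> str:
    """Append a '>'-prefixed summary module line unless it is already present.

    The module name 'multiqc' is an assumed name, not confirmed by the thread.
    """
    summary_line = f">{module}"
    lines = pipeline_text.splitlines()
    # Only add the line once, so the helper is safe to call repeatedly.
    if summary_line not in (line.strip() for line in lines):
        lines.append(summary_line)
    return "\n".join(lines) + "\n"


# Hypothetical pipeline body: '#' lines stand for per-sample modules (assumed syntax).
pipeline = "#fastqc\n#trim_galore\n\t#bismark_align\n"
print(add_summary_module(pipeline), end="")
```

Because the helper checks for an existing summary line first, running it twice over the same pipeline text leaves the text unchanged.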

I was also thinking of adding a config option to allow prefix and suffix pipelines - these would be automatically added to any other pipeline every time Cluster Flow is launched. This would be an ideal way to get tools like FastQC and MultiQC to always run, irrespective of the pipeline. Not sure if this is worth the effort however, as I guess it's pretty easy to add these things to pipelines.

darogan commented 8 years ago

Great, we have commissioned @s-andrews to write a cf pipeline similar to fastq_bismark, ending with a MultiQC summary. I'll pass on these details.

ewels commented 8 years ago

Cool! I've just written a module for MultiQC and it seems to work from a quick test. So it should be ready to go now.

Phil

ewels commented 8 years ago

I'll leave this open for now until we've run it successfully a few times. Then I'll add it to some pipelines (all pipelines?) and close the issue.

s-andrews commented 8 years ago

Turning on my grumpy mode - I'm not sure it would be great to add it to all pipelines. It's not always going to be the case that the full set of sequences you'd want to summarise will be in the same run, so you'd have to clean up the MultiQC stuff before running it again at the end, when you actually had everything processed.

Turning on my feature request mode - what we really need is a module / pipeline repository rather than a fixed set of pipelines distributed with the program, so we can flexibly bring in the pipelines which make sense :-)

ewels commented 8 years ago

Yeah, I guess - this has crossed my mind before, but we can just run MultiQC with -f force mode and it will overwrite what was there before. So it's not really any extra work. In practice that's how I've been using it - rerunning MultiQC each time I try a new step of analysis or add any samples.
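For anyone scripting this, a minimal sketch of a forced rerun might look like the following. The `-f` flag is the MultiQC force option mentioned above; the helper names `multiqc_command` and `run_multiqc` are made up for this example, and this is not how the Cluster Flow module itself is written.

```python
import shutil
import subprocess
from typing import List, Optional


def multiqc_command(directories: List[str], force: bool = True) -> List[str]:
    """Build the MultiQC argument list for the given directories."""
    cmd = ["multiqc"]
    if force:
        cmd.append("-f")  # force mode: overwrite any report from a previous run
    cmd.extend(directories)
    return cmd


def run_multiqc(directories: List[str], force: bool = True) -> Optional[int]:
    """Run MultiQC if it is installed; return its exit code, or None if it is missing."""
    cmd = multiqc_command(directories, force)
    if shutil.which("multiqc") is None:
        print("multiqc not found on PATH; would have run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=False).returncode


# Show the command for the current working directory without executing it.
print(" ".join(multiqc_command(["."])))  # multiqc -f .
```

Calling `run_multiqc(["."])` after each new analysis step would then regenerate the report in place rather than requiring a manual clean-up first.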

And I entirely agree about a module / pipeline repository. I'll create a new issue to collect ideas on that :)

s-andrews commented 8 years ago

So how are you collecting the files to add to the report? Are you just picking up everything in a single directory? If so, where do you get that directory from? What happens if you process batches of files from multiple directories through a pipeline?

ewels commented 8 years ago

Yes - it just crawls whatever directory you give to it (recursively) and produces a report with whatever it finds. For the Cluster Flow module I specified the current working directory, which should pick up the Cluster Flow run. You can specify multiple directories as well if you want to.

For batches of files I guess you're talking about shared sample names? In which case you can specify the -d parameter, which prepends the directory to the sample name to keep names unique. Otherwise samples with the same name get overwritten. If you're talking about subdirectories then that should be fine, as it traverses the file system recursively and so should find everything.
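To make the name clash concrete, here is a small illustrative sketch of why prepending the directory helps. It is not MultiQC's own code: the function `collect_samples`, the way the sample name is derived from the file name, the use of only the immediate parent directory, and the `" | "` separator are all assumptions for the example.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def collect_samples(log_paths: Iterable[str], prepend_dirs: bool = False,
                    sep: str = " | ") -> Dict[str, str]:
    """Map sample name -> log path; a repeated name overwrites the earlier entry."""
    samples: Dict[str, str] = {}
    for raw in log_paths:
        path = PurePosixPath(raw)
        name = path.name.split(".")[0]  # 'sample1.log' -> 'sample1'
        if prepend_dirs:
            # Mimic the idea behind -d: qualify the name with its directory.
            name = f"{path.parent.name}{sep}{name}"
        samples[name] = raw
    return samples


logs = ["batch_a/sample1.log", "batch_b/sample1.log"]
print(len(collect_samples(logs)))                        # 1
print(sorted(collect_samples(logs, prepend_dirs=True)))  # ['batch_a | sample1', 'batch_b | sample1']
```

Without the prefix the second `sample1` silently replaces the first, which is the overwriting behaviour described above; with it, both batches survive as separate entries.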

The Cluster Flow module repository idea is now in #71.

ewels commented 8 years ago

Closing this issue for now, as the CF MultiQC module is written. Please reopen if any bugs are found.