ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.
https://ewels.github.io/clusterflow/
GNU General Public License v3.0

ChIP-seq pipeline: removing reads in blacklist regions #80

Closed orzechoj closed 8 years ago

orzechoj commented 8 years ago

Added a module to remove reads in blacklist regions (e.g. from ENCODE), which uses bedtools intersect. (Also raised the memory for cf_merge.)
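For readers unfamiliar with the approach, here is a minimal sketch of how blacklist filtering with `bedtools intersect` is commonly done. It is not the module's actual code, which is not shown in this thread. The function names and file names are hypothetical; the `-v` flag (keep only entries in A with no overlap in B) and the `-abam` / `-b` options are standard bedtools options.

```python
import subprocess


def build_blacklist_filter_cmd(input_bam, blacklist_bed):
    """Build a bedtools command that keeps only reads NOT overlapping the blacklist.

    Hypothetical helper for illustration; not taken from the Cluster Flow module.
    """
    return [
        "bedtools", "intersect",
        "-v",                 # report entries in A with no overlap in B
        "-abam", input_bam,   # input alignments (BAM); output is BAM as well
        "-b", blacklist_bed,  # regions to exclude, e.g. an ENCODE blacklist
    ]


def run_blacklist_filter(input_bam, blacklist_bed, output_bam):
    """Run the filter, writing the surviving reads to output_bam."""
    cmd = build_blacklist_filter_cmd(input_bam, blacklist_bed)
    with open(output_bam, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)
```

Something like `run_blacklist_filter("sample.bam", "blacklist.bed", "sample_filtered.bam")` would then write the filtered alignments, assuming bedtools is on the `PATH`.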

ewels commented 8 years ago

Super cool! Any thoughts on using the genome reference config files to store the location for the blacklist file? Presumably it will always be the same for each species? So it could have its own reference type? eg:

@reference blacklist    GRCh37  /path/to/genomes/Human/GRCh37/blacklist/    Human   GRCh37
@reference blacklist    GRCm38  /path/to/genomes/Mouse/GRCm38/blacklist/    Mouse   GRCm38

Might be better than having to specify it as a param?

I've had this functionality in the back of my mind for a while now anyway, could be good in other pipelines too.. Thanks!

orzechoj commented 8 years ago

Hi,

Haven’t thought much about this..

I guess black list files don’t change much for an individual genome. But I could also see cases where you might use this module to remove reads using other files, e.g. from a few manually curated regions, from everything overlapping lncRNAs or something else.

But as long as it’s possible to use params to set other “black list” files, it might be a good idea to have a default option in the config.

cheers, Jakub

ewels commented 8 years ago

Yup - that could definitely work: start by looking for a param file, and if that's not found, look for a genome reference file instead (should just be a couple of extra lines?). I just prefer to keep species-specific stuff out of pipelines where possible, so that they can be used for any organism.

Phil
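The lookup order discussed above (param first, genome reference as the default) could be sketched roughly as follows. This is an illustration only, not Cluster Flow's implementation: the function names and the `blacklist` param key are assumptions, and the parser simply splits the proposed `@reference` lines on whitespace, so it would not handle paths containing spaces.

```python
def parse_references(config_text):
    """Collect '@reference <type> <key> <path> ...' lines into a dict.

    Hypothetical parser for the config lines proposed in this thread.
    Returns {(ref_type, genome_key): path}.
    """
    refs = {}
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "@reference":
            ref_type, key, path = fields[1], fields[2], fields[3]
            refs[(ref_type, key)] = path
    return refs


def resolve_blacklist(params, refs, genome):
    """Prefer an explicit param; otherwise fall back to the genome reference.

    Returns None when neither source provides a blacklist.
    """
    if params.get("blacklist"):  # 'blacklist' param name is an assumption
        return params["blacklist"]
    return refs.get(("blacklist", genome))


config = """
@reference blacklist    GRCh37  /path/to/genomes/Human/GRCh37/blacklist/    Human   GRCh37
@reference blacklist    GRCm38  /path/to/genomes/Mouse/GRCm38/blacklist/    Mouse   GRCm38
"""
refs = parse_references(config)
default_path = resolve_blacklist({}, refs, "GRCh37")
custom_path = resolve_blacklist({"blacklist": "curated_regions.bed"}, refs, "GRCh37")
```

With this ordering, a manually curated region file passed as a param overrides the per-genome default, while the pipeline itself stays free of species-specific paths.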

ewels commented 8 years ago

Thanks again @orzechoj! I'll add the minor suggestions.