Add @merge_regex option to config

ewels / clusterflow

A pipelining tool to automate and standardise bioinformatics analyses on cluster environments.

https://ewels.github.io/clusterflow/

GNU General Public License v3.0

Add @merge_regex option to config #38

Closed ewels closed 9 years ago

ewels commented 9 years ago

Would be cool if Cluster Flow could automatically recognise fastq files that should be merged before processing.

- [x] Add @merge_regex to config so that people can define their own filename matches
- [x] Write new module to merge fastq files (gzipped and normal)
- [x] Update core to look for matches
  - [x] Group files accurately, taking into account --split_files
  - [x] Prepend merging module onto pipeline

ewels commented 9 years ago

I think the core will need a more substantial update than just changing --split_files, as we don't know in advance how many files there will be to merge.

Maybe instead of looping through the list of files with

for (my $i = 0; $i <= $#files; $i++){
    # Add files to run file
    my $max_i = $i + $SPLIT_FILES;
    my $counter = 0;
    for (; $i < $max_i; $i++){
        ...
    }
}

we should build a hash of arrays of files, grouped first by @merge_regex and then by $SPLIT_FILES (which will count the merged files rather than the input files).
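A minimal sketch of that hash-of-arrays idea, to make the grouping concrete. It is not the code that ended up in the core: the `group_files` helper, the use of the first capture group as the merged name, and the example filenames and regex are all assumptions made for illustration. Files that match no regex (or when no regex is configured) fall through as single-file groups.

```perl
use strict;
use warnings;

# Hypothetical helper (not Cluster Flow's actual core code).
# Groups input files by the first capture group of the first matching
# merge regex, then chunks the resulting groups by $split_files, so that
# $split_files counts merged files rather than raw input files.
sub group_files {
    my ($files, $merge_regexes, $split_files) = @_;
    $split_files = 1 if !defined $split_files || $split_files < 1;

    my %merge_groups;   # group name => array of input files
    my @order;          # group names in first-seen order
    for my $file (@$files) {
        my $key = $file;    # default: an unmatched file is its own group
        for my $re (@$merge_regexes) {
            if ($file =~ $re) {
                # Assumption: the first capture group names the merged file
                $key = $1 if defined $1;
                last;
            }
        }
        push @order, $key unless exists $merge_groups{$key};
        push @{ $merge_groups{$key} }, $file;
    }

    # Chunk the groups: each run receives up to $split_files groups
    my @runs;
    while (my @chunk = splice(@order, 0, $split_files)) {
        push @runs, [ map { { name => $_, files => $merge_groups{$_} } } @chunk ];
    }
    return \@runs;
}

# Example usage with made-up filenames and a made-up regex
my $runs = group_files(
    [qw(sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz
        sample2_L001_R1.fastq.gz other.bam)],
    [qr/^(.+)_L\d{3}_R1\.fastq\.gz$/],
    2,
);
for my $i (0 .. $#$runs) {
    for my $group (@{ $runs->[$i] }) {
        printf "run %d: %s <- %s\n", $i + 1, $group->{name},
            join(', ', @{ $group->{files} });
    }
}
```

Keeping a separate `@order` array preserves the order in which groups were first seen, which a plain hash would not.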

ewels commented 9 years ago

Core code written, now testing.

ewels commented 9 years ago

Tested and happy.

ewels commented 9 years ago

Need to test the cases where the regex doesn't match anything, or where no regex is present.

ewels commented 9 years ago

Need to improve gzip concatenation, following a slide from @s-andrews:

cat seq1.fq.gz seq2.fq.gz > all1.fq.gz

Some decompressors (gzip, for example) will read all of the data from all1.fq.gz, but others (the Java GZIPInputStream class, for example) will not, and will silently finish at the end of the first concatenated file.
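One way to avoid the multi-member problem is to decompress each input and recompress everything into a single gzip stream. The sketch below does this with the core `IO::Uncompress::Gunzip` and `IO::Compress::Gzip` modules. The `merge_gzip_files` helper and its arguments are hypothetical, and this is not necessarily how the merging module was eventually fixed.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not Cluster Flow's actual module code).
# Writes all inputs into ONE gzip member, so readers that stop after the
# first member still see all of the data.
sub merge_gzip_files {
    my ($out_fn, @in_fns) = @_;

    my $out = IO::Compress::Gzip->new($out_fn)
        or die "Cannot open $out_fn for writing: $GzipError\n";

    for my $in_fn (@in_fns) {
        # MultiStream => 1 so that an input which is itself a
        # concatenation of gzip members is read in full
        my $in = IO::Uncompress::Gunzip->new($in_fn, MultiStream => 1)
            or die "Cannot open $in_fn for reading: $GunzipError\n";

        my $buffer;
        while (1) {
            my $status = $in->read($buffer, 64 * 1024);
            die "Error reading $in_fn: $GunzipError\n" if $status < 0;
            last if $status == 0;    # end of input
            $out->print($buffer)
                or die "Error writing $out_fn: $GzipError\n";
        }
        $in->close();
    }

    $out->close() or die "Error closing $out_fn: $GzipError\n";
    return $out_fn;
}

# Usage (filenames taken from the example above):
# merge_gzip_files('all1.fq.gz', 'seq1.fq.gz', 'seq2.fq.gz');
```

The trade-off is extra CPU time for recompression compared with plain `cat`, in exchange for output that any gzip reader should handle.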

ewels commented 9 years ago

Done a bunch of testing and bugfixing, happy that this is fairly stable now.