Parallelise "filter reads" step in Makefile

lweasel commented 9 years ago

At the moment, the filter reads step calls a Bash script, "filter_reads", which then calls the "filter_sample_reads" Python script for each sample, to do species separation on the mapped reads for that sample.

One way of changing this might be to:

1. alter "filter_reads" so that, for each sample, it first splits the mapped reads BAM file for each species into multiple (roughly equally sized?) name-ordered BAM files in a temporary directory. It may be possible to do this with the --filter option of "sambamba view", although that introduces a dependency on sambamba. This is probably going to be the hardest part of the process to implement.
2. call a Python script that uses the subprocess module to launch one instance of the "filter_sample_reads" Python script per chunk of the input BAM files, then polls these processes to work out when they have all finished.
3. concatenate the filtered BAM files from all input chunks, for each species, then tidy up the temporary directory.
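
The launch-and-poll step could look roughly like the sketch below. It is only an illustration of the subprocess approach described above: the function name `run_chunks_in_parallel` and the `poll_interval` parameter are made up here, and the demo uses stand-in commands because the real arguments of "filter_sample_reads" are not given in this thread.

```python
import subprocess
import sys
import time


def run_chunks_in_parallel(commands, poll_interval=0.5):
    """Launch one process per command and poll until all have exited.

    `commands` is a list of argument lists, one per chunk.
    Returns the exit codes in the same order as `commands`.
    """
    processes = [subprocess.Popen(cmd) for cmd in commands]
    # Popen.poll() returns None while a process is still running.
    while any(proc.poll() is None for proc in processes):
        time.sleep(poll_interval)
    return [proc.returncode for proc in processes]


if __name__ == "__main__":
    # Stand-in commands (assumption): each one just exits with a given code.
    # In practice each entry would invoke the filter script on one chunk.
    demo = [
        [sys.executable, "-c", "import sys; sys.exit(%d)" % code]
        for code in (0, 0, 3)
    ]
    print(run_chunks_in_parallel(demo, poll_interval=0.1))
```

Returning the exit codes lets the caller check that every chunk was filtered successfully before concatenating the outputs and deleting the temporary directory.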

s-heron commented 9 years ago

The block splitting has been implemented. On a test sample (1A1), splitting into 4 blocks for both species took 13m30.560s. Several minutes could be shaved off by terminating the sambamba file stream once each start ID has been extracted, but I haven't found a workable way to do this. Parallelised execution of filtering on the blocks will be implemented after the new filter script has been written.
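
One possible way to stop a stream early, if the start IDs are read from Python, is sketched below. This is an assumption about how the splitting could be driven, not a description of the current implementation: `read_until_found` is a hypothetical helper, and it works on any command that writes one record per line to stdout.

```python
import subprocess
import sys


def read_until_found(command, wanted_indices):
    """Stream a command's stdout and stop it once all wanted lines are seen.

    `wanted_indices` are 0-based line numbers. Returns {index: line}.
    """
    remaining = set(wanted_indices)
    found = {}
    if not remaining:
        return found
    proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
    try:
        for index, line in enumerate(proc.stdout):
            if index in remaining:
                found[index] = line.rstrip("\n")
                remaining.discard(index)
                if not remaining:
                    break  # nothing left to extract, so stop reading
    finally:
        # Stop the producer instead of letting it run to the end of the file.
        if proc.poll() is None:
            proc.terminate()
        proc.stdout.close()
        proc.wait()
    return found


if __name__ == "__main__":
    # Stand-in producer (assumption): an endless stream of fake read names.
    producer = [
        sys.executable,
        "-c",
        "import itertools\nfor i in itertools.count(): print('read%d' % i)",
    ]
    print(read_until_found(producer, [0, 5, 10]))
```

If a wanted index lies beyond the end of the stream, it is simply missing from the returned dictionary, so the caller should check that every requested index came back.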

s-heron commented 9 years ago

Wrote and committed the parallelisation control script, filter_control.py.
