cancerit / cgpPindel

Cancer Genome Project Insertion/Deletion detection pipeline based around Pindel
http://cancerit.github.io/cgpPindel/
GNU Affero General Public License v3.0

input generation: high depth region speedup #75

Closed: keiranmraine closed this 5 years ago

keiranmraine commented 5 years ago

feature/intervalTreeSpeedup

GRCh38 WGS data was taking exceptionally long to complete the input generation phase, to the extent that CaVEMan using 60 CPUs would complete before Pindel (input generation being capped at 4 CPUs per input).

TL;DR

Cause

Using tabix to exclude reads anchored in regions of high depth.

Although tabix is ideal for testing whether an overlap occurs, it is not always efficient to do so. Even though the check is only attempted on a candidate read that passes all other checks, it is still very heavy (illustrated in the sketch after this list). The reasons for this are:

  1. If there are records in the tabix file for the queried region, the relevant block of the gz file is read, decompressed and passed up from the C layer to the Perl layer for every read, even if the last request hit the same block.
  2. In regions anticipated to have high depth there are many more reads.
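
To make the cost concrete, here is a minimal sketch of the per-read check pattern, assuming the Bio::DB::HTS::Tabix bindings; the file name and subroutine are illustrative, not the actual cgpPindel code.

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::HTS::Tabix;

# Open the bgzipped, tabix-indexed high-depth BED once.
my $tabix = Bio::DB::HTS::Tabix->new(filename => 'HiDepth.bed.gz');

# Called once per candidate read; any overlapping record means "exclude".
# Each query() can force a bgzf block to be read, decompressed and copied
# from the C layer into Perl, even when consecutive reads hit the same block.
sub anchored_in_hidepth {
    my ($chr, $start, $end) = @_;
    my $iter = $tabix->query(sprintf '%s:%d-%d', $chr, $start, $end);
    return 0 unless defined $iter;
    return defined $iter->next ? 1 : 0;
}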

Solution

The data to check for overlaps against is actually very small. Instead of using the tabix bindings to verify that a read doesn't overlap, I've loaded the raw data into a set of IntervalTrees (sketched below).

Testing the memory footprint of the data loaded into the IntervalTrees found the overhead to be ~1MB (input file 33K compressed / 144K uncompressed).
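
A minimal sketch of the replacement approach, assuming Set::IntervalTree from CPAN (subroutine and file names illustrative, not the exact cgpPindel code): load the small HiDepth BED once, then every per-read check is an in-memory lookup.

#!/usr/bin/env perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use Set::IntervalTree;

# Load the (small) high-depth BED into one interval tree per chromosome.
my %tree_for;
my $z = IO::Uncompress::Gunzip->new('HiDepth.bed.gz')
    or die "gunzip HiDepth.bed.gz failed: $GunzipError";
while (my $line = <$z>) {
    chomp $line;
    my ($chr, $start, $end) = split /\t/, $line;  # BED: 0-based, half-open
    $tree_for{$chr} //= Set::IntervalTree->new;
    $tree_for{$chr}->insert(1, $start, $end);     # payload unused; only overlap matters
}
$z->close;

# Called once per candidate read: a pure in-memory lookup, no decompression.
sub anchored_in_hidepth {
    my ($chr, $start, $end) = @_;
    my $tree = $tree_for{$chr} or return 0;
    return scalar @{ $tree->fetch($start, $end) } ? 1 : 0;
}

This trades a small fixed memory cost (the ~1MB measured above) for removing the repeated block decompression and C-to-Perl copying on every read.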

Testing

Tested on GRCh38 WXS input BAM of 8.3GB with commands:

# Inputs on lustre but ensure no first read bias
$ cat GRCh38_wxs_6798_PD7261a.bam > /dev/null

# original version
$ /usr/bin/time -v -o pindel_orig.time perl ~/GitHub/cgpPindel_orig/perl/bin/pindel_input_gen.pl -b GRCh38_wxs_6798_PD7261a.bam -o pindel_orig -r dockstore_ref/human_GRCh38/extracted/genome.fa -t 1 -e dockstore_ref/human_GRCh38/extracted/pindel/HiDepth.bed.gz >& pindel_orig.log &

# updated version
$ /usr/bin/time -v -o pindel_new.time perl ~/GitHub/cgpPindel/perl/bin/pindel_input_gen.pl -b GRCh38_wxs_6798_PD7261a.bam -o pindel_new -r dockstore_ref/human_GRCh38/extracted/genome.fa -t 1 -e dockstore_ref/human_GRCh38/extracted/pindel/HiDepth.bed.gz >& pindel_new.log &

Standard farm node (32 CPUs, 256GB RAM), clear of other users/processes.

Timings

Normally 1 million reads are processed in ~30 seconds (on the hardware used in this test); in regions of high depth this can spike to over 200s for this WXS data, and in WGS data often exceeds 600s.

From /usr/bin/time -v, trimmed to the useful information:

Old

User time (seconds): 7253.41
System time (seconds): 83.32
Percent of CPU this job got: 102%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:59:30
Maximum resident set size (kbytes): 4098176
Exit status: 0

New

User time (seconds): 3888.01
System time (seconds): 85.12
Percent of CPU this job got: 104%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:03:23
Maximum resident set size (kbytes): 4084816
Exit status: 0
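
For reference, 1:59:30 is 7,170s and 1:03:23 is 3,803s of wall time, so the new version completes this WXS input roughly 1.9x faster; user time shows a similar ratio (7253s vs 3888s, ~1.87x).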

Result comparison

$ ls -1 pindel_orig/ | xargs -tI {} zdiff -qs pindel_orig/{} pindel_new/{}
zdiff -qs pindel_orig/chr10.txt.gz pindel_new/chr10.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr11.txt.gz pindel_new/chr11.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr12.txt.gz pindel_new/chr12.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr13.txt.gz pindel_new/chr13.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr14.txt.gz pindel_new/chr14.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr15.txt.gz pindel_new/chr15.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr16.txt.gz pindel_new/chr16.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr17.txt.gz pindel_new/chr17.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr18.txt.gz pindel_new/chr18.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr19.txt.gz pindel_new/chr19.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr1.txt.gz pindel_new/chr1.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr20.txt.gz pindel_new/chr20.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr21.txt.gz pindel_new/chr21.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr22.txt.gz pindel_new/chr22.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr2.txt.gz pindel_new/chr2.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr3.txt.gz pindel_new/chr3.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr4.txt.gz pindel_new/chr4.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr5.txt.gz pindel_new/chr5.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr6.txt.gz pindel_new/chr6.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr7.txt.gz pindel_new/chr7.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr8.txt.gz pindel_new/chr8.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr9.txt.gz pindel_new/chr9.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chrM.txt.gz pindel_new/chrM.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chrX.txt.gz pindel_new/chrX.txt.gz 
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chrY.txt.gz pindel_new/chrY.txt.gz 
Files /dev/fd/5 and - are identical

Results are identical.

keiranmraine commented 5 years ago

Additional info:

A WGS pair's wall time reduced from 31:46:27 to 16:37:36 (h:mm:ss) for the pindel.pl execution, roughly a 1.9x speed-up.

AndyMenzies commented 5 years ago

Changes all look reasonable to me. The speed-up looks great.