GRCh38 WGS data was taking exceptionally long to complete the input generation phase, to the extent that CaVEMan using 60 CPUs would complete before pindel (whose input generation is capped at 4 CPUs per input).
TL;DR
The new method is ~47% quicker (wall clock 1:59:30 -> 1:03:23; user time 7253s -> 3888s).
Results are identical (WXS test).
Negligible memory impact (actually less memory).
Cause
Use of tabix to exclude reads anchored in regions of high depth.
Although tabix is ideal for testing whether an overlap occurs, it is not always efficient to do so. Even though the lookup is only attempted on a candidate read that has passed all other checks, it is still very heavy (see the sketch after this list). The reasons for this are:
If there are records in the tabix file for the queried region, the relevant block of the gz file is read, decompressed and passed up from the C layer to the Perl layer... for every read (even if the last request was for the same region).
In regions anticipated to have high depth there are many more reads.
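For illustration only, a minimal sketch of the per-read lookup pattern described above, assuming the Bio::DB::HTS::Tabix bindings (the module and exact calls used in cgpPindel may differ); every call can seek into the bgzipped BED and decompress a block even when consecutive reads query the same region:

use strict;
use warnings;
use Bio::DB::HTS::Tabix;

# Hypothetical illustration of the costly pattern: one tabix query per candidate read.
my $tabix = Bio::DB::HTS::Tabix->new(filename => 'HiDepth.bed.gz');

sub read_in_hidepth {
    my ($chr, $start, $end) = @_;
    # Every call reaches into the compressed file and marshals any records
    # up from C to Perl, regardless of what the previous read asked for.
    my $iter = $tabix->query(sprintf '%s:%d-%d', $chr, $start, $end);
    return 0 unless defined $iter;
    return defined $iter->next ? 1 : 0;
}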
Solution
The data to check for overlaps is actually very small, so instead of using the tabix bindings to verify that a read doesn't overlap, I've loaded the raw data into a set of IntervalTrees.
Testing the memory footprint of the data loaded into IntervalTrees found the overhead to be ~1MB (input file 33K compressed / 144K uncompressed).
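As a rough sketch of the approach (assuming the CPAN module Set::IntervalTree; the module and field layout actually used in cgpPindel may differ), the high-depth BED is loaded once into one tree per chromosome, after which each overlap check is a pure in-memory lookup:

use strict;
use warnings;
use Set::IntervalTree;

# Load the (small) high-depth BED into one interval tree per chromosome.
# Hypothetical loader for illustration; the real loading code may differ.
sub load_hidepth_trees {
    my ($bed_gz) = @_;
    my %trees;
    open my $fh, '-|', 'gunzip', '-c', $bed_gz or die "Can't read $bed_gz: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($chr, $start, $end) = split /\t/, $line;
        $trees{$chr} //= Set::IntervalTree->new;
        $trees{$chr}->insert(1, $start, $end);    # stored value unused, only overlap matters
    }
    close $fh;
    return \%trees;
}

# Overlap test without touching the compressed file or the C/Perl boundary.
sub read_in_hidepth {
    my ($trees, $chr, $start, $end) = @_;
    my $tree = $trees->{$chr} or return 0;
    return scalar @{ $tree->fetch($start, $end) } ? 1 : 0;
}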
Testing
Tested on a GRCh38 WXS input BAM of 8.3GB with the following commands:
# Inputs on lustre but ensure no first read bias
$ cat GRCh38_wxs_6798_PD7261a.bam > /dev/null
# original version
$ /usr/bin/time -v -o pindel_orig.time perl ~/GitHub/cgpPindel_orig/perl/bin/pindel_input_gen.pl -b GRCh38_wxs_6798_PD7261a.bam -o pindel_orig -r dockstore_ref/human_GRCh38/extracted/genome.fa -t 1 -e dockstore_ref/human_GRCh38/extracted/pindel/HiDepth.bed.gz >& pindel_orig.log &
# updated version
$ /usr/bin/time -v -o pindel_new.time perl ~/GitHub/cgpPindel/perl/bin/pindel_input_gen.pl -b GRCh38_wxs_6798_PD7261a.bam -o pindel_new -r dockstore_ref/human_GRCh38/extracted/genome.fa -t 1 -e dockstore_ref/human_GRCh38/extracted/pindel/HiDepth.bed.gz >& pindel_new.log &
Run on a standard farm node (32 CPUs, 256GB RAM), clear of other users/processes.
Timings
Normally 1 million reads are processed in ~30 seconds (on the hardware used in this test); in regions of high depth this can spike to over 200s for this WXS data, and >600s is often seen in WGS data.
From /usr/bin/time -v, trimmed to the useful information:
Old
User time (seconds): 7253.41
System time (seconds): 83.32
Percent of CPU this job got: 102%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:59:30
Maximum resident set size (kbytes): 4098176
Exit status: 0
New
User time (seconds): 3888.01
System time (seconds): 85.12
Percent of CPU this job got: 104%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:03:23
Maximum resident set size (kbytes): 4084816
Exit status: 0
New method is ~47% quicker.
Memory usage is actually slightly lower (?!):
The tabix index is held in memory by the original code.
Decompressed, this *.gz.tbi is ~1MB, so swapping it for ~1MB of IntervalTrees is roughly cost-neutral.
Result comparison
$ ls -1 pindel_orig/ | xargs -tI {} zdiff -qs pindel_orig/{} pindel_new/{}
zdiff -qs pindel_orig/chr10.txt.gz pindel_new/chr10.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr11.txt.gz pindel_new/chr11.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr12.txt.gz pindel_new/chr12.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr13.txt.gz pindel_new/chr13.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr14.txt.gz pindel_new/chr14.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr15.txt.gz pindel_new/chr15.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr16.txt.gz pindel_new/chr16.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr17.txt.gz pindel_new/chr17.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr18.txt.gz pindel_new/chr18.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr19.txt.gz pindel_new/chr19.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr1.txt.gz pindel_new/chr1.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr20.txt.gz pindel_new/chr20.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr21.txt.gz pindel_new/chr21.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr22.txt.gz pindel_new/chr22.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr2.txt.gz pindel_new/chr2.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr3.txt.gz pindel_new/chr3.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr4.txt.gz pindel_new/chr4.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr5.txt.gz pindel_new/chr5.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr6.txt.gz pindel_new/chr6.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr7.txt.gz pindel_new/chr7.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr8.txt.gz pindel_new/chr8.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chr9.txt.gz pindel_new/chr9.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chrM.txt.gz pindel_new/chrM.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chrX.txt.gz pindel_new/chrX.txt.gz
Files /dev/fd/5 and - are identical
zdiff -qs pindel_orig/chrY.txt.gz pindel_new/chrY.txt.gz
Files /dev/fd/5 and - are identical