The outer parallelization of calling variants with the pileup model using GNU parallel is inefficient for targeted amplicon BAM files, where large portions of contigs have no supporting reads. In this case the outer parallelization strategy spins up many processes that consume CPU but accomplish no work, because they are assigned regions of a contig with no reads to process.
This outer parallelization is also inefficient for small targeted amplicon panels, where we tend to run analysis on many samples in parallel and therefore want each sample to occupy as few threads as possible. When many of a sample's threads do nothing, the overhead is especially costly: the sample makes no progress for long periods while still holding threads that other samples could use.
To address this, I've added new behavior that's invoked by setting chunk_num to -1 (0 already had a special meaning). In this mode, outer parallelization is disabled and replaced with inner parallelization within tensorflow, and all candidates are batched within a single process. If a bed file is provided, we achieve a further speedup by only searching for candidates in the regions the bed file defines.
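To make the dispatch concrete, here is a minimal sketch, not the project's actual code, of how the chunk_num == -1 branch might look. The tensorflow threading calls are the real TF2 API; find_candidates, encode_pileups, whole_genome_regions, and run_outer_parallel are hypothetical placeholders for the existing machinery.

```python
import tensorflow as tf

def load_bed_regions(bed_path):
    """Parse a BED file into (contig, start, end) tuples."""
    regions = []
    with open(bed_path) as fh:
        for line in fh:
            if line.strip() and not line.startswith(("#", "track", "browser")):
                contig, start, end = line.split("\t")[:3]
                regions.append((contig, int(start), int(end)))
    return regions

def call_variants(bam, model, chunk_num, threads, bed=None):
    if chunk_num == -1:
        # New behavior: no GNU parallel worker processes; let tensorflow
        # parallelize internally across the requested thread count.
        tf.config.threading.set_intra_op_parallelism_threads(threads)
        tf.config.threading.set_inter_op_parallelism_threads(threads)

        # With a bed file, only search its regions for candidates, so
        # contigs with no supporting reads are never scanned.
        regions = load_bed_regions(bed) if bed else whole_genome_regions(bam)

        # Batch every candidate through a single process and model instance
        # (find_candidates and encode_pileups are hypothetical helpers).
        candidates = [c for region in regions
                      for c in find_candidates(bam, region)]
        return model.predict(encode_pileups(candidates), batch_size=512)

    # Original behavior: split contigs into chunks and fan out across
    # processes with GNU parallel.
    return run_outer_parallel(bam, model, chunk_num, threads)
```

The point of batching everything through one process is that the tensorflow thread pool stays busy whenever there is any work at all, instead of threads idling in workers assigned empty regions.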
Using an in-house targeted amplicon bed file for profiling, I compared the original chunk_num==0 behavior to the new behavior in both the single-threaded and threads=4 cases. Note that the runtimes cover the entire analysis, not just the optimized pileup process.
| chunk_num | bed provided | threads | wall clock execution time |
|-----------|--------------|---------|---------------------------|
| 0         | no           | 1       | 9m                        |
| -1        | no           | 1       | 1m 19s                    |
| -1        | yes          | 1       | 43s                       |
| 0         | no           | 4       | 2m 47s                    |
| -1        | no           | 4       | 51s                       |
| -1        | yes          | 4       | 26s                       |
Note that chunk_num==-1 is not appropriate for whole genome analysis because it uses too much RAM. I have an additional commit that fixes that issue, but the old behavior is still significantly faster than the new behavior when coverage is broad enough for the GNU parallel threads to be fully utilized.
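For illustration only, one way to bound memory in this mode is to stream candidates through fixed-size batches rather than materializing every feature array at once. This sketch is an assumption about the general shape of such a fix, not the contents of that commit; candidate_stream and encode_pileups are hypothetical.

```python
def predict_streaming(model, candidate_stream, batch_size=512):
    """Yield model predictions over fixed-size batches of candidates,
    so memory use is bounded by batch_size rather than genome size."""
    batch = []
    for candidate in candidate_stream:
        batch.append(candidate)
        if len(batch) == batch_size:
            yield model.predict(encode_pileups(batch), verbose=0)
            batch = []
    if batch:  # flush the final partial batch
        yield model.predict(encode_pileups(batch), verbose=0)
```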
I've made this change only for the Cffi portion of the code because that's what we use.
fixes #306