CSB5 / lofreq

LoFreq Star: Sensitive variant calling from sequencing data
http://csb5.github.io/lofreq/

Approximation shortcut #107

Closed bkille closed 3 years ago

bkille commented 3 years ago

Hello!

This is a PR for an additional heuristic in the LoFreq algorithm. The Poisson binomial calculation often yields p-values that are either extremely high or extremely low, so we can use an approximation to skip the exact calculation when the p-value is sufficiently far above the significance level. By requiring the approximate p-value to exceed the significance level by some threshold, we can be extremely confident that the shortcut never skips the exact calculation on columns with significant p-values.

We use a Poisson approximation to the Poisson binomial distribution, where the mean is the sum of the per-read probabilities and the observed count (the most frequent variant count) stays the same. There are other approximations that we plan on trying, such as the refined normal approximation (see section 3 of this pub for a brief explanation of the approximations). To calculate the Poisson CDF, we use the GNU Scientific Library. I have not added a --with-gsl flag to configure (I do not understand .m4 wizardry yet :slightly_smiling_face:), so depending on where GSL is installed, users may need to add the location of the libopenblas.so.0 shared library to their LD_LIBRARY_PATH environment variable in order to run.
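
A minimal standalone sketch of the idea (variable names like `poibin_poisson_approx` and `probvec` are illustrative; the actual PR code differs): sum the per-read error probabilities into a Poisson mean and evaluate the upper tail with GSL's `gsl_cdf_poisson_Q`.

```c
/* Standalone sketch of the Poisson-approximation shortcut (names and
 * structure are illustrative, not the actual PR code).
 * Build with something like: gcc sketch.c -lgsl -lgslcblas -lm */
#include <stdio.h>
#include <gsl/gsl_cdf.h>

/* Approximate the Poisson binomial upper tail P(X >= k) with a Poisson
 * tail whose mean mu is the sum of the per-read error probabilities. */
static double poibin_poisson_approx(const double *probvec, int n, int k)
{
    double mu = 0.0;
    for (int i = 0; i < n; i++) {
        mu += probvec[i];
    }
    if (k <= 0) {
        return 1.0; /* P(X >= 0) is trivially 1 */
    }
    /* gsl_cdf_poisson_Q(k, mu) = P(X > k), so P(X >= k) = Q(k-1, mu) */
    return gsl_cdf_poisson_Q((unsigned int)(k - 1), mu);
}

int main(void)
{
    /* toy column: 200 reads with a 1% error probability each, and 4
     * reads supporting the most frequent variant */
    enum { N = 200 };
    double probvec[N];
    int k = 4;
    double sig_level = 0.01;

    for (int i = 0; i < N; i++) {
        probvec[i] = 0.01;
    }

    double approx = poibin_poisson_approx(probvec, N, k);

    /* only shortcut deep columns, and only when the approximate
     * p-value clears the significance level by a safety margin */
    if (N >= 100 && approx > sig_level + 0.05) {
        printf("approx p = %.4f: clearly non-significant, skip exact calc\n",
               approx);
    } else {
        printf("approx p = %.4f: fall through to exact Poisson binomial\n",
               approx);
    }
    return 0;
}
```

Note the two guards: the shortcut only runs on deep columns (n >= 100) and only when the approximation clears the significance level by a comfortable margin, so a borderline column always falls through to the exact calculation.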

bkille commented 3 years ago

Hello again! Just checking in. I have run LoFreq on a set of BAM files of varying sizes; the results are below. Diffing the output files shows identical variant calls. I have also increased the cutoff for the p-value (was sigma + 0.01, now sigma + 0.05) since it has no noticeable effect on the runtime. The optimization has little effect on the smaller files: n and k are fairly small there, and the shortcut doesn't even run if n < 100. Please let me know if you have any questions!

| File Size | Original Runtime (seconds) | Improved Runtime (seconds) | Speedup |
| --- | --- | --- | --- |
| 41K | 0.03 | 0.03 | 1.0 |
| 415K | 0.24 | 0.24 | 1.0 |
| 644K | 1.69 | 1.69 | 1.0 |
| 1M | 6.33 | 6.3 | 1.0 |
| 1M | 2.61 | 2.59 | 1.01 |
| 2M | 11.62 | 11.19 | 1.04 |
| 4M | 21.82 | 22.22 | 0.98 |
| 6M | 7.0 | 6.95 | 1.01 |
| 10M | 34.22 | 30.08 | 1.14 |
| 16M | 88.08 | 71.72 | 1.23 |
| 25M | 186.29 | 137.73 | 1.35 |
| 39M | 303.53 | 213.41 | 1.42 |
| 58M | 52.69 | 51.25 | 1.03 |
| 97M | 977.3 | 567.82 | 1.72 |
| 151M | 1992.68 | 1033.87 | 1.93 |
| 237M | 3507.36 | 1583.38 | 2.22 |
| 368M | 7425.59 | 2914.67 | 2.55 |
| 587M | 17431.13 | 5843.03 | 2.98 |
| 935M | 52686.3 | 16059.07 | 3.28 |
| 1G | 118481.47 | 52941.52 | 2.24 |
| 2G | 200245.87 | 43222.88 | 4.63 |
| 3G | 645137.02 | 279217.28 | 2.31 |
| 25G | 1495992.48 | 400483.63 | 3.74 |
andreas-wilm commented 3 years ago

Hi Bryce, this is super cool! I have a couple of questions: the file size is only a proxy for coverage (n). Would you be able to generate the same table with coverage added? To that end, you can just compute the average over the output of `samtools depth -a your.bam`. I'll get to testing this on some real data now.

andreas-wilm commented 3 years ago

This works perfectly on all the high-coverage datasets I have access to (all viral samples)! I'll add this as a default in the build process.

andreas-wilm commented 3 years ago

Sorry, one question. The returned `poibin_approximation` is a p-value, right? So I guess we should also apply the Bonferroni correction, as done further down in the code: `if (pvalue * (double)bonf_factor > sig_level) {goto free_and_exit;}`

and thus change

`if (poibin_approximation > sig_level + 0.05) {goto free_and_exit;}`

accordingly. Correct? Just double checking...
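
For concreteness, the suggested change would presumably make the shortcut condition mirror the exact-p-value check, something like the sketch below (whether the 0.05 safety margin should also be scaled is an open detail):

```c
/* Sketch of the suggested change: scale the approximate p-value by the
 * same Bonferroni factor as the exact check further down, keeping the
 * 0.05 safety margin so the shortcut stays conservative. */
if (poibin_approximation * (double)bonf_factor > sig_level + 0.05) {
    goto free_and_exit;
}
```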