Closed bkille closed 3 years ago
Hello again! Just checking in. I have ran LoFreq on a set of different size bam files and the results are below. Diffing between the files shows identical variant calls. I have also increased the cutoff for the p-value (was sigma + 0.01
now sigma + 0.05
) since it has no noticeable effect on the runtime. The optimization has little effect in the smaller files, as n
and k
are fairly small, our optimization doesn't even run if the n<100
. Please let me know if you have any questions!
File Size | Original Runtime (seconds) | Improved Runtime (seconds) | Speedup |
---|---|---|---|
41K | 0.03 | 0.03 | 1.0 |
415K | 0.24 | 0.24 | 1.0 |
644K | 1.69 | 1.69 | 1.0 |
1M | 6.33 | 6.3 | 1.0 |
1M | 2.61 | 2.59 | 1.01 |
2M | 11.62 | 11.19 | 1.04 |
4M | 21.82 | 22.22 | 0.98 |
6M | 7.0 | 6.95 | 1.01 |
10M | 34.22 | 30.08 | 1.14 |
16M | 88.08 | 71.72 | 1.23 |
25M | 186.29 | 137.73 | 1.35 |
39M | 303.53 | 213.41 | 1.42 |
58M | 52.69 | 51.25 | 1.03 |
97M | 977.3 | 567.82 | 1.72 |
151M | 1992.68 | 1033.87 | 1.93 |
237M | 3507.36 | 1583.38 | 2.22 |
368M | 7425.59 | 2914.67 | 2.55 |
587M | 17431.13 | 5843.03 | 2.98 |
935M | 52686.3 | 16059.07 | 3.28 |
1G | 118481.47 | 52941.52 | 2.24 |
2G | 200245.87 | 43222.88 | 4.63 |
3G | 645137.02 | 279217.28 | 2.31 |
25G | 1495992.48 | 400483.63 | 3.74 |
Hi Bryce, this is super cool! I have a couple of questions: the file size is only a proxy for coverage (n
). Would you be able to generate the same table with coverage added? To that end you can just compute the average on the output of samtools depth -a your.bam
. I'll get to testing this on some real date now
This works perfectly on all high coverage datasets I have access to (all viral samples)! I'll add this as default into to the build process
Sorry, one question. The returned poibin_approximation
is a p-value, right? So I guess we should also use Bonferroni correction as further down below in the code:
if (pvalue * (double)bonf_factor > sig_level) {goto free_and_exit;}
and thus change
if (poibin_approximation > sig_level + 0.05) {goto free_and_exit;}
accordingly. Correct? Just double checking...
Hello!
This is a PR for an additional heuristic in the LoFreq algorithm. Many times, the Poisson Binomial calculation either yields numbers that are extremely high p-values extremely low p-values. Therefore, we can use an approximation to cut out performing an exact calculation when the p-value is sufficiently above the significance level. By specifying some threshold above the significance level, we can be extremely confident that the approximation shortcut will never skip exact calculation on columns which have significant p-values.
We use a Poisson approximation to the binomial distribution where the mean is the sum of the probabilities and the variable remains the same (most frequent variant count). There are indeed other approximations that we plan on trying, such as the refined normal approximation (see section 3 of this pub for a brief explanation of the approximations). To calculate the Poisson CDF, we use the GNU scientific library. I have not added a
--with-gsl
flag to the configure (I do not understand.m4
wizardry yet :slightly_smiling_face:), so depending on where it is installed for users, they may need to add the location of thelibopenblas.so.0
shared library to theirLD_LIBRARY_PATH
environment variable in order to run.