Closed maciejmotyka closed 4 years ago
Yes, \eta should be calculated for each population. We set \eta to 0.09 for the whole real HIV-1 dataset.
For your understanding of calculating \eta, it's correct.
Thanks for the quick reply.
For population with average pairwise diversity of 2%, should I make d=0.02
or d=2
?
I wrote a simple R script to calculate eta
, could you please check if I got the math right?
#!/usr/bin/env Rscript
# Assign variables
args = commandArgs(TRUE)
fmin = as.double(args[1])
fmax = as.double(args[2])
err = as.double(args[3])
d = as.double(args[4])
w1 = 5
w2 = 1
# Calculate MEC improvement threshold
a = fmin * d * (1 - 15/16 * (1 - err))
eta1 = a / (a + err)
eta2 = err / 3 * fmax
eta = (eta1 ^ w1 * eta2 ^ w2) ^ (1 / (w1 + w2))
# Print variables and results
sprintf('fmin = %s', fmin)
sprintf('fmax = %s', fmax)
sprintf('err = %s', err)
sprintf('d = %s', d)
sprintf('eta1 = %.4f', eta1)
sprintf('eta2 = %.4f', eta2)
sprintf('eta = %.4f', eta)
For population with average pairwise diversity of 2%, d should be set to 0.02. This also works for err, fmin, fmax. The R script is correct!
By the way, our latest paper "A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction" may be of interest to you, which is going to appear in the proceedings of AAAI-20. We have upload it to Arxiv and more details can be found on my Github.
Thanks! I'll try GAEseq tomorrow. Happy to see development in QSR.
Regarding my last question:
when I calculate the eta
using d = 0.02
and err = 0.003
I get a value which is an order of magnitude smaller than the values mentioned in the papers (0.09) or used in the sample config files (0.0312 and 0.0395). However, when instead I plug in d = 2
and err = 0.3
I get values in the 'correct' range. That's why I suspected that I need to use the % values.
$ ./calculate_eta.R 0.01 0.8 0.003 0.02
[1] "fmin = 0.01"
[1] "fmax = 0.8"
[1] "err = 0.003"
[1] "d = 0.02"
[1] "eta1 = 0.0043"
[1] "eta2 = 0.0008"
[1] "eta = 0.0033"
$ ./calculate_eta.R 0.01 0.8 0.3 2
[1] "fmin = 0.01"
[1] "fmax = 0.8"
[1] "err = 0.3"
[1] "d = 2"
[1] "eta1 = 0.0224"
[1] "eta2 = 0.0800"
[1] "eta = 0.0277"
Any chance you could share the values used to calculate the 0.09 for HIV dataset?
Below is the python code I used to calculate \eta. It returns 0.0896. Hope it helps!
fmin = 0.1 d = 0.0845 fmax = 0.3 err = 0.001
eta1 = fmin d (1 - 15 / 16 (1 - err)) / (fmin d (1 - 15 / 16 (1 - err)) + err) eta2 = err * fmax / 3 w1 = 5; w2 = 1;
eta = (eta1 * w1 eta2 w2) (1 / (w1 + w2))
print(eta)
Thanks, everything is clear now.
I've read aBayesQR and TenSQR papers. My understanding is that the MEC improvement threshold (
eta
) should be calculated for each population being analyzed. However, I see people using the values found in the sample config files of both programs instead of calculating their own. Also, both papers report using a singleeta
value of 0.09, despite the fact that you were reconstructing populations with varying diversity and strain frequencies.According to my understanding of Appendix B of the aBayesQR paper, to calculate
eta
you need to know (or guesstimate):fmin
- frequency of the least abundant strain in the population (e.g. 0.9)fmax
- frequency of the most abundant strain in the population (e.g. 0.01)e
- error rate (e.g. 0.1)d
- diversity rate (e.g. 0.05) From these, you calculateeta1
andeta2
, and then take a weighted geometric mean to arrive at the finaleta
threshold. The weightsw1
should be5 * w2
and since no absolute values were specified I assumew1 = 5
andw2 = 1
were used.Is my understanding correct?