Calculating MEC improvement threshold

maciejmotyka commented 4 years ago

I've read aBayesQR and TenSQR papers. My understanding is that the MEC improvement threshold (eta) should be calculated for each population being analyzed. However, I see people using the values found in the sample config files of both programs instead of calculating their own. Also, both papers report using a single eta value of 0.09, despite the fact that you were reconstructing populations with varying diversity and strain frequencies.

According to my understanding of Appendix B of the aBayesQR paper, to calculate eta you need to know (or guesstimate):

fmin - frequency of the least abundant strain in the population (e.g. 0.9)
fmax - frequency of the most abundant strain in the population (e.g. 0.01)
e - error rate (e.g. 0.1)
d - diversity rate (e.g. 0.05) From these, you calculate eta1 and eta2, and then take a weighted geometric mean to arrive at the final eta threshold. The weights w1 should be 5 * w2 and since no absolute values were specified I assume w1 = 5 and w2 = 1 were used.

Is my understanding correct?

WuLoli commented 4 years ago

Yes, \eta should be calculated for each population. We set \eta to 0.09 for the whole real HIV-1 dataset.

For your understanding of calculating \eta, it's correct.

maciejmotyka commented 4 years ago

Thanks for the quick reply. For population with average pairwise diversity of 2%, should I make d=0.02 or d=2? I wrote a simple R script to calculate eta, could you please check if I got the math right?

#!/usr/bin/env Rscript

# Assign variables
args = commandArgs(TRUE)
fmin = as.double(args[1])
fmax = as.double(args[2])
err =  as.double(args[3])
d = as.double(args[4])
w1 = 5
w2 = 1

# Calculate MEC improvement threshold
a = fmin * d * (1 - 15/16 * (1 - err))
eta1 = a / (a + err)
eta2 = err / 3 * fmax
eta = (eta1 ^ w1 * eta2 ^ w2) ^ (1 / (w1 + w2))

# Print variables and results
sprintf('fmin = %s', fmin)
sprintf('fmax = %s', fmax)
sprintf('err = %s', err)
sprintf('d = %s', d)
sprintf('eta1 = %.4f', eta1)
sprintf('eta2 = %.4f', eta2)
sprintf('eta = %.4f', eta)

WuLoli commented 4 years ago

For population with average pairwise diversity of 2%, d should be set to 0.02. This also works for err, fmin, fmax. The R script is correct!

By the way, our latest paper "A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction" may be of interest to you, which is going to appear in the proceedings of AAAI-20. We have upload it to Arxiv and more details can be found on my Github.

maciejmotyka commented 4 years ago

Thanks! I'll try GAEseq tomorrow. Happy to see development in QSR.

Regarding my last question: when I calculate the eta using d = 0.02 and err = 0.003 I get a value which is an order of magnitude smaller than the values mentioned in the papers (0.09) or used in the sample config files (0.0312 and 0.0395). However, when instead I plug in d = 2 and err = 0.3 I get values in the 'correct' range. That's why I suspected that I need to use the % values.

$ ./calculate_eta.R 0.01 0.8 0.003 0.02
[1] "fmin = 0.01"
[1] "fmax = 0.8"
[1] "err = 0.003"
[1] "d = 0.02"
[1] "eta1 = 0.0043"
[1] "eta2 = 0.0008"
[1] "eta = 0.0033"

$ ./calculate_eta.R 0.01 0.8 0.3 2
[1] "fmin = 0.01"
[1] "fmax = 0.8"
[1] "err = 0.3"
[1] "d = 2"
[1] "eta1 = 0.0224"
[1] "eta2 = 0.0800"
[1] "eta = 0.0277"

Any chance you could share the values used to calculate the 0.09 for HIV dataset?

WuLoli commented 4 years ago

Below is the python code I used to calculate \eta. It returns 0.0896. Hope it helps!

fmin = 0.1 d = 0.0845 fmax = 0.3 err = 0.001

eta1 = fmin d (1 - 15 / 16 (1 - err)) / (fmin d (1 - 15 / 16 (1 - err)) + err) eta2 = err * fmax / 3 w1 = 5; w2 = 1;

eta = (eta1 * w1 eta2 w2) (1 / (w1 + w2))

print(eta)

maciejmotyka commented 4 years ago

Thanks, everything is clear now.

SoYeonA / TenSQR

Calculating MEC improvement threshold #6