Closed maciejmotyka closed 4 years ago
----"however, the config that comes with _GAEseqviral.py doesn't have such recommendation." Thanks for pointing it out. Just set the _SNPthres to 0 in the config file for _GAEseqviral.py.
----"I ran GAEseq_viral.py overnight on i5 CPU with SNP_thresh = 0 and it didn't even complete 50 experiments in 9 hours. And this is with a small genome of 2.7kbp." Yes, it's slow on CPU. GPUs could accelerate the process.
----"Is it beneficial to lower the SNP_thres below this naive estimate?" We set _SNPthres to 0 for viral population reconstruction such that any positions with non-identical nucleotides are considered as SNPs, making sure that no SNPs are ignored. Even if a position is considered as a SNP by mistake due to sequencing errors, it won't affect the reconstruction results.
----"Does it increase neural net quality by giving it more noise to learn to ignore?" No, it doesn't. Noise is actually sequencing error. Large sequencing error would have negative impacts on most reconstruction algorithms.
Ok, so setting SNP_thres = 0
will not hurt the reconstruction with GAEseq. But I believe that all the SNPs that come from just a single read are purely a sequencing error and add a lot of unnecessary computational burden.
I run ExtractMatrix
with a bunch of different SNP_thres
.
SNP_thresh | n of SNPs |
---|---|
0.1 | 55 |
0.05 | 65 |
0.04 | 65 |
0.03 | 95 |
0.02 | 166 |
0.01 | 335 |
0.005 | 362 |
0.004 | 362 |
0.003 | 363 |
0.001 | 363 |
1e-323 | 363 |
1e-324 | 2745 |
0 | 2745 |
I don't know how exactly ExtractMatrix
calculates the SNP frequency, but it seems that all of the SNPs appering around 1e-324 cannot be atributed to anything but sequencing error. My guess is that setting the SNP_thres = 0.003
in this case will drasticaly decrease the computational time without sacrificing any accuracy. I might be totally wrong though.
I see. In this case, it could be fine to set _SNPthres = 0.003 to reduce the computational cost. Setting _SNPthres to 0 enables the detection of very rare viral population but increases the computational time.
I'm trying to reconstruct viral strains from pooled samples. I'm using
GAEseq_viral.py
. The config included withGAEseq_haplo.py
recommends to set theSNP_thres
to "0.5 / ploidy for haplotype assembly and 0 for viral quasispecies", however, the config that comes withGAEseq_viral.py
doesn't have such recommendation.I ran
GAEseq_viral.py
overnight on i5 CPU withSNP_thresh = 0
and it didn't even complete 50 experiments in 9 hours. And this is with a small genome of 2.7kbp.Let's say I'm interested in resolving strains down to
fmin = 0.01
relative frequency. My intuition is that I only need to include SNPs that have frequency of 0.01 and above. To take into account sequencing errors let's lower it by some small margin to 0.009.Is it beneficial to lower the
SNP_thres
below this naive estimate? Does it increase neural net quality by giving it more noise to learn to ignore?