WuLoli / GAEseq

GAEseq: A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction
7 stars 0 forks source link

QSR: setting SNV threshold #1

Closed maciejmotyka closed 4 years ago

maciejmotyka commented 4 years ago

I'm trying to reconstruct viral strains from pooled samples. I'm using GAEseq_viral.py. The config included with GAEseq_haplo.py recommends to set the SNP_thres to "0.5 / ploidy for haplotype assembly and 0 for viral quasispecies", however, the config that comes with GAEseq_viral.py doesn't have such recommendation.

I ran GAEseq_viral.py overnight on i5 CPU with SNP_thresh = 0 and it didn't even complete 50 experiments in 9 hours. And this is with a small genome of 2.7kbp.

Let's say I'm interested in resolving strains down to fmin = 0.01 relative frequency. My intuition is that I only need to include SNPs that have frequency of 0.01 and above. To take into account sequencing errors let's lower it by some small margin to 0.009.

Is it beneficial to lower the SNP_thres below this naive estimate? Does it increase neural net quality by giving it more noise to learn to ignore?

WuLoli commented 4 years ago

----"however, the config that comes with _GAEseqviral.py doesn't have such recommendation." Thanks for pointing it out. Just set the _SNPthres to 0 in the config file for _GAEseqviral.py.

----"I ran GAEseq_viral.py overnight on i5 CPU with SNP_thresh = 0 and it didn't even complete 50 experiments in 9 hours. And this is with a small genome of 2.7kbp." Yes, it's slow on CPU. GPUs could accelerate the process.

----"Is it beneficial to lower the SNP_thres below this naive estimate?" We set _SNPthres to 0 for viral population reconstruction such that any positions with non-identical nucleotides are considered as SNPs, making sure that no SNPs are ignored. Even if a position is considered as a SNP by mistake due to sequencing errors, it won't affect the reconstruction results.

----"Does it increase neural net quality by giving it more noise to learn to ignore?" No, it doesn't. Noise is actually sequencing error. Large sequencing error would have negative impacts on most reconstruction algorithms.

maciejmotyka commented 4 years ago

Ok, so setting SNP_thres = 0 will not hurt the reconstruction with GAEseq. But I believe that all the SNPs that come from just a single read are purely a sequencing error and add a lot of unnecessary computational burden.

I run ExtractMatrix with a bunch of different SNP_thres.

SNP_thresh n of SNPs
0.1 55
0.05 65
0.04 65
0.03 95
0.02 166
0.01 335
0.005 362
0.004 362
0.003 363
0.001 363
1e-323 363
1e-324 2745
0 2745

I don't know how exactly ExtractMatrix calculates the SNP frequency, but it seems that all of the SNPs appering around 1e-324 cannot be atributed to anything but sequencing error. My guess is that setting the SNP_thres = 0.003 in this case will drasticaly decrease the computational time without sacrificing any accuracy. I might be totally wrong though.

WuLoli commented 4 years ago

I see. In this case, it could be fine to set _SNPthres = 0.003 to reduce the computational cost. Setting _SNPthres to 0 enables the detection of very rare viral population but increases the computational time.