chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
100 stars 37 forks source link

problem with minfreq #74

Open wantingwei opened 5 months ago

wantingwei commented 5 months ago

I am recently running snpgenie within pool analysis, using command snpgenie.pl --minfreq=0.01 --snpreport=${sample}_snpgenie.vcf --vcfformat=2 --slidingwindow=30 --fastafile=./seqs/MN908947_3.fasta -- gtffile=./seqs/MN908947_3.gff.txt. I want to exclude all the variants below 1% frequency so using minfreq=0.01, however I realized all the mutations with frequency >0.99 have been excluded. I am not sure why this is happening, can someone help me figure it out. I attached the screen shot of the script out put and my vcf file. Screenshot 2024-02-12 at 1 16 17 PM Screenshot 2024-02-12 at 1 17 36 PM

singing-scientist commented 5 months ago

Greetings, Wanting! Thanks for this question. In fact, this is the desired behavior. For example, suppose there is a polymorphic site with C=9999 and T=1. The frequency of T is 1 / (9999 + 1) = 0.0001, or 0.01%, and it doesn't matter whether it's the REF or ALT allele. If C is the REF allele, then the VCF will report ALT as T with AF=0.0001; alternatively, if T if the REF allele, then the VCF will report ALT as C with AF=0.9999. However, in both cases there is an allele (T) with a frequency below your cut-off, and it will be eliminated. Put another way, the frequency filter will eliminate an allele regardless of whether it is REF or ALT. Note that which allele is REF is often arbitrary (e.g., a standard reference sequence against which variants were called). Let me know if this makes sense.

If there's a reason you wanted to filter out low-frequency ALT alleles only (but not low-frequency REF alleles), you could filter your VCF file on your own before submitting it to SNPGenie, and then simply not use --minfreq. However, in most circumstances I don't think it makes sense to give low-frequency REF alleles a 'free pass' but eliminate low-frequency ALT alleles. But, it's possible your situation is an exception!

Hope this helps, let me know! Chase

wantingwei commented 5 months ago

Thank you for response so fast. Just want to confirm that I understand correctly if C is the REF with 0.99% and T is ALT 0.01%. The T will be filtered out but C will remain in the calculation.

singing-scientist commented 5 months ago

My pleasure! It's correct that, in this example, T will be filtered out and C will not be filtered out. However, regarding calculation, it depends what you mean. If you mean π, there is no longer a 'calculation', i.e., the site will no longer be polymorphic so its π = 0. Let me know!

wantingwei commented 5 months ago

Gotcha. I am trying to compare piN/piS to access selective pressure for my sample. I am sequencing omicron samples, that call variants based on wuhan-1, in that case should I consider those "fixed or 99%" mutation. Or it would be more reasonable to filter the vcf without lower variant and using default minfreq in the program.

singing-scientist commented 5 months ago

Understood! If you're interested in constraint then πn/πs is the way to go. To me, your question is more about how to make sure to quality filter variants — using minfreq is just one overly simple way to do that. Indeed, when you use a minfreq cutoff of 0.01, you are effectively saying you believe that anything with AF < 1% isn't real, i.e., is a sequencing error.

I see you're working with Tom — if you have questions about study design and QC that's probably beyond the ambit of SNPGenie and GitHub, but feel free to send an email to us! Of interest, we just published a program called VCFgenie to perform QC filtering of iSNVs that you may wish to use:

Let me know! C