chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
106 stars 37 forks source link

Missing sites in FASTA files #27

Closed sialsaffar closed 4 years ago

sialsaffar commented 4 years ago

Dear Chase,

Thank you for writing this tool, It is very helpful and I learned much from it.

If you please, I have a question regarding missing sites representation. I have a multi-sample VCF file generated with GATK in which some samples have missing sites reported for some variants. I want to generate a multi-sequence alignment in FASTA format in order to use it with "snpgenie_within_group.pl" to calculate dN/dS per gene. Do I have to use the REF base to substitute missing sites (not recommended) or is there a way for SNPGenie to not take those sites into consideration (e.g. using "-" to denote a missing site)?

If my samples are diploid and come from one population, will there be a difference between reported dN/dS from "snpgenie_within_group.pl" and piN/piS if I wanted to use "snpgenie.pl" with --vcfformat=1 instead?

Sincerely, Samer

singing-scientist commented 4 years ago

Thanks very much for the kind words and for using SNPGenie, Samer!

I agree with your assessment that using REF for missing sites is a bad idea. Either '-' or 'N' will work for that purpose when using snpgenie_within_group.pl. The codon(s)/sequence(s) in which the '-' or 'N' occurs will be ignored when they are used in a pairwise comparison (i.e., the "pairwise deletion" method). Note that the data of all the other sequences will still be used, if they contain completely defined codons (i.e., codons not containing an '-' or 'N'). In general, I recommend making sure that a codon site has at least 6 sequences with fully defined codons before 'allowing' it to contribute to dN and dS values. In other words, if some codon positions comprise mostly missing data, it can be advisable to exclude them from your overall dN/dS estimate(s) because they may not be reliable.

Indeed, using a VCF with snpgenie.pl should yield the same answer (save possible rounding error), and πN and πS are simply the mean pairwise dN and dS values for a population, respectively. I recommend using the one for which input file preparation is easiest. I almost always use the within-group script when I have a choice.

Let me know if that helps!

Yours, Chase

sialsaffar commented 4 years ago

Thank you very much for your prompt and thorough reply! Exactly what I wanted to hear :)

Sincerely, Samer

singing-scientist commented 4 years ago

My pleasure! I will close this issue for now; please feel free to re-open if you have any follow-ups.