chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
106 stars 37 forks source link

Question on interpreting the output parameters #49

Closed Chongli-90 closed 1 year ago

Chongli-90 commented 3 years ago

Hi Chase!

Thanks for this awesome tool! I have a question regards on an output parameter named "mean_gdiv". I saw it defined as the mean gene diversity but it's still hard for me to differentiate it from the pi. Could you give me a more detailed definition of mean_gdiv, please? Also, does the mean_dN_vs_ref/mean_dS_vs_ref means the same with dN/dS? Thank you so much for your help!

Regards,

singing-scientist commented 3 years ago

Greetings, Chongli! That's a great question. First, we note that gene diversity (expected heterozygosity) at a nucleotide site as h=1-∑x^2, where x takes the frequency of every allele at the site. In other words (for diploids), expected heterozygosity is 1 minus the expected frequency of each homozygote genotype. This is a bit of a remnant of thinking about alleles as larger loci, not at the finer level of individual nucleotide sites.

Nucleotide diversity (π) is only slightly different from h. It is calculated as the mean number of pairwise differences per nucleotide site. Imagine three sequences. Examine one site, suppose seq1=A, seq2=A, and seq3=G. There are three possible pairwise comparisons: seq1 vs. seq2, seq1 vs. seq3, and seq2 vs. seq3. Two of those comparisons differ (A ≠ G), so the mean number of pairwise differences is 2 differences/3 comparisons = 0.67. You divide by 1 because it's only one site, to arrive at π=0.67/1=0.67 per site.

When considering all sites, h and π both estimate the population genetic parameter theta = 4Nu. However, π computes the observed differences (actual pairs) while h computes the "expected" value at infinite N. π is more real-world.

For most purposes, use π. However, h is still useful to compare MEAN gene diversity of POLYMORPHIC sites. For example, one could compare mean gene diversity at polymorphic sites that are nonsynonymous vs. those that are synonymous. Because one takes the mean at polymorphic sites, one does not need to worry as much about ascertainment bias (i.e., whether nonsynonymous sites were more likely to be detected than synonymous sites).

Let me know if that helps!

Chongli-90 commented 3 years ago

Thank you for your reply and that's super helpful! I am using the SNPGenie for my influenza mutations. I already have the πN/πS and I still need parameters to show the mean pairwise distance/divergence, do you think the π or mean_gdiv can be used for presenting that?

I also have a question on using the R script to calculate Tajima's D. If I calculate the Tajima's D for five samples, the number of π I typed in the command should be the π number of the pooled five samples. Am I right?

Thank you so much!

singing-scientist commented 3 years ago

π is the mean pairwise divergence within a deeply sequenced sample. So, if that is what you want to determine, then it's correct. However, if you want to determine divergence between different samples, that is not what you want. Perhaps you could explain exactly what you mean by "the mean pairwise distance/divergence".

I would avoid gdiv unless for the purpose I mentioned above (comparing the mean at N vs. S polymorphic sites).

For Tajima's D (and π), it is on a per-sample basis. So, π should be determined for a single sample, and that can then be used for the Tajima's D input.

I note that the the Tajima's D script was just added and it not extensively tested. Thus, if you're willing, please give it a try and let me know if the results are correct for you!

Thanks! Chase

Chongli-90 commented 1 year ago

Dear Chase,

I am using the R script to calculate Tajima's D estimate for the whole genome of the influenza sample. I have some segregation sites located in two coding regions (which affect two proteins). When I calculate the number of segregation sites as input, should this site be considered as a single site or two sites (since they affect two different protein products)? Thank you so much!

Regards,

singing-scientist commented 1 year ago

Hello, Chongli! That really depends on your question. If you are calculating the statistic for each gene separately, you'd probably want to include it in each each. But if you're calculating one statistic for the whole genome, then it seems best to count each physical site once. I think it will depend on what hypothesis you are testing! Let me know.