comments on dN/dS and pi

Hi all, I am sharing here a very kind email from Chase answering a few questions about dN/dS and pi. I hope it will be useful for the community.

As you know, I am using the "within-pool analysis" with VCF input to calculate nucleotide diversity (pi) and the dN/dS ratio.
In site_results.txt it says:
pi. Nucleotide diversity at this site.
whereas in population_summary.txt:
pi. Mean number of pairwise differences per site in the pooled sample across the whole genome.
By definition, they should be the same thing, shouldn't they? You also used the same keyword. I suppose pi_coding and pi_noncoding then refer to the coding and noncoding regions.

The site_results pi is π for a single site in the genome. Most will hopefully be 0. For population_summary, as stated in the definition, it is the mean for the whole genome (all sites), which is not the same. Correct: coding is for regions annotated as protein-coding (in the GTF) and noncoding is for regions NOT annotated as protein-coding (in the GTF).
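As a rough illustration of the difference between the two values, here is a minimal sketch. It assumes per-site pi is the proportion of read pairs at that site that differ, and that the genome-wide value is the mean of the per-site values over all sites, including invariant ones. The allele counts and the function name site_pi are made up for the example and are not taken from SNPGenie itself.

```python
from math import comb

def site_pi(allele_counts):
    """Proportion of read pairs at one site that carry different alleles."""
    n = sum(allele_counts)
    if n < 2:
        return 0.0
    # Pairs of reads sharing the same allele, summed over alleles.
    same = sum(comb(c, 2) for c in allele_counts)
    return 1 - same / comb(n, 2)

# Hypothetical read counts (A, C, G, T) at five sites; only the third varies.
sites = [
    (10, 0, 0, 0),
    (0, 10, 0, 0),
    (8, 0, 2, 0),
    (0, 0, 10, 0),
    (0, 0, 0, 10),
]

per_site = [site_pi(s) for s in sites]      # analogous to one pi per site
genome_pi = sum(per_site) / len(per_site)   # analogous to the genome-wide mean

print(per_site)
print(genome_pi)
```

Most per-site values are 0, so the mean over all sites is much smaller than the value at the single variable site.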

Then the ratio piN/piS is not automatically calculated but only presented as piN and piS, right?

In cases where it is not calculated for you, you can calculate it yourself as piN/piS (or dN/dS, depending on application).
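For anyone who wants to do this in a script, a minimal sketch follows. It assumes population_summary.txt is tab-delimited with a header row containing file, piN and piS columns; the two rows below are invented values, not real output. The helper safe_ratio is a hypothetical name, and it returns None when piS is 0 because the ratio is undefined in that case.

```python
import csv
import io

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    return None if denominator == 0 else numerator / denominator

# Invented stand-in for the contents of population_summary.txt
# (assumed layout: tab-delimited, header row with file, piN, piS).
example = (
    "file\tpiN\tpiS\n"
    "sample1.vcf\t0.0012\t0.0048\n"
    "sample2.vcf\t0.0005\t0\n"
)

ratios = {}
for row in csv.DictReader(io.StringIO(example), delimiter="\t"):
    ratios[row["file"]] = safe_ratio(float(row["piN"]), float(row["piS"]))

print(ratios)
```

To read a real file, io.StringIO(example) would be replaced with an open file handle.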

Finally, dN/dS is not calculated; only mean_dN_vs_ref (and the corresponding dS) is presented. Is it the same thing?

Good question! They are not the same. piN is the mean over all pairwise comparisons, i.e., every sequence (or read) compared to every other. In contrast, mean dN vs. the reference sequence is the mean of comparisons between the reference sequence (provided in the FASTA) and all other sequences; non-reference sequences are not compared to one another for this (they are only compared to the reference). This result therefore reflects only variation that differs from the reference. It is typically useful when you are interested in divergence from the reference only, not overall diversity in the population.
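The two kinds of comparison can be sketched with a toy example. This is only an illustration under simplifying assumptions: it counts raw nucleotide differences rather than separating nonsynonymous from synonymous changes, and the reference and sample sequences are invented.

```python
from itertools import combinations

def diffs(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Invented data: one reference and four sampled sequences.
reference = "ACGTACGT"
sample = ["ACGTACGT", "ACGAACGT", "ACGAACGT", "TCGTACGT"]

# All-pairs mean: every sampled sequence compared to every other (pi-like).
pairwise = [diffs(a, b) for a, b in combinations(sample, 2)]
mean_pairwise = sum(pairwise) / len(pairwise)

# Vs-reference mean: each sampled sequence compared to the reference only.
vs_ref = [diffs(reference, s) for s in sample]
mean_vs_ref = sum(vs_ref) / len(vs_ref)

print(mean_pairwise)
print(mean_vs_ref)
```

The two means differ because differences between non-reference sequences contribute to the all-pairs value but not to the vs-reference value.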

chasewnelson / SNPGenie

comments on dN/dS and pi #46