Understanding how within-between group works.

migrau commented 4 years ago

When using the normal snpgenie (WITHIN-POOL ANALYSIS), it uses a reference FASTA and a list of variants to calculate the nonsynonymous/synonymous estimates. However, I'm a bit confused by the within-group and between-group scripts, because the inputs are only a FASTA alignment and the annotation file (which contains only the CDS annotations), without a reference. Is it comparing every sequence against all the others at each position? Many thanks,

singing-scientist commented 4 years ago

Many thanks for the question and for using SNPGenie! You are exactly right. The within- and between-group scripts compute traditional mean dN and dS for all pairwise comparisons of sequences in the alignment. For within-group, this is the mean of all nC2 = (n^2 - n) / 2 pairwise comparisons within an alignment of n sequences. For between-group, this is the mean of all n*m pairwise comparisons between 2 alignments with n and m sequences, respectively. In other words, the between-group script identifies which "group" a sequence belongs to by which FASTA file it comes from. Please let me know if this helps.

Yours, Chase
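
To make the pair counts concrete, here is a minimal Python sketch of how the comparisons described above could be enumerated. It is not SNPGenie's code: the function names and the sequence labels are hypothetical, and it only lists the pairs, it does not compute dN or dS.

```python
from itertools import combinations, product


def within_group_pairs(seqs):
    """All unordered pairs of distinct sequences within one alignment (nC2 pairs)."""
    return list(combinations(seqs, 2))


def between_group_pairs(group_a, group_b):
    """All pairs taking one sequence from each alignment (n*m pairs)."""
    return list(product(group_a, group_b))


# Hypothetical sequence labels; in practice each group would be one FASTA file.
group_a = ["a1", "a2", "a3", "a4"]
group_b = ["b1", "b2", "b3"]

n, m = len(group_a), len(group_b)
within = within_group_pairs(group_a)
between = between_group_pairs(group_a, group_b)

# The counts match the formulas in the comment above.
assert len(within) == (n * n - n) // 2
assert len(between) == n * m
print(len(within), len(between))
```

Note that the between-group list never pairs two sequences from the same file, which is what distinguishes it from running the within-group analysis on a merged alignment.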

joanmarticarreras commented 3 years ago

Hi,

Following up on the topic, within-group dN/dS seems rather intuitive: the mean selective coefficient within the group. If I have 10 variants of the HIV gag gene, I can get the dN/dS of the gene, taking into account the gene diversity that I have access to.

For between-groups dN/dS I have a bit more of a problem. Are the groups compared? How? Is it just like a within-group comparison but stratifying by files? A way to parallelize pair-wise comparisons and compute dN/dS?

If I want to compare differences and/or dN/dS between 2 groups of sequences, will it help with that?

Many thanks!

Joan

singing-scientist commented 3 years ago

Hello @joanmarticarreras ! Thanks for the ideas. First, I think it is important to note that dN/dS is not equivalent to a selection coefficient (s). Most literally, dN/dS can be interpreted as the probability of fixation of a nonsynonymous mutation compared to a synonymous mutation, where synonymous mutations are assumed to be neutral (or relatively neutral, as compared to nonsynonymous ones). Moreover, there is often a dN/dS value for a region or gene, in which case it summarizes many different variants, each of which might have its own s. It is also not necessarily equivalent to gene diversity. Given these points, I don't think I understand the first point.

The between-group method is described in the documentation here: https://github.com/chasewnelson/SNPGenie#snpgenie-between

It is equivalent to "inter-populational" diversity, originally described by Nei and Li: https://pubmed.ncbi.nlm.nih.gov/291943/

It is also equivalent to the "between-group" method used in the MEGA software. In short, if there are two groups, say of 5 and 9 sequences, then all 5*9=45 comparisons between (but not within) the two groups are performed to estimate dN/dS. Let me know if this makes sense!
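
As a rough illustration of the averaging step, the following Python sketch takes the mean of a pairwise distance over all between-group pairs. It rests on a loud assumption: the distance used here is a plain proportion of differing nucleotides (p-distance), standing in for the real dN or dS estimate, which requires counting synonymous and nonsynonymous sites and differences per codon. The function names and the toy sequences are hypothetical and are not taken from SNPGenie.

```python
from itertools import product


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions that differ (a stand-in, NOT dN or dS)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(x != y for x, y in zip(seq_a, seq_b))
    return diffs / len(seq_a)


def mean_between_group_distance(group_a, group_b, distance=p_distance):
    """Mean of `distance` over every pair with one sequence from each group."""
    pairs = list(product(group_a, group_b))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Toy aligned sequences; each list plays the role of one FASTA file.
group_a = ["ATGAAA", "ATGAAG"]
group_b = ["ATGCAA", "ATGCAG", "ATGAAA"]

mean_dist = mean_between_group_distance(group_a, group_b)
print(f"{len(group_a) * len(group_b)} comparisons, mean distance {mean_dist:.4f}")
```

Under this sketch, one would pass a dN estimator and a dS estimator in turn as `distance` and then take the ratio of the two means; whether SNPGenie averages in exactly this order is not stated in this thread, so treat that as an assumption.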

chasewnelson / SNPGenie

Understanding how within-between group works. #28