JosephCrispell / homoplasyFinder

A tool to identify and annotate homoplasies on a phylogeny and sequence alignment
GNU General Public License v3.0
Questions about CI #21

Closed OmonkeyGOD closed 3 years ago

OmonkeyGOD commented 3 years ago

Hi,

I have some questions about the CI calculation. In your paper, you state:

> The consistency index for each site in the alignment is then calculated by dividing the minimum number of changes on the phylogeny by the number of different nucleotides observed at that site minus one.

In the program's example, all ten homoplasy sites have a MinimumNumberChangesOnTree of 2. Could you explain a little more about this? How should the minimum number of changes on a tree be interpreted? All the sites have a CI of 0.5. Does this mean that five different nucleotides are observed at each of them?

In the paper, you also give a diagram demonstrating how to calculate the tree length of one site in a nucleotide alignment. If you were to calculate the CI for the site shown in that tree, what is the minimum number of changes, and what is the number of different nucleotides observed? Is the minimum number of changes related to the tree length? Thanks in advance.

JosephCrispell commented 3 years ago

Hi,

Thanks for highlighting that quote:

> The consistency index for each site in the alignment is then calculated by dividing the minimum number of changes on the phylogeny by the number of different nucleotides observed at that site minus one.

I need to double check this, but I think I have made a mistake here: it is the wrong way round, or very badly worded. In the code for homoplasyFinder, and in the definitions online, the consistency index is calculated as:

consistencyIndex <- minNumberOfPossibleChanges / numberOfChangesObservedOnPhylogeny

where minNumberOfPossibleChanges is the number of different nucleotides observed at a site minus 1.

For the results from the example data:

   Position ConsistencyIndex CountsACGT MinimumNumberChangesOnTree
1        57              0.5 0:0:139:16                          2
2       179              0.5  0:151:0:4                          2
3       207              0.5  5:0:150:0                          2
4       241              0.5  6:149:0:0                          2
5       339              0.5  0:0:4:151                          2
6       534              0.5  152:0:0:3                          2
7       559              0.5  0:0:2:153                          2
8       689              0.5 16:139:0:0                          2
9       696              0.5  5:150:0:0                          2
10      771              0.5  0:6:0:149                          2

For each of these 10 sites, the minimum number of changes on the phylogeny is 2. Each site has two alleles present. For example, at position 57, 139 of the sequences have a G and 16 have a T, so minNumberOfPossibleChanges is 2 - 1 = 1. The numberOfChangesObservedOnPhylogeny is equal to the MinimumNumberChangesOnTree (2). Therefore the consistency index is 1 / 2 = 0.5.
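
As a quick check of that arithmetic, here is a minimal Python sketch that recomputes the consistency index for every row of the table. The function name `consistency_index` and the `rows` dictionary are made up for illustration and are not part of homoplasyFinder. The sketch assumes each site is variable, so the number of changes on the tree is at least 1.

```python
# CountsACGT strings and MinimumNumberChangesOnTree values copied from the
# example output in this thread, keyed by alignment position.
rows = {
    57: ("0:0:139:16", 2),
    179: ("0:151:0:4", 2),
    207: ("5:0:150:0", 2),
    241: ("6:149:0:0", 2),
    339: ("0:0:4:151", 2),
    534: ("152:0:0:3", 2),
    559: ("0:0:2:153", 2),
    689: ("16:139:0:0", 2),
    696: ("5:150:0:0", 2),
    771: ("0:6:0:149", 2),
}


def consistency_index(counts_acgt: str, changes_on_tree: int) -> float:
    """Hypothetical helper: (distinct nucleotides - 1) / changes on the tree.

    Assumes changes_on_tree >= 1, i.e. the site is variable.
    """
    counts = [int(c) for c in counts_acgt.split(":")]
    # Number of different nucleotides actually observed at the site.
    n_distinct = sum(1 for c in counts if c > 0)
    min_possible_changes = n_distinct - 1
    return min_possible_changes / changes_on_tree


ci_by_position = {
    position: consistency_index(counts, changes)
    for position, (counts, changes) in rows.items()
}

for position, ci in ci_by_position.items():
    print(position, ci)  # every row prints a CI of 0.5
```

Every site in the table has exactly two different nucleotides, so the numerator is always 1. A CI of 0.5 therefore reflects the two changes on the tree, not five observed nucleotides.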

In the article, the tree length for a given site can be considered the numberOfChangesObservedOnPhylogeny: it is the minimum number of changes needed to explain the nucleotides present at the site given the structure of the phylogeny.
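
To make the link between tree length and the consistency index concrete, here is a small Python sketch of the classic Fitch parsimony count for a single site. It is an illustration only: it is not taken from the homoplasyFinder source, which may compute this value differently. The function `fitch_changes`, the nested-tuple tree encoding, and the four-tip example are all assumptions made for this sketch.

```python
def fitch_changes(tree, states):
    """Minimum number of changes for one site on a rooted binary tree (Fitch).

    tree:   nested 2-tuples; each leaf is a tip name (str).
    states: dict mapping tip name -> nucleotide observed at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):  # leaf: its state set is the observed base
            return {states[node]}
        left, right = node
        left_set, right_set = visit(left), visit(right)
        shared = left_set & right_set
        if shared:
            return shared  # children agree on a state: no change needed here
        changes += 1  # children disagree: one change is required
        return left_set | right_set

    visit(tree)
    return changes


# Hypothetical site with two nucleotides across four tips.
states = {"t1": "A", "t2": "G", "t3": "A", "t4": "G"}
n_distinct = len(set(states.values()))

# The same site on two made-up topologies.
congruent_tree = (("t1", "t3"), ("t2", "t4"))    # A tips and G tips each form a clade
conflicting_tree = (("t1", "t2"), ("t3", "t4"))  # A and G are mixed within each clade

for name, tree in [("congruent", congruent_tree), ("conflicting", conflicting_tree)]:
    changes = fitch_changes(tree, states)
    print(name, changes, (n_distinct - 1) / changes)
# congruent 1 1.0
# conflicting 2 0.5
```

On the congruent tree a single change explains the site, so the CI is 1. On the conflicting tree the same nucleotide has to arise twice, so the tree length is 2 and the CI drops to 0.5, as for the ten sites in the example output.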

OmonkeyGOD commented 3 years ago

Thanks a lot.