caitiecollins / treeWAS

treeWAS: A Phylogenetic Tree-Based Tool for Genome-Wide Association Studies in Microbes

Nsim loci simulation and subsequent test understanding #62

Closed DorothyTamYiLing closed 2 years ago

DorothyTamYiLing commented 2 years ago

Hi Caitie,

Thanks for answering my question last time. I am trying to understand your paper better, and this time I have two questions:

1) Your paper says: "Each of the Nsim loci is simulated along the phylogenetic tree, from root to tips, undergoing a number of substitutions drawn from the homoplasy distribution on branches selected randomly with probabilities proportional to branch length". I find this sentence difficult to understand. Is the "homoplasy distribution" a distribution whose x-axis is sites on the genome and whose y-axis is the degree of homoplasy? How is the degree of homoplasy defined (does a higher number of substitutions at a site mean a higher degree of homoplasy)? Also, what does "probability" mean here?

2) The subsequent test measures the proportion of the tree in which the genotype and phenotype co-exist. How is this proportion measured? Is it measured by the number of branches?

Thanks! Dorothy

DorothyTamYiLing commented 2 years ago

Hi Caitie,

Sorry for messaging you again. It would be great if I could get some insights from you regarding these questions.

Thanks, Dorothy

caitiecollins commented 2 years ago

Hi Dorothy,

No problem, sorry for the delay.

(1) Regarding "homoplasy" and "homoplasy distribution", I have also found both terms to be confusing at times, as they are used to mean different things by different people. In this case, homoplasy is defined as the (minimum) number of substitutions undergone by a given site. The homoplasy distribution is a summary of the number of substitutions undergone by all the sites in the dataset. It's a histogram whose x-axis is the number of substitutions, and whose y-axis is the frequency of that number of substitutions (i.e. the number of sites that undergo that many substitutions). There are a couple of them hiding in Supplementary Figure 7 of the treeWAS paper (but I am ashamed to see that they don't have axis labels). Here is one from my thesis:

[Figure: SimBac homoplasy distribution]

From this you can see, for example, that roughly 3,000 of the genetic loci in this dataset undergo 3 substitutions somewhere along the tree. Each of those sites will undergo those 3 substitutions on a different set of 3 branches, but they will each be more likely to undergo substitutions on longer branches of the tree (as branch length is defined by the number of substitutions that occur along its length).
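To make the definition concrete, here is a minimal Python sketch of how such a histogram could be tabulated. It is only an illustration, not code from treeWAS: the function name `homoplasy_distribution` and the per-site counts are made up, and it assumes the (minimum) number of substitutions per site has already been computed.

```python
from collections import Counter

def homoplasy_distribution(subs_per_site):
    """Map each substitution count to the number of sites with that count."""
    return dict(sorted(Counter(subs_per_site).items()))

# Hypothetical per-site (minimum) substitution counts for 10 loci.
subs_per_site = [1, 1, 2, 3, 3, 3, 1, 2, 5, 1]

dist = homoplasy_distribution(subs_per_site)
print(dist)  # {1: 4, 2: 2, 3: 3, 5: 1}
```

The keys play the role of the x-axis (number of substitutions) and the values the y-axis (number of sites), so in this toy dataset 3 sites undergo 3 substitutions each.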

The quote you are referring to in your question comes from the part of the paper discussing how the null (non-associated) genetic data is simulated. To ensure that this simulated genetic dataset will have the same phylogenetic relationships as the original dataset, we use the original tree as the backbone for these simulations. So, for each simulated genetic locus, we randomly draw the number of substitutions to occur on that site from the homoplasy distribution. Suppose we draw Nsub=5; we then need to assign these 5 substitutions to occur on 5 branches of the tree. To ensure that more substitutions occur on longer branches of the tree and fewer on shorter branches, we select the 5 branches with probability proportional to branch length. For example, if the length of the longest branch is 20% of the total branch length (the sum of the lengths of all the branches in the tree), then we would select that branch 20% of the time, so on average about 1 of the 5 substitutions would fall on that branch.
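The two random draws described above can be sketched in Python as follows. This is a rough illustration rather than the treeWAS implementation: the function names, the toy homoplasy distribution, and the branch lengths are all hypothetical, and the sketch assumes that the branches for one locus are drawn without replacement (so that 5 substitutions land on 5 distinct branches). It also stops at choosing branches and does not propagate states from root to tips.

```python
import random

def draw_n_subs(homoplasy_dist, rng):
    """Draw a substitution count, weighted by how many sites show that count."""
    counts = list(homoplasy_dist.keys())
    freqs = list(homoplasy_dist.values())
    return rng.choices(counts, weights=freqs, k=1)[0]

def pick_branches(branch_lengths, n_subs, rng):
    """Pick n_subs distinct branch indices; each draw is weighted by branch length.

    Assumption: sampling without replacement, and all branch lengths > 0.
    """
    remaining = list(range(len(branch_lengths)))
    chosen = []
    for _ in range(min(n_subs, len(remaining))):
        weights = [branch_lengths[i] for i in remaining]
        idx = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(idx)
        remaining.remove(idx)
    return chosen

rng = random.Random(1)  # seeded only so the sketch is reproducible

# Hypothetical inputs: a toy homoplasy distribution and 8 branch lengths
# that sum to 100, so branch 0 (length 20) is 20% of the total.
homoplasy_dist = {1: 4, 2: 2, 3: 3, 5: 1}
branch_lengths = [20.0, 10.0, 10.0, 15.0, 5.0, 15.0, 15.0, 10.0]

n_subs = draw_n_subs(homoplasy_dist, rng)
branches = pick_branches(branch_lengths, n_subs, rng)
print(n_subs, branches)
```

Because each draw is weighted by length, branch 0 is chosen first in roughly 20% of simulated loci, which matches the 20% example above.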

(2) The subsequent test is a little bit complicated to describe mathematically, but generally speaking it would be more correct to say that it represents the proportion of the total branch length (rather than the number of branches) that the genotype and phenotype spend in the same states.
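As a deliberately simplified illustration of "proportion of branch length" versus "proportion of branches", the Python sketch below assumes that each branch carries a single binary genotype state and a single binary phenotype state along its whole length. That assumption, the function name, and the data are all hypothetical; this is not the formula treeWAS uses, which, as noted above, is more complicated.

```python
def shared_state_proportion(branch_lengths, geno_states, pheno_states):
    """Fraction of total branch length on which genotype and phenotype match."""
    total = sum(branch_lengths)
    shared = sum(
        length
        for length, g, p in zip(branch_lengths, geno_states, pheno_states)
        if g == p
    )
    return shared / total

# Hypothetical tree with 8 branches (lengths sum to 100) and one state per branch.
branch_lengths = [20.0, 10.0, 10.0, 15.0, 5.0, 15.0, 15.0, 10.0]
geno = [1, 1, 1, 1, 0, 1, 0, 1]
pheno = [0, 0, 1, 1, 0, 1, 0, 1]

by_length = shared_state_proportion(branch_lengths, geno, pheno)
by_count = sum(g == p for g, p in zip(geno, pheno)) / len(geno)
print(by_length, by_count)  # 0.7 0.75
```

Here the states differ on 2 of the 8 branches, but those two branches are relatively long (30 of 100 units), so weighting by branch length gives 0.70 whereas counting branches gives 0.75.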

I hope that clears things up a little. Please don't hesitate to ask if you have any further questions. I'll be happy to help. Best, Caitlin.

DorothyTamYiLing commented 2 years ago

Hi Caitie,

Thanks for the explanation, I can understand better now.

Many thanks again! Dorothy