Question Regarding Effective Population Size

SriRaj34 commented 1 year ago

Hello,

I hope all is well! We are very interested in using phylonco, specifically the binary substitution model and error model. We first simulated data with a known mutation rate and population size using CellCoal, and would like to apply the model to the simulated data.

We simulated data according to exponential growth, and are using the following priors:

Strict molecular clock (with the mutation rate we simulated the data with, 1E-05)
Coalescent Exponential Growth Model

When we run the binary substitution model and error model with these priors, we end up with a final lambda value of 82. In our simulated data, we provided all sites (including invariable sites) and did not perform any ascertainment bias correction (with constantSiteWeights). Our effective population size is around 5000, which after correcting for diploidy (meaning this is 2x), would be an effective cell population size of 2500.

We have attached our test.xml file and log file for your reference, and would be grateful for any insight you have. Specifically, is there a prior we should be using for our lambda value that makes more sense? Thank you in advance!

Sri and Tamara

files_test.zip

alexeid commented 1 year ago

Thanks for your interest in our package. What is the true model for your simulation? Is your true substitution model binary, with a lambda parameter, or something else? What was the true growth rate and final population size? I assume that in your simulated data there was no sequencing/amplification error or allelic dropout?

alexeid commented 1 year ago

Just looking at your data, the vast majority are zeros, so lambda must be very large to obtain such an equilibrium distribution. If the inference were conditioned on starting at all zeros at the root, you could get a very different result. However, standard phylogenetic likelihood calculations integrate over every possible sequence at the root and assume that the root sequence is at equilibrium for the given continuous-time Markov chain (CTMC).
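To make the link between lambda and the proportion of zeros concrete, here is a rough sketch. It assumes a two-state rate matrix of the form Q = [[-1, 1], [lambda, -lambda]], i.e. a 0->1 rate of 1 and a 1->0 rate of lambda. That parameterisation is an assumption made for illustration and should be checked against the phylonco documentation. The function names are hypothetical.

```python
def stationary_distribution(lam):
    """Equilibrium frequencies (pi0, pi1) of a two-state CTMC.

    Assumes Q = [[-1, 1], [lam, -lam]] (an illustrative parameterisation,
    not confirmed against the package). Solving pi * Q = 0 with
    pi0 + pi1 = 1 gives pi0 = lam / (1 + lam) and pi1 = 1 / (1 + lam).
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    return lam / (1.0 + lam), 1.0 / (1.0 + lam)


def lambda_from_fraction_of_ones(frac_ones):
    """Invert pi1 = 1 / (1 + lam) to get the lambda implied by a
    given equilibrium fraction of ones."""
    if not 0.0 < frac_ones < 1.0:
        raise ValueError("frac_ones must be strictly between 0 and 1")
    return (1.0 - frac_ones) / frac_ones


pi0, pi1 = stationary_distribution(82.0)
print(f"lambda = 82 implies an equilibrium fraction of ones of {pi1:.4f}")
```

Under this assumption, a lambda of 82 corresponds to roughly 1.2% ones at equilibrium, so an alignment that is almost entirely zeros pushes the estimate towards large lambda.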

If you started your simulation at all zeros at the root, but then employed a time-reversible CTMC, then this simulation would not be consistent with the standard model assumptions for phylogenetic likelihood calculations. So the lambda will not necessarily match the true value in such a case. This reflects a form of model misspecification.
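A small numerical sketch of why the root assumption matters. It again assumes Q = [[-1, 1], [lambda, -lambda]] scaled by a clock rate, and it ignores any rate-matrix normalisation the software may apply. The elapsed time of 100 used below is an arbitrary illustrative value, not taken from the attached files, and the function names are hypothetical.

```python
import math


def prob_one_given_root_zero(lam, rate, t):
    """P(state = 1 at time t | state = 0 at the root).

    For the assumed Q = [[-1, 1], [lam, -lam]] scaled by `rate`, the
    two-state solution is pi1 * (1 - exp(-(1 + lam) * rate * t)),
    with pi1 = 1 / (1 + lam).
    """
    pi1 = 1.0 / (1.0 + lam)
    return pi1 * (1.0 - math.exp(-(1.0 + lam) * rate * t))


def prob_one_given_equilibrium_root(lam):
    """P(state = 1) when the root is drawn from the equilibrium
    distribution; this does not change with time."""
    return 1.0 / (1.0 + lam)


rate, t = 1e-5, 100.0  # t is an arbitrary illustrative value
for lam in (1.0, 82.0):
    from_zero = prob_one_given_root_zero(lam, rate, t)
    at_equilibrium = prob_one_given_equilibrium_root(lam)
    print(f"lam={lam:>5}: root at zero {from_zero:.6f}, "
          f"equilibrium root {at_equilibrium:.6f}")
```

With a root fixed at zero and a short elapsed time, the expected fraction of ones is close to rate * t for both values of lambda, so mostly-zero data says little about lambda. With an equilibrium root, the same data is only plausible when lambda is large.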

bioDS / beast-phylonco

Question Regarding Effective Population Size #36