ben18785 / Selection_simulations

Simulating Wright-Fisher, Moran and Yule processes.

Inference results #4

Closed ben18785 closed 5 years ago

ben18785 commented 5 years ago

@Armand1 Have a look at the html, which contains the inference results for three runs: negative, neutral and positive frequency dependence (you can open it in a browser). As you can see, in all cases the estimated selection coefficient and mutation rate match the actual values.

I am now going to move on to making the software able to produce each model.
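For concreteness, here is a minimal sketch of a Wright-Fisher generation with frequency-dependent selection. The sampling weight p^(1 + beta) and all function names are my assumptions about the parameterisation, not necessarily what the notebook uses, and mutation/innovation is omitted for brevity:

```python
import numpy as np

def wf_step(counts, beta=0.0, rng=None):
    """One Wright-Fisher generation: multinomial resampling of N copies with
    weight proportional to p^(1 + beta) for a variant at frequency p.
    beta < 0 favours rare variants, beta = 0 is neutral drift,
    beta > 0 favours common variants."""
    rng = np.random.default_rng() if rng is None else rng
    N = counts.sum()
    p = counts / N
    w = p ** (1.0 + beta)
    w /= w.sum()
    return rng.multinomial(N, w)

def simulate(n_variants=10, N=200, n_gen=100, beta=0.0, seed=1):
    """Run n_gen generations from an even start; returns a
    (n_gen + 1, n_variants) array of counts."""
    rng = np.random.default_rng(seed)
    counts = np.full(n_variants, N // n_variants)
    history = [counts.copy()]
    for _ in range(n_gen):
        counts = wf_step(counts, beta=beta, rng=rng)
        history.append(counts)
    return np.array(history)
```

Setting beta to -0.25, 0 and 0.25 would then correspond to the negative, neutral and positive frequency-dependence runs discussed above.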

Armand1 commented 5 years ago

@ben18785 I need to unpack your notebook that combines WF series generation and TS inference so that, having generated a bunch of series, I can systematically explore the behavior of the inferential model.

I have generated the series I want and put them in a separate csv file; I have also stripped out the generating functions, leaving only the functions necessary for TS inference, and put those in a separate notebook. But there's a problem: when I give my new data to data_stan <- fPrepareAll(aDF_short), it fails for reasons I can't work out. Is there a hidden assumption that, say, generations have to start at 1? Mine start at 4001 so that the populations are at quasi-stationary state.

I have uploaded my code, html output and data tagged with "AML's unpacking WF TS inference". I think it's something trivial --- would be grateful if you could look.

Also, if it's easy, can you wrap the Stan inference in a function that lets me see how good it is given a certain number of generations of data? I thought of looking at 2, 10, 25, 50 and 100 generations of data --- the kind of numbers one might have in reality. I imagine that the inference would be poor for 2 generations of data but OK for 100; it would be good to know. If not, I'll do it the hard way.
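The kind of wrapper meant here might look like the following sketch. `fit_fn` is a stand-in for the actual Stan inference call, and the column name `generation` is an assumption about the data layout; note it works even when generations start at 4001:

```python
import numpy as np
import pandas as pd

def evaluate_by_horizon(df, fit_fn, horizons=(2, 10, 25, 50, 100)):
    """Fit the model on only the first g generations, for each horizon g.
    `fit_fn` stands in for the Stan inference call; it must return a dict
    holding an array of posterior draws under the key 'beta'."""
    g0 = df["generation"].min()  # no assumption that generations start at 1
    out = {}
    for g in horizons:
        sub = df[df["generation"] < g0 + g]  # first g generations only
        draws = fit_fn(sub)["beta"]
        out[g] = {"median": np.median(draws),
                  "q05": np.quantile(draws, 0.05),
                  "q95": np.quantile(draws, 0.95)}
    return out
```

The medians and quantiles per horizon are then exactly what the error-bar plot discussed below needs.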

Armand1 commented 5 years ago

@ben18785 I have sort-of figured it out. Something in your several data-preparation functions requires either that variants be numbered from 1 or that generations be numbered from 1 (I think it's where the number of variants is calculated). But I only want to look at the tail end of populations that have been running for a couple of thousand generations and are at steady state --- in which case neither of these is true. Anyway, I have not altered your code; as a workaround I have renamed generations and variants starting from 1, and now it works. Results soon!
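That renaming workaround might look like this in pandas (the column names `generation` and `variant` are assumptions about the csv layout):

```python
import pandas as pd

def reindex_from_one(df, gen_col="generation", var_col="variant"):
    """Relabel generations and variant IDs so both run 1, 2, ..., as the
    data-preparation functions appear to assume; counts are untouched."""
    out = df.copy()
    # Shift generations so the earliest one becomes 1 (e.g. 4001 -> 1).
    out[gen_col] = out[gen_col] - out[gen_col].min() + 1
    # Map each distinct variant ID to a dense 1..K code, in order of appearance.
    out[var_col] = pd.factorize(out[var_col])[0] + 1
    return out
```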

Armand1 commented 5 years ago

So, here are some results. We look at beta = 0 and beta = -0.25 --- moderately strong selection. I want to know how the test performs when we have a limited number of generations of data. Well, the answer is a bit unclear. You obviously need more than 10, but at 25 you can reject neutrality for the selected population. I am a bit surprised to find that, although the estimate --- the median of the posterior distribution of beta --- does generally improve with the number of generations assayed, it does not do so in a very systematic fashion. Would more iterations or chains in the Bayesian sampling improve matters? (My language is not quite right --- but you know what I mean.) If only I had a good book on the subject...

Will investigate a bit. What do you think of the plot? How can it be improved? I have put the 5% and 95% quantiles as error bars. Is that conventional? timeseriesplot.pdf
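(For what it's worth, the 5% and 95% quantiles amount to a 90% equal-tailed credible interval, which is a common though not universal convention. A sketch of how they come out of posterior draws, using a stand-in normal "posterior" rather than real Stan samples:)

```python
import numpy as np

# Stand-in posterior draws for beta; a real run would use the Stan samples.
draws = np.random.default_rng(0).normal(-0.25, 0.05, size=4000)

median = np.median(draws)
lo, hi = np.quantile(draws, [0.05, 0.95])  # 90% equal-tailed credible interval
```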

Armand1 commented 5 years ago

Here's a question: is it the number of time points assayed that is critical, or the number of generations? If you assay only at, say, t = 50, 100 and 150, is the power of the test more comparable to 150 generations of data or to 3 time points? Indeed, does the test assume that you have assayed every generation? If it is structured like your BCI test, then presumably not. It would clearly be more useful if we could assay at a few generations spaced far apart.

Armand1 commented 5 years ago

So, a quick test using just that data (three time steps 50 generations apart) gives a tight posterior distribution, as if there were lots more data, but, perhaps as expected, the estimate of beta is too large (or rather too small --- it's very negative, nearly -1, when it should be -0.25). Moreover, you can't just divide it by 50 to get the right estimate; some other correction factor must be applied. Still, food for thought.

Incidentally, the inferential model seems constrained to give estimates of beta > -1 and < 1. Why is that? It's a slope and could, in principle, be steeper, no?

ben18785 commented 5 years ago

Just at the airport, so briefly: I am not surprised the test doesn't work in those cases; it's only meant for consecutive generations. With more generations between samples, the apparent effect of selection is accentuated, because you are dealing with the aggregate effect over the gap. The same applies to using the test on non-consecutive generations (i.e. with N offspring born in between) in the Moran model.
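This compounding can be seen with a deterministic, drift-free two-variant sketch. Under the weight p^(1 + beta) (my assumption about the parameterisation), the expected one-step map is linear in logit space, so a 50-generation gap multiplies the per-generation effect roughly 50-fold, and a one-step model fit to such jumps will overstate |beta|:

```python
def step(p, beta):
    """Expected frequency after one generation with sampling weight
    p^(1 + beta); two variants, no drift, no mutation."""
    w1 = p ** (1.0 + beta)
    w2 = (1.0 - p) ** (1.0 + beta)
    return w1 / (w1 + w2)

beta, p0 = -0.25, 0.1
one_gen = step(p0, beta) - p0      # change over a single generation
p = p0
for _ in range(50):                # change over a 50-generation gap
    p = step(p, beta)
fifty_gen = p - p0
# The 50-generation jump dwarfs the one-step change, so a one-step model
# fit to 50-generation-apart observations infers a much stronger beta.
```

Note also that the relation is not simply "divide by 50": with negative frequency dependence the frequency has essentially reached the 0.5 fixed point by generation 50, which is consistent with a crude division not recovering the right estimate.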


Armand1 commented 5 years ago

Here's a figure that I quite like. It shows estimates of beta and mu for neutrality and two selection levels, given up to 100 generations of data. I am a bit puzzled that the -0.5 treatment does not quite seem to be converging on the true value; I wonder if that needs to be looked at? Incidentally, I boosted the number of iterations to 10k.

time series combo.pdf