ben18785 / Selection_simulations

Simulating Wright-Fisher, Moran and Yule processes.
1 stars 1 forks source link

what is the selection coefficient? #3

Open Armand1 opened 5 years ago

Armand1 commented 5 years ago

Previously, we had a model of selection that looked like this:

x<-sigmoid(logit(1-mu0)+aBeta * p_g0)

where p_g0 is the frequency of a variant in a generation, mu is mutation rate and aBeta is a parameter that shapes the sigmoidal selection function. If we plot "selection" v. p_g0 we see this:

Now, what exactly, is "x" here? It's not a selection coefficient since selection coefficients always depend on the relative fitness of two variants. It's not the frequency of the variant in the next generation. I think that it is an absolute fitness, W.

An absolute fitness, W, is defined as the "proportional" change in abundance over one generation attributable to selection. So, the way this is modeled, W does not change with frequency in neutrality, W decreases with frequency in -5 and -10 and (weirdly) W does not increase at all (or negligibly) with for +5 and +10. But the last is an idiosyncratic product of this scheme.

Absolute fitness.pdf

Let's calculate selection coefficients. Now, as we said S is always relative to some other variant. Let's look at S relative to a variant that has a frequency of 0.5.

s = W_1/W_2 -1. where W_1 and W_2 are the absolute fitnesses.

s_relative to an intermediate allele.pdf

So THAT makes sense. For negative frequency dependence, S is positive at low frequencies and negative at high frequencies. Very much so for -10, less so for -5. The positive selection coefficients are, again, invisible against neutrality at this scale.

Now, theory says that selection overpowers drift when abs(s)> 1/Ne. So, given that Ne =1000, we need s to be > 0.001. Assume that a new variant arises at a frequency of 1/1000. Then, relative to an existing intermediate frequency variant, such a variant will have the following ABSOLUTE selection coefficients:

-10: 0.147 -5: 0.011 0: 0.000 5: 0.0009 10: 0.0009

Which seems to tell us that our negative frequency dependent selected populations had strong enough selection to overpower drift. But our positive selection scheme did not.

BEN: does this make sense? If so, then what we need is something equivalent for the NEW selection scheme. And we need to put in parameter values that are strong enough to change the frequencies. Also, we need to recover selection values.

Now, it is unclear to me what the "s" in propto (1 + s * f) is going to be. I think it is neither a selection coefficient NOR an absolute fitness. It is, again, just a parameter that shapes the function.

ben18785 commented 5 years ago

@Armand1 Thanks for the test issue!

So the one thing I'm not sure about in your above is your W. This is just the probability that ONE parent produces offspring in the next generation. To get the abundance of a variant in the next generation, you need to multiple by the frequency. So do we want?

W <- sigmoid(logit(1-mu0)+aBeta p_g0) p_g0

Following your above, in the new scheme, the abundance in the next generation (on average) for ONE single parent is:

W <- k (1 - mu0) (1 + aBeta* p_g0)

where k is a constant that ensures that the probability has appropriate bounds.

Now calculating the relative selection:

s = W1 / W2 - 1 = (1 + aBeta p_g1) / (1 + aBeta p_g2) - 1

Now relative selection is just linear in p_g1 (holding p_g2 constant) so both positive and negative selections have symmetric effects.

What do you think?

@Armand1 I think that the 's' in propto (1 + s * f) is just a parameter that shapes the function, yes.

Armand1 commented 5 years ago

Let's first deal with the question of how to calculate W and s using the old selection function. We're going to calculate the absolute fitnesses of a rare allele (p_g0 = 0.001) and a common one (p_g0=0.5). We are going t assume mu=0.001 and aBeta = -10. I said:

x<-sigmoid(logit(1-mu)+S * p_g0)

And I also said x = W, the absolute fitness

Let's try some numbers

W_rare<-sigmoid(logit(1-0.001)+-10 0.001) = 0.99899 W_common<-sigmoid(logit(1-0.001)+-10 0.5) = 0.8706541 s=(W_rare/W_common)-1 s =0.1474016

Ben says, no, x is not the absolute fitness (or is so for only one variant -- though I am not sure what he means by that). But, OK, let's just call it x. And then we'll multiply it by p_g0 to get the frequency in the next generation and calculate W from that.

x_rare<-sigmoid(logit(1-0.001)+-10 0.001) = 0.99899 p_g1_rare<-x_rare0.001 = 0.00099899 W_rare = p_g1_rare/0.001 = 0.99899

x_common<-sigmoid(logit(1-0.001)+-10 0.5) = 0.8706541 p_g1_common<-x_common0.5 = 0.4353271 W_common = p_g1_common/0.5 = 0.8706541

s=W_rare/W_common-1 s =0.1474016

So the two ways of calculating s work out exactly the same. Why? Because, by Ben's method, when I calculate p_g1 of some allele, I multiply "x" by p_g0. And, then when I go on to calculate W of that allele I divide p_g1 by p_g0. So p_g0 cancels out.

So, I really do think that "x" is the absolute fitness, W. What's wrong with that?

Armand1 commented 5 years ago

Looking at the new selection function

This the new, improved selection function, and we want to estimate W and s for some variants as before.

This is the selection function itself

x<- k (1 - mu0) (1 + aBeta* p_g0)

We'll use it to get the frequency the next generation, p_g1

p_g1<-selection*p_g0

Then we will get W over a whole range of p_g0

W = p_g1/p_g0

But, wait, what value should I use for k? It's a constant. I'll just simulate it with k=1 and see what happens.

newselectionfunctionW.pdf

So, that looks nice. As Ben says, negative and positive FD are linear and symmetrical. Great!

Now, as before, I am going to look at how the selection coefficient of an allele varies as a function of p_g0, when in competition with an intermediate allele.

s<-W_p_g0/W_0.5-1

selection coefficient new scheme.pdf

That doesn't look right. S is zero for neutral, so that's OK. And S increases as function of frequency for +5 and +10, so that's OK. But surely S should decline for -5 and -10, that is, be positive at low frequencies and negative a high frequencies. What is going on?

Maybe it doesn't like values >1 or <-1. OK. lets try putting in aBetas of

aBeta<-c(-0.1,-0.05,0,0.05,0.1)

newselectionfucntion_selectioncoeffs.pdf

Oh! It likes that. Now s is going all in the right direction, with all treatments s=0 at pg_0=0.5,just as we require. It seems that we have a nice selection function and a nice way of interpreting it.

One last thing: for k=1 (unclear how k affects things --- shifts the Ws up and down but not I think s) These are the selection coeffiencts for a new mutant, relative to a variant at p_g0=0.5

-0.10: 5.252632e-02 -0.05: 2.558974e-02 0.00: 2.220446e-16 0.05: -2.434146e-02 0.10: -4.752381e-02

So with aBeta = -0.05, s=0.025. Which is 25x lot bigger than 1/1000, so I think that means it could invade a population of N=1000 in which the other guys were at intermediate frequency. So we have some a priori plausible values that should give us a selection response. Merry Christmas!

Armand1 commented 5 years ago

A query about the temporal units of beta. For BCI we estimate beta and then I want to convert it into a selection coefficient. Now, selection coefficients are implicitly per generation but is that true for beta? I think not. I think it's per time step which varies between 5 and 10 years, call it 6.5 years average or something. Now, Ryan Chisholm says that tropical tree species have a generation time of something like 30 years. So, does that not suggest we should do something like:

beta(per generation) = 6.5/30 * beta(estimated)?

One reason I say this is that the selection coefficients for frequency independent selection, calculated from the betas, are very strong. If you run a simulation then the fittest species (number 48) goes to fixation in about 10 time steps. Now that simulation, which is just a basic population genetic deterministic sim, assumes that each time step is a generation. But it is not clear to me that the selection coefficients are calibrated in terms of generations.

Does this make sense?

ben18785 commented 5 years ago

Yep, good point and think that the sort of correction you propose makes sense. We should probably include this correction in the inference as well as in the forward simulations too, right?

Armand1 commented 5 years ago

Good. glad that I don't misunderstand this. We can just apply it afterwards, no? But perhaps you're right given that the time steps aren't totally homogenous --- they're something like 5 years, 5,5,5,5,10.

Note that the 30 years per generation number is made up. I doubt that we can get generation times for all species, but it should be possible to justify it more than Ryan says so.

Armand1 commented 5 years ago

getting selection coefficients for BCI data

frequency independent selection.

First we get the absolute fitness, W, of the ith species:

W_i = k \cdot (1 - \mu) \cdot (1 + \beta_i)

where $k$ is a constant and $\mu$ is the estimated immigration rate.

A selection coefficient of species is always based on the absolute fitness of a species relative to another species, typically the most fit one:

w_i=\frac{Wi}{W{max}}

Where W_{m} is the absolute fitness of the most fit species. The selection coefficient is then:

s_i=w_i-1

frequency dependent selection

Now, since we cannot estimate beta for particular species, we've assumed that all species show the same dependence of absolute fitness on frequency. But even so species will have different absolute fitnesses since they differ in average frequency. So we evaluate the absolute fitness of each species at its estimated average frequency, $\bar f_i$:

W_i=k \cdot (1 - \mu) \cdot (1 + \beta \cdot \bar f_i)

Now we could convert these into selection coefficients exactly as above. But, in fact, frequency dependent selection, it turns out that our estimates of beta are negative. That implies that the rarest species are the most fit; and the most common species are the least fit. I think it's more intuitive to estimate s relative to the least fit species, since then it tells us what the selection advantage of a rare invader has relative to the most common species.

w_i=\frac{Wi}{W{min}}

s_i=w_i-1

In our current model, for example, it tells us that a species which is very rare (f ~ 10^-5) has a 3-4% advantage over the most common one (f = 0.15). This is without taking the interval/generation conversion factor into account.

Given that, in this procedure we do, in fact, estimate a different s for each species under frequency dependent selection, it would seem nice if we can do the estimation of s in the model, rather than tacking on afterwards, so that we can get estimates of the uncertainty of s that arises from the uncertainty of the frequencies.

In fact, I am not sure how to calculate uncertainties in s for either case since it depends on the relative fitness, wi, which is a ratio of two quantities, both of which have some uncertainty associated with them. (I have done it --- but just by ignoring the uncertainty in the denominator --- the W{max} or W_{min} --- when calculating relative fitness, which isn't quite right.)

Armand1 commented 5 years ago

Above I said that https://github.com/ben18785/Selection_simulations/issues/3#issuecomment-

*beta(per generation) = 6.5/30 beta(estimated)**. But I think this is upside down. Here is my reasoning:

We have beta_estimated, which is per census interval. To get it per year we divide that value by the number of years between censuses. This is 5.5 (not 6.5):

beta(per year) = beta(estimated)/5.5

Now, we want to get beta per generation.

*beta(per generation)= beta(per year) 110**

generation time being 110 years (not 30)

So,

*beta(per generation)= beta(estimated) /5.5 110**

Now, actually if we use the same generation time for all species, 110 years, then none of this affects the calculation of absolute fitness, W since it just scales beta, and hence the absolute fitnesses, _W_by a constant. And since relative fitness, w, and selection coefficients, s , are based on the relative values of W nothing changes. So the selection coefficients that I have estimated are fine.

But we could also allow generation times to vary among species. If they do, it will have an important effect on the dynamics. Here is why. Suppose we find that species x has a 1% selective disadvantage per generation relative to the most fit species; and species y has a 10% selective disadvantage per generation. If we assume that they have the same generation times, we would say that species y will go extinct more quickly than species y. But if now I tell you that species x has a generation time of 1 year, but species y has a generation time of 1000 years, it seems that species x will go extinct first in chronological time.

We have generation times for each species -- and they vary by nearly two orders of magnitude. So we should probably use them.

Armand1 commented 5 years ago

Looking back at the models, I think we need the following parameter estimates for completeness --- so that I can do anything I want with them.

In terms of intervals Beta_i (with CIs)

Normalized by year W_i (absolute fitness, with CIs) w_i (relative fitness, with CIs) s_i (selection coefficient, with Cis)

Normalized by generation W_i (absolute fitness, with CIs) w_i (relative fitness, with CIs) s_i (selection coefficient, with Cis)

And:

mu --- the immigration rate.

Note the following

1) the relative absolute fitnesses will change depending on whether you scale the betas by generation or not. So the fittest species will change too. And since the fittest species is our standard, the relative fitnesses and selection coefficients will change too. That's the way it should be: the best species in terms of generations will not be the best in terms of years.

2) we want the relative fitnesses and selection coefficients scaled to the median value of the fittest species (not the most extreme value). That is, we want the estimated relative fitness of the fittest species to be 1 and estimated selection coefficient to be zero. Not negative as it currently is.

3) To get the normalized by year selection coefficients, divide by 4.7 not 5.5 since we now have 2005 data.

Thinking about it, I was not so foolish to vacillate between annual and generational selection coefficients. That's because they have different uses.

If we want to compare selection coefficients internally to BCI then we need annual selection coefficients. They will most directly predict what is happening in the system: which species are winning and which are losing.

But if we want to compare selection coefficients externally then we need generational selection coefficients. For example, if we want to use population genetic theory results. Or else just say selection is "strong" or "weak" compared to estimates from other systems. It is striking to me that our yearly selection coefficients are really strong. But our generational selection coefficients are much more modest.

Armand1 commented 5 years ago

Hold off re-running the simulations until I give you a new set of data. Reading something in the literature tells me that the abundances of one or two species need to be adjusted in a particular way. Will give you a new dataset by today pm.

ben18785 commented 5 years ago

Hi Armand,

Ok, will do! Thanks again for the weekend BBQ -- as always, swashbuckling fun! Claire very much enjoyed herself.

Best,

Ben

On Mon, Jul 22, 2019 at 3:26 PM Armand1 notifications@github.com wrote:

Hold off re-running the simulations until I give you a new set of data. Reading something in the literature tells me that the abundances of one or two species need to be adjusted in a particular way. Will give you a new dataset by today pm.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ben18785/Selection_simulations/issues/3?email_source=notifications&email_token=ABCILKCIXPRJKEEOLH4CKW3QAW7SXA5CNFSM4GL7IU7KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2QDBLA#issuecomment-513814700, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCILKEB5YBEIWJUHGD4ITLQAW7SXANCNFSM4GL7IU7A .