HighlanderLab / SIMplyBee

SIMplyBee R package extends AlphaSimR for simulating honeybee populations and breeding programmes
http://www.simplybee.info/
Other
0 stars 5 forks source link

Demographic model #125

Closed janaobsteter closed 2 years ago

janaobsteter commented 2 years ago

image This is the last model I had modelled by the Wallberg paper. The Ne of the ancestral population is dubious since they don't specify if directly. Another missing link the CO ancestral population. I will have another look at the paper.

janaobsteter commented 2 years ago

image This is the main source of the data in the paper:

image

`

"The effective population size was computed from the levels of sequence variation in each population. The mutation rate based on the divergence between A. mellifera and A. cerana (5.27 × 10–9) was used to transform ages into units of time, assuming a generation turnover of 1 year. "

`

janaobsteter commented 2 years ago

And these are the PSMC results with a) complete data set b) transversions only image

janaobsteter commented 2 years ago

Also important:

Here we used a genealogical concordance method23 to estimate the divergence times between the A, C and M groups in the range of 0.59–0.98 × 1.5Ne generations.

gregorgorjanc commented 2 years ago

This is the last model I had modelled by the Wallberg paper. The Ne of the ancestral population is dubious since they don't specify if directly. Another missing link the CO ancestral population. I will have another look at the paper.

How did you get the Ne of 350K on the top of your slide? Average? Hmm, (500+210+300+189)/4= ~300, looks like not.

Let's not get perfect here. Let's focus on only M and C lineage for now - it's enough for what we need just now. With this in mind, we need ancestral Ne for these two and the timing of their divergence/split. Any good sources on this?

Looking at the left plot in https://github.com/HighlanderLab/SIMplyBee/issues/125#issuecomment-1015439032 it seems that Ne is quite constant for these two lineages, though only in the range ~50K, while for the right plot with transversions only we see again way lower estimates than with other methods, but up to ~15x increase (to ~10-15x10^4, so 100,000-150,000) around 10^5 years ago. Hmm, this is not that bad compared to other estimates, but other estimates are supposed to be contemporary. Argh ... Would it be possible, they have made a typo, and the 10^4 should in fact be 10^5 in that plot? Oh well, who reviewed that paper!?

What about the following: 1) M north has the contemporary Ne of ~160K 2) M south has the contemporary Ne of ~180K and they diverged ~13K year ago with the ancestral Ne of ~210K

3) C carnica has the contemporary Ne of ~170K 4) C ligustica has the contemporary Ne of ~170K and they diverged ~25K year ago with the ancestral Ne of ~190K

C and M have diverged ~300K years ago with the ancestral Ne of ~200K.

I am not convinced these are indeed contemporary Ne!

So, they all have Ne of ~170K, give or take. The only difference is time of divergence. We could do the following. Simulate three bee populations, say M, C_carnica, and C_ligustica, with contemporary Ne of 170K (I still find this crazy high, for any domesticated population, but maybe this is OK!?), and divergence between C_carnica and C_ligustica at 25K years ago with Ne of 190K, and divergence between C and M at 300K years ago with Ne of 200K. Even if you do one step change at 25K years ago it will be fine given that we are going from ~200K to only ~170K! @janaobsteter can you implement this in simulateHoneybeeGenome() for now?

gregorgorjanc commented 2 years ago

See #45 - just linking up similar issues

janaobsteter commented 2 years ago

So, I got the 350,000 ancestral, since the Ne at the most ancestral splits (A-C, A-M, A-O)* was around 350,000 image

Regarding the numbers, it was not a typo, I contacted the authors at that point and they told me the PSMC is very sensible to any missing data.

Regarding the plan proposed above - I do agree we simulate just the two lineages, but then I think we should include the C-O split or at least the M-lineages split, to be consistent and to include the entire demographic history of the two lineages.

gregorgorjanc commented 2 years ago

As discussed, we need something like possible honeybee demography, for now, to get our other simulations up and running and once you infer a better demography we implement that.

janaobsteter commented 2 years ago

OK, so I think I've managed to decipher the ms parameters and build the models. I've simulated four subpopulations it was just easier for me for now): mellifera S (M), mellifera N (M), carnica (C), and ligustica (C). First I've merged the M lineage subspecies (-eJ flag, subpopulation 2 into 1) and changes the Ne, and then did the same for the C lineage (subpopulation 4 into 3). At the end, I merged the two lineages into one (subpopulation 3 into 1) ancestral population. What I still don't know it the nsam and nrep, but as @gregorgorjanc said, it might be some diferences between ms and MaCS.

image

janaobsteter commented 2 years ago

For the M lineage, there two splits of the subspecies - but I only accounted for one (13kya), since the second one at 37kya kept the same Ne. Also, with this Wallberg paper and the "point" model, we only have the Ne at splits or merges, but we don't know what happened to the subspecies in the "mean" time

gregorgorjanc commented 2 years ago

@janaobsteter MaCS call is

macs nsamp nbp ...

where nsamp is number of simulated haplotypes/chromosomes/DNA_fragments and nbp is the size of the simulated haplotype/chromosome/DNA_fragment and ... are the other ms flags

I never understood what nrep is in ms.

Yeah, we don't have any information about what happens between the splits so we will have to go with an instantaneous change (via -en).

You have one time point too much given that there is no change in Ne in the M lineage.

Should -I 2 ... be -I 4 ...?

I am struggling to keep track / parse the whole ms call in the above screenshot - could you somehow partition/annotate it @janaobsteter ?

janaobsteter commented 2 years ago

Sure! So, if these are the events of the two lineages, C and M,

image

This is the model

image

gregorgorjanc commented 2 years ago

@janaobsteter very clear now - great! I am now looking at runMacs2(), but I think you won't be able to implement this model with it - I think you have to use runMacs(manualCommand=..., manualGenLen=...) - can you try it out and report progress here?

janaobsteter commented 2 years ago

@gregorgorjanc , I am struggling with this a bit. In AlphaSimR, the runMacs() function runs this line if the mannualCommand is given: command = paste0(popSize," ",manualCommand," -s ")

But I'm not quite sure what this manual command should include ... Certainly not the number of individuals, but does it include the number of segSites? For other species, the command starts like this (example CATTLE): "100 1E8 -t 9E-6 -r 3.6E-6 -eN 0.011 1.33...". So popSize nSegSitesand then ms commands. I can't try it on my computer since Rstudio just crashes, but I am running the script below:

nInd = 200
nSegSites = 1e08
nChr = 1
genLength = c(6.6332538, 3.958019952, 3.282286245, 3.6996284759999996, 2.9739453740000004, 3.771289624, 3.322495332, 2.65789689, 3.039244146, 3.065292896, 3.3195778000000002, 2.441017608, 2.639454948, 2.625027132, 2.107127594, 1.650385296)

nMelS = 50
nMelN = 50
nCar = 50
nLig = 50

command = paste0(nSegSites, " -t 0.002312 -r 0.15776 -I 4 ", nMelS, " ", nMelN, " ", nCar, " ", nLig, " -ej 0.019 2 1 -en 0.019 1 1.235 -ej 0.037 4 3 -en 0.037 3 1.118 -ej 0.441 3 1 -en 0.441 1 2.059")

runMacs(nInd = nInd,
        nChr = nChr,
        segSites = nSegSites,
        manualCommand = command,
        manualGenLen = mean(genLength))
gregorgorjanc commented 2 years ago

Yeah, we are calling an internal C++ function and if we form the command wrong it will crash.

Looking at the runMacs() code I see

macsOut = MaCS(command, segSites, inbred, ploidy, nThreads, seed)

so segSites should not be a part of command.

gregorgorjanc commented 2 years ago

Digging further, I see

Rcpp::List MaCS(Rcpp::String args, arma::uvec maxSites, bool inbred, 
                arma::uword ploidy, int nThreads, Rcpp::StringVector seed){
...
  macsOutput = runFromAlphaSimR(args+seed(chr));
...

So, the command is args apart from seed = everything that is passed to an usual MaCS call, which is

macs nHap nBp -t theta ...

So, I think command should be nHap nBp -t theta ...

janaobsteter commented 2 years ago

Yes, this is what I thought as well - but it keeps crashing. Maybe I can show this at today's meeting

janaobsteter commented 2 years ago

image This is another estimate of the current Ne from Dogantzis et al., 2022