bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr
Other
54 stars 5 forks source link

Controlling mutation rate? #140

Closed ivanscotti closed 1 year ago

ivanscotti commented 1 year ago

Dear Martin,

I’m trying to run a simulation using slendr but there is a parameter that I can set in SLiM, but for which I see no argument in the compile_model() or slim() functions in slendr: mutation rate.

How do you control it in the package?

many thanks for your help, and congrats for the very useful tool!

Cheers

Ivan

bodkan commented 1 year ago

Hi @ivanscotti,

Thanks for being interested in slendr!

The answer to your question is very simple -- although it might surprise you if you worked with more traditional simulation approaches before -- there is no such parameter for compile_model() or slim() functions!

The reason for this is that slendr embraces as its sole simulation output a so-called "tree sequence", sometimes also called ARG (ancestral recombination graph). There is now a wealth of online resources on this topic, so I will just leave you with the most important website, which is the homepage of the tskit project. In particular, if you've never heard about a tree sequence before, perhaps you could start here. For more rigorous, in-depth introduction I recommend a paper by Kelleher et al., 2018 and Haller et al., 2018. Yes, the latter is our own SLiM hero, Ben Haller. ;)

Going into the topic of tree sequence recording and the breakthrough it has been for population genetics in greater detail here would be a bit too much (you'll be much better served with the links I provided), but here's the bare minimum you need to know for slendr itself:

Imagine the following trivial slendr model (I will focus on msprime slendr models because they run in an instant and are simpler, but the same will apply to more complex slim simulations using slendr):

library(slendr); init_env()

pop <- population("pop", time = 10000, N = 100)
model <- compile_model(pop, direction = "backward", generation_time = 1)

From this model, we can simulate a tree sequence using slendr's msprime back-end engine (you could also use slim() here):

ts <- msprime(model, sequence_length = 10e6, recombination_rate = 1e-8)

If you inspect the tree sequence object (which is nothing more than a tskit Python object underneath!), you'll se the following in your R terminal:

> ts
╔═══════════════════════════╗
β•‘TreeSequence               β•‘
╠═══════════════╀═══════════╣
β•‘Trees          β”‚        232β•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Sequence Lengthβ”‚   10000000β•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Time Units     β”‚generationsβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Sample Nodes   β”‚        200β•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Total Size     β”‚   69.0 KiBβ•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•§β•β•β•β•β•β•β•β•β•β•β•β•
╔═══════════╀════╀═════════╀════════════╗
β•‘Table      β”‚Rowsβ”‚Size     β”‚Has Metadataβ•‘
╠═══════════β•ͺ════β•ͺ═════════β•ͺ════════════╣
β•‘Edges      β”‚1216β”‚ 38.0 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Individualsβ”‚ 100β”‚  2.8 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Migrations β”‚   0β”‚  8 Bytesβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Mutations  β”‚   0β”‚ 16 Bytesβ”‚          Noβ•‘   <---- no mutations!
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Nodes      β”‚ 582β”‚ 15.9 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Populationsβ”‚   1β”‚222 Bytesβ”‚         Yesβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Provenancesβ”‚   1β”‚  1.3 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Sites      β”‚   0β”‚ 16 Bytesβ”‚          Noβ•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•§β•β•β•β•β•§β•β•β•β•β•β•β•β•β•β•§β•β•β•β•β•β•β•β•β•β•β•β•β•

Sure enough, there are no mutations, as you've noticed.

Now -- and this is why tree sequences are amazing -- under neutrality, the coalescent process (which generated the genealogies along the genome) is completely decoupled from the mutation process. This means we can first generate the trees, and then just sprinkle mutations randomly on each branch proportionally to the length of that branch. In tskit (or rather it's slendr interface), we can do this with the function ts_mutate():

ts_muts <- ts_mutate(ts, mutation_rate = 1e-8)
> ts_muts
╔═══════════════════════════╗
β•‘TreeSequence               β•‘
╠═══════════════╀═══════════╣
β•‘Trees          β”‚        232β•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Sequence Lengthβ”‚   10000000β•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Time Units     β”‚generationsβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Sample Nodes   β”‚        200β•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Total Size     β”‚   84.5 KiBβ•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•§β•β•β•β•β•β•β•β•β•β•β•β•
╔═══════════╀════╀═════════╀════════════╗
β•‘Table      β”‚Rowsβ”‚Size     β”‚Has Metadataβ•‘
╠═══════════β•ͺ════β•ͺ═════════β•ͺ════════════╣
β•‘Edges      β”‚1216β”‚ 38.0 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Individualsβ”‚ 100β”‚  2.8 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Migrations β”‚   0β”‚  8 Bytesβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Mutations  β”‚ 244β”‚  8.8 KiBβ”‚          Noβ•‘   <---- we have mutations!
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Nodes      β”‚ 582β”‚ 15.9 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Populationsβ”‚   1β”‚222 Bytesβ”‚         Yesβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Provenancesβ”‚   2β”‚  2.1 KiBβ”‚          Noβ•‘
β•Ÿβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•’
β•‘Sites      β”‚ 244β”‚  6.0 KiBβ”‚          Noβ•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•§β•β•β•β•β•§β•β•β•β•β•β•β•β•β•β•§β•β•β•β•β•β•β•β•β•β•β•β•β•

The (still a little rough) tutorial on the slendr website goes into this in more detail, and shows how you can compute population genetic statistics on such tree sequences. That said, it turns out that it is possible to compute the expected value of almost any practically useful population genetic statistic directly from the tree sequence genealogies themselves, without the need for any mutations. See this awesome paper for much more detail on the topic and this more gentle overview in the tskit documentation.

You should also take a look at our slendr paper which was just recently reviewd and recommended by PCI EvolBiol -- you can read it here. There are a couple of concrete examples and our reasoning for the decision to use tree sequences as outputs.

Good luck with your modelling!

[Given that this answers the original question, I will close this issue now. Hopefully others will find this quick overview also useful.]