bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr
Other
54 stars 5 forks source link

why do the number of simulated individuals differ between the same model run by SLiM and msprime? #94

Closed dangliu closed 2 years ago

dangliu commented 2 years ago

Hey Martin,

This question came from reading your tutorial: https://www.slendr.net/articles/vignette-07-backends.html when you output the slim_nogf and msprime_nogf, I noticed the number of individuals in slim_nogf is 30211 while the one in msprime_nogf is 104, could you explain why?

Thanks, Dang

bodkan commented 2 years ago

In the msprime simulation in the tutorial you linked, we explicitly sampled 50+50+4 individuals using the schedule_sampling call. The way coalescent simulation works is that it builds the genealogical histories of only the samples of interest backwards in time. In this case, we start with 208 chromosomes (2*104) and msprime builds the genealogical history for just those 104 individuals. Importantly, unlike the SLiM simulation discussed in the next paragraph,, the internal nodes of those genealogies do not have individuals associated with them.

SLiM works differently. As a forward simulator, it explicitly simulates the history of every single individual that has ever lived, generation by generation. Because of the specifics of the internal tree sequence recording, simplification and other more advanced things (that you can read about here) that happen throughout the course of the forward simulation, we end up not just with the 104 sampled individuals but some ancestral nodes (unsampled chromosomes of ancient ancestral individuals) are also marked as individuals in the tree sequence table.

Basically, the bottom line is, that for technical reasons we end up with more individuals in the SLiM tree sequence simulated here than just those that were explicitly sampled. That said, both tree sequences contain exactly 104 sampled individuals (i.e. 208 sampled chromosomes) that are truly marked as sampled. In the SLiM tree sequence, all the other individuals are so-called "retained", they are not samples with their full genomes recorded.

If that's confusing, that's pretty normal (I hope, because it really confused me). It took me many reads of the tutorial to understand how tree sequence recording works for SLiM simulations and get what exactly is going on, and to understand the terminology about nodes-vs-individuals but also about remembered-alive-retained individuals. I encourage you to read this if you want to know more.

Because that documentation I just linked contains the answer, I will close this now. Feel free to reach out to the tskit folks with the specifics about tree sequence recording.