bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr
Other
54 stars 5 forks source link

A few questions regarding tspop integration into slendr #146

Closed bodkan closed 9 months ago

bodkan commented 9 months ago

Hello @gtsambos and @petrelharp,

Thank you again for putting together tspop!

As you know, I’ve been working on using it inside slendr (perhaps adding a new function ts_tracts()to the wider array of tree-sequence R interface functions already present). I have a few questions that I was hoping you might help me clarify. I'm sorry this is so long -- lots of it is background information, the questions themselves (1. and 2. below) are hopefully simple though.

First of all, I decided to open a GitHub issue here rather than in the tspop repo because given that slendr supports demographic models based on a very strict subset of all possible msprime or SLiM simulations, it's quite opinionated (and restrictive) in which kinds of models it supports. As a result, I suspect that even the tspop functionality I'm working with will be tailored specifically for slendr models (at least for now) which are much simpler to everything one might do with pure msprime or SLiM, and so my questions might not apply in the much wider msprime/tskit/tspop context.


I thought it might be easier to frame my questions in the context of an example slendr model like this one (a typical model my users deal with). Let's say that I have two archaic human populations (NEAnderthals and DENisovans), one AFRican group, then a group representing the ancestors of nonAFRicans, which at some point diverges in two populations representing EURopeans and PAPuans. The NEAnderthal population contributed gene flow into the ancestors of EURopeans and PAPuans at some point in the past (the nonAFRican population), and later there was another introgression wave from DENisovans into PAPuans only. Let's say those gene flows took a few thousand years (so are effectively spread across a number of generations). Here's a graphic of the model as built by slendr:

image

Now, I can take this slendr model and run it through its msprime() back end script to get a tskit tree sequences. The script is a normal msprime Python script, except that it doesn't implement a "hardcoded" single model, but is designed to load a slendr model (tables of split times, gene flows, resizes, what have you) and run an msprime simulation from that. Do not try to read that script! Only a few lines of codes are relevant, and I link to them below. The output of that script/model is a normal tskit tree sequence which records some number of diploid individuals, let's say 25 EURopeans and 25 from PAPuans at the end of the simulation ("present day").

What I'm after is adding an easy way for slendr users to extract coordinates of true ancestry tracts from the simulated tree sequence in R. This is what the ts_tracts() R function is designed to do.

For instance, the user might want to:

We do this sort of thing all the time here and elsewhere, not just for humans, of course, and having a single function solution to this would be incredible. This is why I was so excited to read about tspop/link-ancestors!


My questions:

1.

In my msprime/slendr script, I schedule gene flow events using demography.add_migration_rate_change () like this, each ending at some specific time tend (looking backwards in time). For each gene-flow event (two arrows in the figure above), I (currently) schedule a census event at time tend as well. I know in your example you schedule the census event at time equivalent to my t_end + 0.01 (your t_end = 100, to keep things similar to my msprime script). However, when I tested the effect of t_end + x for various values of this "offset" x (even x = 0) on simulations from the model shown above, the results were always the same as long as x was reasonably small.

I understand doing the "offset" is probably related to these guidelines but I don't see why doing census right at the "end" of the migration event (looking backwards in time), so at time 100 (using your example) or time tend in my msprime script, shouldn't also work in ensuring that condition number 2 is satisfied. In fact, I would even say that specifying the census exactly at the time the lineages are migrating should actually ensure that all ancestral lineages are marked by some "unary census marker nodes" in the best possible way, not missing any lineages potentially coalescing in the time between 100 and 100.01? Or is this supposed to ensure correct tract labelling in some more advanced regime of coalescent simulations in msprime?

I have a feeling I'm a) missing something simple and important, b) totally overthinking this, or some combination of both. :)

To ask this question more directly: in which way could users end up with incorrect ancestry tract information if I schedule census at tend, i.e. schedule a census event right at the time tend that the lineages would start moving between populations in a migration event? Basically, is it wrong (and why?) to schedule the census event in your example at time = 100 and not 100.01? Is it wrong only in some circumstances or "nonstandard" modes of operation in msprime but OK in simpler ones? (I know there's the discrete-time Wright-Fisher mode, and others which I have never used and don't unfortunately know much about.)

2.

(This one is probably obvious but I want to be sure I'm not doing some nonsense.) Imagine that a user wants to get ancestry tracts segregating in PAPuans from both the NEAnderthal and DENisovan introgression events. Obviously, looking at the graph of the model above, those tracts originate from two separate admixture events, each of them associated with a census event (let's say at times t_nea and t_den).

Given my understanding of how tspop operates, in order to get complete NEAnderthal and DENisovan tracts in PAPuans, a Python user has to call tspop.get_pop_ancestry() twice and do something like this, right?

# label tracts at census time t_nea
tracts_nea = tspop.get_pop_ancestry(ts, census_time=t_nea)

# => then discard tracts found in non-PAPuan samples (we don't care about EURopeans
#       in this case) and some non-NEAnderthal ancestry tracts that could be there

# label tracts at census time t_den
tracts_den = tspop.get_pop_ancestry(ts, census_time=t_den)

# => then discard non-DEN labelled tracts
#       (for instance, if NEA and DEN introgressions happened directly into PAPuans
#        and were overlapping in time, census at time t_den could also return some
#        NEA tracts the user doesn't care about)

<... combine the two data frames into one ...>

In other words, if I want tracts potentially originating from multiple migration events, I have to run tspop.get_pop_ancestry() multiple times, right? I first thought that the method can accept a list of census times based on this function signature but I think I'm reading this wrong (I learned Python waaaay before Python typing / function signatures were a thing).


I would appreciate any advice on those two questions / humble requests for clarification. Thanks for your help!

bodkan commented 9 months ago

Bonus content, also for my own records for later.

With slendr's new ts_tracts(), when a user calls ts_tracts(<tree sequence>, census = t), slendr then:

  1. Peeks into the model specification, extracting which gene-flow events start at time t (looking forward in time).
  2. Extracts information about which populations are being "sources" of ancestry in those gene-flow events in question.
  3. Internally runs tspop.get_pop_ancestry() on the tree sequence, and filters for tracts corresponding to the sources involved in the gene flow(s) corresponding to that specified census time, discarding the rest.

For the model plotted above in which NEAnderthal introgression starts at 55000 years ago and DENisovan introgression starts at 35000 years ago:

All the above does give me reasonably looking results in terms of expectations of % of ancestry or tract length distributions, but I haven't done any formal testing yet.


Pinging @katiabou so that she's kept in the loop as someone who actively uses this in her work.

bodkan commented 9 months ago

Also, thanks to the nice and tidy format of the result of tspop, when ts_tracts() is run on a non-slendr tree sequences (like a normal pure-Python msprime output), it reduces to a plain and simple reticulate call to your Python method, skipping all the slendr-specific R table formatting that's only used for slendr-annotated tree sequences.

This gives identical results to what you have in your documentation. I don't know who would do this in R even if they knew enough Python to write an msprime script and simulate data... but still, at least this should work without a problem. :)

gtsambos commented 9 months ago

Just seeing this coming back from the airport, Martin-- I will respond tomorrow morning! :)

On Mon, Dec 4, 2023, 5:38 AM Martin Petr @.***> wrote:

Also, thanks to the nice and tidy format of the result of tspop, when ts_tracts() is run on a non-slendr tree sequences (like a normal pure-Python msprime output), it reduces to a plain and simple reticulate call to the Python method with identical results https://urldefense.com/v3/__https://www.slendr.net/articles/vignette-10-tracts.html*pure-msprime-tree-sequence__;Iw!!K-Hz7m0Vt54!h1Dg1o3C3X50unlYmgM9cluBqsjdEjvQzj3SkJtpWmi4oYiqsY_57jbxVe3gNTm5XsYm2fbSTYDI_Uk7koMvsXuC$ to what you have in your documentation.

I don't know who would do this in R even if they knew enough Python to write an msprime script and simulate data... but still, at least this should work without a problem. :)

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/bodkan/slendr/issues/146*issuecomment-1838661683__;Iw!!K-Hz7m0Vt54!h1Dg1o3C3X50unlYmgM9cluBqsjdEjvQzj3SkJtpWmi4oYiqsY_57jbxVe3gNTm5XsYm2fbSTYDI_Uk7khUV6uIU$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEHOXQSVO7XO4XUZDYZBQV3YHXG4ZAVCNFSM6AAAAABAF2P4EWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMZYGY3DCNRYGM__;!!K-Hz7m0Vt54!h1Dg1o3C3X50unlYmgM9cluBqsjdEjvQzj3SkJtpWmi4oYiqsY_57jbxVe3gNTm5XsYm2fbSTYDI_Uk7krYfMcgr$ . You are receiving this because you were mentioned.Message ID: @.***>

bodkan commented 9 months ago

A small update on how I'm (at the moment) scheduling a census event for every gene-flow event in the built-in slendr/msprime simulation script as I described it above.

I think I might have discovered a small msprime issue related to scheduling census events?

It turned out that if there are more then one gene-flow events at the same time t, this would mean calling demography.census(time=t) multiple times for the same time t, which causes msprime to crash with:

msprime._msprime.LibraryError: The simulation model supplied resulted in a parent node
having a time value <= to its child. This can occur either as a result of multiple bottlenecks
happening at the same time or because of numerical imprecision with very small population sizes.

Do you think this could be an msprime issue, @gtsambos @petrelharp? For the record, I can reproduce this by duplicating the demography.census(time=100.01) in the toy tspop simulation under this heading (literally just running the same example Python code, but duplicating the line demography.add_census(time=100.01) twice). Running that script modified in this way gives me the same error as above (I'm on msprime v1.2.0).

Would it make sense to do some sort of set(census_times) unique-ing operation on the msprime side like I'm doing now in the slendr msprime simulation script (see below for an example).

Perhaps the error message could be improved? That is, if calling census multiple times for the same time really is a user issue, at least they could be informed on what's wrong (at first I thought I have a bug in the slendr/msprime simulation script). Happy to open an msprime issue on this.


As a result of the above mentioned issue, I modified the built-in slendr/msprime script to first collect census times in a list as individual migration events are scheduled, and then do a second pass over only unique times, calling demography.census() on those. So, like this.

gtsambos commented 9 months ago

Hi @bodkan, thanks for these very considered and thoughtful comments and questions -- again, I'm so glad this has proven useful to you so far!

This error

msprime._msprime.LibraryError: The simulation model supplied resulted in a parent node having a time value <= to its child. This can occur either as a result of multiple bottlenecks happening at the same time or because of numerical imprecision with very small population sizes.

occurs whenever you specify a simulation that produces multiple nodes simultaneously on the same lineage (which is exactly what happens when you put multiple census events at the same time). However, I thought we had updated the error message to mention census events -- perhaps I forgot to do that 🤔 thank you for pointing it out, that is definitely confusing!

Essentially, having multiple 'identical' nodes in a tree sequence will make tskit unhappy, and since the census_event() is placing new nodes on every lineage with brute force, you'll see this error whenever there are multiple censes at the same time. I vaguely remember discussing the possibility of a more 'set-like' census event as you've described, where existing nodes at the specified time are marked as census nodes instead of having new ones added on top of them, but Jerome had some reason why that was infeasible or otherwise not a good idea and I remember being convinced by it at the time, though I can't quite remember what that reason was now. (Will dig through msprime issues later to see if I can find it)

Is there a specific reason why you would like to schedule multiple census events at the same time?

Basically, is it wrong (and why?) to schedule the census event in your example at time = 100 and not 100.01? Is it wrong only in some circumstances or "nonstandard" modes of operation in msprime but OK in simpler ones? (I know there's the discrete-time Wright-Fisher mode, and others which I have never used and don't unfortunately know much about.)

This question is related to the error above, and you're right that it should be okay to set up the simulation as you've described it in many situations. Basically, you'll get an error if you try to put census nodes on top of lineages that already have nodes at the specified time. In coalescent simulations this is rarely a problem, but in discrete-time Wright-Fisher simulations, you will have nodes at all of the integer times. if you are running coalescent sims, then you're right that it should be okay to specify integer census times. The recommendation I put in the docs was just a generalisable one that should work in all situations :)

gtsambos commented 9 months ago

In other words, if I want tracts potentially originating from multiple migration events, I have to run tspop.get_pop_ancestry() multiple times, right?

Yes, that's correct. We set it up this way to ensure that the local ancestry calls are well-defined (for instance, you might have a chunk of genome that derives from a Neanderthal ancestor at time T_N, but that was present in a Denisovan ancestor at some previous time T_D)

gtsambos commented 9 months ago

one final thing -- link-ancestors can take quite a bit of memory when working with very ancient admixture and introgression, sorry about that :/ it might be worth mentioning this to your users in case it comes as a shock!

bodkan commented 9 months ago

Wow, this is absolutely brilliant, Georgia! Thank you for the detailed answers! I think this cleared up all the confusion and answered all the questions I had.


[with regards to the msprime error]

Your explanation makes perfect sense. And I can see why doing some set(census_times) on msprime's side might be too much magic that Jerome et al. might not like for some other, deeper reason. I do agree that making the error message a bit more explicit in mentioning "duplicate" census nodes would clear things up. It only took a minute to figure out the error disappears if I drop the census events from the (otherwise identical) Python script, but still. Perhaps something to add in some future msprime release.


Is there a specific reason why you would like to schedule multiple census events at the same time?

Actually, I wasn't doing this on purpose. In pseudocode, what I do on the slendr/msprime Python side of things is this:

for gf in <gene flows as scheduled by slendr on the R side>:
    call demography.add_migration_rate_change() as defined by a slendr model

Then, in order to be able to extract admixture tracts from any of the scheduled slendr gene-flow events, I modified the loop like this:

for gf in <gene flows as scheduled by slendr on the R side>:
    call demography.add_migration_rate_change() as defined by a slendr model for migration at time t_gf
    call demography.add_census(t_gf)

This worked fine, except for models which had (by chance) multiple gene-flow events happening at the same time t_gf. In that case, that loop ended up calling demograpy.add_census(t_gf) several times, leading to that msprime error we discussed.

To get around this, I'm now calling demography.add_census() on unique-d list of gene flow times.


[census at 100 vs 100.01]

Perfect, I did suspect this has something to do with fancier models, like the discrete WF model. Given that slendr doesn't work with those (not yet anyway), I'll go forward with scheduling demography.add_census() at integer times corresponding to the start times of gene flow events.


one final thing -- link-ancestors can take quite a bit of memory when working with very ancient admixture and introgression, sorry about that :/ it might be worth mentioning this to your users in case it comes as a shock!

Interesting! If that becomes an issue for some people, I will dig into details and perhaps add a section to the documentation. For the time being, on the toy simulations we've tested (a couple of hundreds of individuals, quite old introgression, 100Mb tree sequence) this wasn't a problem at all even on a laptop.

bodkan commented 9 months ago

Thank you so much for all the help! This answered all the questions I had so I think I can close this.

I just released a new version of slendr with tspop-backed new function ts_tracts(). Super excited to have this in!

Speaking of which, the manpage of ts_tracts() links to your tspop module and to your paper as well. I should figure out a better / more explicit, way to nudge people into citing everything that slendr relies on (tskit/tspop, msprime, and SLiM). Perhaps a brief message that gets printed by calling library(slendr) once in a while. But I hope this is at least a reasonable compromise for the time being!