Support for non-neutral slendr/SLiM simulations via custom, user-defined SLiM code

bodkan commented 5 months ago

This PR implements the biggest update to slendr to date — the possibility of running slendr/SLiM simulations which are non-neutral. This has been the last big remaining piece of the puzzle that took me a while to figure out. In fact, I think after this is cleaned up and polished, I will be releasing v1.0. Whatever comes down the line will likely involve adjusting / fixing / adding small things here and there, but nothing super major.

The update is too big to summarise here and an extensive description of the new functionality will become a part of a new vignette (probably finished gradually after this PR is merged). However, for posterity, here is a gist of things:

Let’s say we have the following slendr model:

library(slendr)
init_env()

# African ancestral population
afr <- population("AFR", time = 90000, N = 3000)

# first migrants out of Africa
ooa <- population("OOA", parent = afr, time = 60000, N = 500, remove = 23000) %>%
  resize(N = 2000, time = 40000, how = "step")

# Eastern hunter-gatherers
ehg <- population("EHG", parent = ooa, time = 28000, N = 1000, remove = 6000)

# European population
eur <- population("EUR", parent = ehg, time = 25000, N = 5000)

# Anatolian farmers
ana <- population("ANA", time = 28000, N = 3000, parent = ooa, remove = 4000)

# Yamnaya steppe population
yam <- population("YAM", time = 8000, N = 500, parent = ehg, remove = 2500)

# define gene-flow events
gf <- list(
  gene_flow(from = ana, to = yam, rate = 0.4, start = 7900, end = 7800),
  gene_flow(from = ana, to = eur, rate = 0.5, start = 6000, end = 5000),
  gene_flow(from = yam, to = eur, rate = 0.65, start = 4000, end = 3500)
)

model <- compile_model(
  populations = list(afr, ooa, ehg, eur, ana, yam),
  gene_flow = gf, generation_time = 30
)

plot_model(model, proportions = TRUE)

We can run it in a non-spatial setting using slendr's SLiM back-end script to get a tree sequence like this:

ts <- slim(model, sequence_length = <…>, recombination_rate = <…>)

Or we can run it with the msprime back end in a coalescent mode:

ts <- msprime(model, sequence_length = <…>, recombination_rate = <…>)

Obviously, if there’s no map provided, it makes way more sense to run things with msprime, because it’s going to be faster.

Over time, lots of people told me they would like to use slendr for some more complex demographic models (or even just for convenience, because they like R), but they need to simulate non-neutrality — adaptive introgression, background selection, polygenic traits, etc.

Very early on, we have decided to limit the proportion of SLiM-specific functionality that’s supported by slendr directly in its R interface — obviously, the range of non-neutral models that SLiM can write is infinite, but the R interface of slendr can be only as complex while still remaining easy to write, read, and understand. The decision was to only support the part of SLiM that’s needed to support slendr’s (still simplified!) spatial features, and — for non-spatial models — only those features which allow writing models which could be run interchangeably through both SLiM and msprime (our slim() and msprime() functions). Adding support for anything non-neutral for slendr/SLiM simulations would necessarily a) still support only an arbitrary support of what SLiM can do anyway and b) make the R API of slendr much more complex for models which don't deal with selection.

This update implements a simple extension mechanism which allows users to plug in customized SLiM code into the built-in slendr/SLiM back-end script, making it possible to run various non-neutral models which still rely on the demographic part of slendr (like the code above), unchanged. They way we do this is to just let users write arbitrary SLiM code to do whatever they need, but still rely on the convenience of the demographic part of slendr, if that's convenient for them.

I’ll note right here that the SLiM models run by slendr are (and always will be) Wright-Fisher. That’s an assumption built at such a core layer of slendr that it doesn’t make sense to change it now.

Yes, this means that the extension mechanism for supporting selection will likely only be practically useful for non-spatial models (because fancy spatial models basically almost always need a non-WF setting, which is SLiM's domain, not slendr's).

With that out of the way, here’s how things are designed to work at the current stage using a trivial toy example — simulating a trajectory of a beneficial allele in a population over time as a function of:

in which population in the model above the mutation arose
when that mutation arose
and how strongly selected that mutation is

First, the default mode of running SLiM models with slendr relies on this chunk of (hardcoded) code:

initialize() {
    initializeMutationType("m0", 0.5, "f", 0.0);

    initializeGenomicElementType("g1", m0, 1.0);
    initializeGenomicElement(g1, 0, SEQUENCE_LENGTH - 1);

    initializeMutationRate(0);
    initializeRecombinationRate(RECOMBINATION_RATE);
}

Obviously, this has been the root of the issue for people interested in running selection models with slendr. Not only is a single neutral mutation type baked in, but even a single genomic element is hardcoded, no mutation rate, and a fixed recombination rate.

Now let’s say we have the following bit of R code encoding this SLiM "snippet" (it can also be a file). Yes, by itself it doesn't do anything -- everything will be come clear soon:

extension_template <- r"(
initialize() {
    defineConstant("s", {{s}});
    defineConstant("onset_time", {{onset_time}});
    defineConstant("target_pop", "{{target_pop}}");
    defineConstant("origin_pop", "{{origin_pop}}");

    defineConstant("output_file", "~/Desktop/traj_" + target_pop + "_" + origin_pop + ".tsv");
}

initialize() {
    initializeMutationType("m1", 0.5, "f", s);

    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 100000);

    initializeMutationRate(0);
    initializeRecombinationRate(1e-8);
}

function (void) add_mutation(void) {
    // add a beneficial mutation of the given selection strength to a random genome of the
    // population with the name `origin_pop`
    target = sample(population(origin_pop).genomes, 1);
    mut = target.addNewDrawnMutation(m1, position = 50000);
    defineGlobal("MUTATION", mut);

    write_log("adding beneficial mutation to population " + target_pop);

    // write out a header of an output file
    writeFile(output_file, "time\tfreq_origin\tfreq_target");
}

// add a beneficial mutation at a given onset time
tick(onset_time) late() {
    // save simulation state in case we need to restart if the mutation is lost
    save_state();

    add_mutation();
}

// check that the mutation is not lost in every tick after it appeared in the simulation
tick(onset_time):SIMULATION_END late() {
    // the mutation is not segregating and is not fixed either -- we must restart
    if (!MUTATION.isSegregating & !MUTATION.isFixed) {
        write_log("mutation lost -- restarting");

        reset_state();

        add_mutation();
    }

    // compute the frequency of the mutation of interest and save it (if the
    // mutation is missing at this time, save its frequency as NA)
    freq_origin = "NA";
    freq_target = "NA";
    if (population(origin_pop, check = T))
      freq_origin = population(origin_pop).genomes.mutationFrequenciesInGenomes();
    if (population(target_pop, check = T))
      freq_target = population(target_pop).genomes.mutationFrequenciesInGenomes();

    writeFile(output_file,
              model_time(community.tick) + "\t" +
              freq_origin + "\t" +
              freq_target, append = T);
}
)"

The {{thingies}} above are parameter placeholders — they are not required, but are useful for parametrizing the model without having to hardcode fixed values. We can instantiate those parameter placeholders with a new slendr function substitute() like so:

extension <- substitute(
  extension_template,
  s = 0.1, onset_time = 15000,
  origin_pop = "ANA", target_pop = "EUR"
)

substitute() simply plugs in values of given parameters into the template SLiM snippets (taking care of error checking, missing a parameter, extra parameters, etc.). What it produces is a SLiM snippet where all the {{thingies}} are substituted for concrete values. The point of this is to not have to complicate things by specifying those parameters as CLI arguments when running the simulation on the command-line in the background, because the slim() R function would need changes which would complicate its interface.

Finally, the instantiated "extension snippet" can then be used in the good old compile_model() like this, without changing anything else:

model <- compile_model(
  populations = list(afr, ooa, ehg, eur, ana, yam),
  gene_flow = gf, generation_time = 30,
  extension = extension  # <== note here and compare to compile_model() chunk above
)

So, same thing as the base model above, just with one extra parameter extension =. Here the new version of compile_model() swaps out the initialize() { … } block of slendr which enforces neutrality by default by the initialize() {…} block encoded in the user-defined SLiM extension script, and adds all the other code into the SLiM back-end script compiled for the user model.

The final, customized model can be run in exactly the same manner as the original (neutral) version above, except we don’t have to specify sequence_length = and recombination_rate = because they are already provided by the extension SLiM snippet (the slim() functions checks for this and informs the user accordingly if some required information is missing):

slim(model)

In this case, this simulation run produces a file ~/Desktop/traj_EUR_YAM.tsv. In fact, because in this toy example we only care about the frequency trajectory, we could also run the model with slim(model, ts = FALSE), and skip tree-sequence recording altogether.

The substitute() command above makes it easy to wrap it all in a function parametrized by origin_pop and check how does the trajectory of the beneficial allele in EURopeans change depending on how it's expected to "traverse" the admixture graph encoded by the demographic model, giving results like this:

This is only a toy example! The real power of this of course in setting up more complex genomic architectures with exons, introns, non-uniform recombination, multiple mutations interacting in complex ways, particularly in combination with tree-sequence analyses.

Final words: The above should help advanced SLiM users to take any complex slendr demographic model, and overlay an arbitrary genomic architecture or selection landscape on top of it, without having to concern themselves with setting up the demographic history manually in SLiM. For these kinds of scenarios, this sort of thing hopefully leverages the strengths of both slendr and SLiM.

We have currently two concrete projects for this: one simulating a strangely behaving adaptively(?) introgressed locus between Africans-Eurasians-Neanderthals/Denisovans, another one looking at quantitative traits in complex admixture demographic models. These two will be a good test of how well this functionality works and if there’s something I’ve missed. In particular, I haven’t looked yet at whether the way slendr deals with custom fitness effects etc. might somehow clash with user-provided customization code. We’ll see.

More complex examples will be in the vignette. [EDIT: It doesn't look like the html is fully rendered when downloaded from GitHub without the figures. It will look OK after it lands on the website.]

bodkan commented 5 months ago

Ah, another thing to note is that there are a couple of Eidos functions provided by the built-in slendr SLiM script (a slendr "public API" of sorts), which are designed to make writing slendr customization code easier:

population() -- can refer to a slendr population with its "R symbolic name" and get the corresponding Subpopulation object
tick() -- computes a tick value corresponding to a given slendr time (i.e. "what's the tick corresponding to '14 thousand years before the present' in a slendr model)
model_time() -- does the opposite of converting a SLiM-based tick value to a corresponding slendr model time (i.e. "how many years before the present does tick value 1447 correspond to?")
save_state() and reset_state() -- saves/loads the population state and slendr-specific population tags
write_log() -- an easy logging function already used by slendr internally

And then a couple of constants which are internally used by slendr, and might be useful for user-defined SLiM customization code:

SIMULATION_START and SIMULATION_END -- start tick (after burnin, if there's any) and end tick of a simulation of the currently running slendr model
SEQUENCE_LENGTH -- sequence length given to the slim() R function (if any was provided)

bodkan commented 5 months ago

Also, a massive, massive shout out to @bhaller for this:

It's absolutely not an overstatement to say that this is all that made this possible. The 4.2 release came out the very same day (not kidding) that I've been trying (and failing) to figure out the easiest way for users to schedule their own "events" without having to do lots of (often very complex) script block rescheduling.

I got the notification of SLiM 4.2 being released, saw this, and had everything important in place in a matter of a couple of hours.

bhaller commented 5 months ago

Looks amazing; I will look in more detail soon!

petrelharp commented 5 months ago

Wow, very impressive!

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 85.95318% with 42 lines in your changes are missing coverage. Please review.

Project coverage is 90.38%. Comparing base (3e02fa2) to head (160161a).

Files	Patch %	Lines
R/slim.R	72.30%	36 Missing :warning:
R/msprime.R	96.20%	3 Missing :warning:
R/interface.R	66.66%	2 Missing :warning:
R/tree-sequences.R	83.33%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #155 +/- ## ========================================== + Coverage 89.62% 90.38% +0.75% ========================================== Files 9 11 +2 Lines 3095 3171 +76 ========================================== + Hits 2774 2866 +92 + Misses 321 305 -16 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

bodkan commented 5 months ago

I added a bunch of new unit tests, fixed old tests broken by the new changes.

I also renamed substitute() to substitute_values() to avoid clashes with an important base R metaprogramming function.

I will be now using the new functionality extensively in the new ABC R package — this will be a huge stress test for this functionality and should reveal lurking issues. It would be amazing to have some selection-relevant examples for the paper. But this PR has already grown too big, so further problems will be resolved with individual GH issues.

Once everything looks in place I will release v1.0.

Phew. This took way more full-time work than I expected! I’m glad I brushed up on my SLiM though, it’s been too long.

(I also have to think about whether this offer good mechanism for more elaborate tweaking of spatial slendr simulations, beyond the rather limited slendr R interface. But that’s for waaay later.)

bodkan commented 5 months ago

Oh and the new vignette which describes the new functionality is now also rendered on the website. It's super rough, with broken English and probably not the clearest set of examples... but I ran out of energy in this PR and wanted to move on. :) I will be fixing the vignette as I'm polishing everything else on the main branch.

bodkan / slendr

Support for non-neutral slendr/SLiM simulations via custom, user-defined SLiM code #155

Codecov Report