Generalizing the interface for specifying genomic architectures / mutation types

Hi Ben,

I finally got to summarizing the thoughts I had on the topic of relaxing the requirement for specifying the features of genomic architecture and mutation types in SLiM. What follows are slightly expanded and clarified ideas I sent you over email a couple of days ago.

I will start with a slendr-focused motivation, to explain my reasoning and later generalizing this idea to a wider context (I hope). So, apologies for the long introduction... I'm also selfishly using this as an opportunity to present my thoughts so that they can be potentially shot down in case they turn out to be rubbish. :)

I've been thinking about ways to allow the extension of slendr demographic models (currently purely neutral, intended to simulate neutral tree sequence data and nothing more) towards more complex genomic architectures, involving custom genomic elements, non-neutral mutation types, etc.

At the same time, I'd like to make the above possible while keeping the current slendr architecture intact: an R interface which effectively controls a fixed SLiM back-end script (specifically designed to do one thing and do it well -- interpret slendr models as they are currently self-referentially defined by what slendr can do).

One simple idea I had for this was that a user could provide a snippet of custom SLiM code (maybe an initialization block specifying genomic elements, mutation types, etc., perhaps some custom callback functions which would add beneficial mutation at a certain times in certain locations, etc.). This could happen in the compile_model() phase of a slendr model definition in R, get included into the core SLiM/slendr back-end script, and the simulation would then be executed with the standard slim() runner function.

What I like about this approach is:

It lets slendr to its thing: provide an interface for the specification and execution of a fairly well defined subset of spatial/non-spatial demographic models, without complicating the code further while, at the same time, ...
... let SLiM do its thing too: allow users to extend those simple slendr models with other cool things I could never ever implement an R interface for (probably not everything, but certainly lots of interesting selection scenarios with custom genomic elements, various mutation types, etc.).

The issue I'm currently facing is this (and now we're getting to the changes that I've been proposing in my email).

What we currently do on the slendr back end is something like this:

[... slendr initialize() block in the back-end script...]

initializeMutationType("m0", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m0, 1.0);
initializeGenomicElement(g1, 0, SEQUENCE_LENGTH - 1);
initializeMutationRate(0);

[...rest of the slendr initialize() block...]

Basically, I'm adding a "dummy" mandatory mutation type m0 which is never used because mutations are overlaid on top of the tree sequence during downstream tree-sequence processing (our ts_load() & ts_mutate() in slendr). I do this because initializeGenomicElementType requires a mutation type to be specified.

Now, given all this, is there any chance we could do just

initializeGenomicElementType("g1", NULL, NULL); // to keep the interface of initializeGenomicElementType the same
initializeGenomicElement(g1, 0, SEQUENCE_LENGTH - 1);

I.e., skipping the specification of mutation types completely? Coming back to my original motivation for this, this would allow the user to provide something like this outside of the fixed slendr back-end script in some custom SLiM snippet:

// custom SLiM/slendr initialization block in a separate code snippet file which would be provided to the
// `compile_model()` step in slendr (just a sketch of an idea)
initialize() {
    initializeinitializeMutationType("m0", 0.5, "f", -0.0001); // create negative mutation type
    g1.addMutationTypes(m0, 1.0); // assign mutation type to the genomic element type created by the "fixed" slendr backend

    // because this all would make mutations entirely optional (the base slendr back end wouldn't  specify any mutation types)
    // mutation rate would be also optional
    initializeMutationRate(1e-8);
}

In other words, my proposal here is to make the default SLiM behavior assume as little about genomic architecture as possible.

From my own slendr perspective, now that I'm thinking about this more, the only thing that would be actually needed is the total length of the sequence, nothing more! That's because the most basic output of a neutral simulation is just a tree sequence (effectively, an abstract linear sequence which has been potentially broken up by recombination). In this setup, everything else (mutation types, different genomic elements which would have assigned those mutation types at various proportions, mutation rates, etc.) would be specified in an optional user-defined "initialization SLiM snippet".

I hope this makes at least some kind of sense!

Hi @bodkan. Typing fast here as I have to go soon.

One thing: I suspect that you will simply have to get into dynamic script generation at some point. As you start wanting to let the user plug bits of script in here and there, it seems like it will get increasingly difficult to accommodate that without generating the script on demand. Is there really a problem with doing so? It seems like it would be pretty straightforward...?

But as regards this specific proposal, I think that what I am planning to do may satisfy your need here. I have been thinking that SLiM 4 will offer a couple of different ways of setting up the genetics:

An initialize() callback with no genetic setup at all – no mutation types, genomic element types, mutation/recombination rates, nothing. This would give an "ecological" model with a zero-length chromosome and no mutations. This will be useful in the multispecies context, particularly, for modeling species that don't need genetics.
An initialize() callback with only one genetics-related call: initializeChromosome([i$ number = 1], [i$ length = 0]). This would set up an empty chromosome of a given length. This queues up support for multiple chromosomes down the road; I could add a chromosome type (autosome, X, Y, mitochondrial), different numbers and lengths could be supplied, etc. This call could then be followed by the standard genetic initialization setting up mutation types and so forth; or those calls could be omitted to get an empty chromosome of the given length.
The setup as it is now; in this new paradigm, this would imply initializeChromosome(1, NULL), where NULL infers the length of the chromosome from the maximum of the genomic elements and recombination/mutation map lengths, as now.

So slendr would want to use form (2), with an initializeChromosome() call and nothing else. User script to specify additional genetic structure beyond the length could be plugged in after that call; if nothing is plugged in, then you have an empty chromosome. I think this is what you're asking for, if I understand correctly.

Thoughts? Tagging you also, @petrelharp.

I have just implemented (1). (2) will wait until there is feedback here; if (1) suffices for your purposes then I will put off (2) indefinitely (but I suspect it does not).

[...] I think that what I am planning to do may satisfy your need here. I have been thinking that SLiM 4 will offer a couple of different ways of setting up the genetics:

An initialize() callback with no genetic setup at all – no mutation types, genomic element types, mutation/recombination rates, nothing. [...]

An initialize() callback with only one genetics-related call: initializeChromosome([i$ number = 1], [i$ length = 0]). This would set up an empty chromosome of a given length. [...] This call could then be followed by the standard genetic initialization setting up mutation types and so forth; or those calls could be omitted to get an empty chromosome of the given length.

[...]

So slendr would want to use form (2), with an initializeChromosome() call and nothing else. User script to specify additional genetic structure beyond the length could be plugged in after that call; if nothing is plugged in, then you have an empty chromosome. I think this is what you're asking for, if I understand correctly.

You understood my request perfectly, (2) is exactly what I had in mind and it would be amazing to have that.

Assuming I had (2), the only thing that the slendr SLiM back end would contain in its initialize() callback would be initializeChromosome(1, SEQUENCE_LENGTH) (under the current setup), and the user could then provide their own initialize() callback that would build on top of that "blank" chromosome (custom genomic elements, custom mutation types, mutation rates, etc.).

I think it would work perfectly and would cover at least 80% of uses for people who are interested in slendr but would like to study some selection scenarios. For instance, we have a student using hacked a slendr SLiM back end to trace the spread of a beneficial allele across a landscape (under different scenarios). Having (2) would immediately cover that use case and even other, slightly more complex, genetic architectures.

Thank you Ben!

On the more philosophical topic:

One thing: I suspect that you will simply have to get into dynamic script generation at some point. As you start wanting to let the user plug bits of script in here and there, it seems like it will get increasingly difficult to accommodate that without generating the script on demand. Is there really a problem with doing so? It seems like it would be pretty straightforward...?

Well, the problem I see is that I would simply have to draw the line somewhere as to how much of SLiM's functionality I would provide an R interface for (i.e. things that would be dynamically generated). What's in slendr right now (just fairly straightforward spatial and non-spatial demographic models) already requires a massive amount of R code and lots of unit testing to make sure that code works.

Which is why I thought that at least for specifying non-neutral genomic features (exons, custom mutation types, etc.), providing a customized SLiM snippet file might suffice, keeping the rest of the demographic back end as it is.

By the way, for even more complex models, the users always have the option to customize the SLiM back end itself and plug the modified script to the compile_model() function as a whole. That way, the whole simulation + analysis pipeline remains fully reproducible.

Somehow I just feel that I should let the advanced users be able to use SLiM for what SLiM does best without having to rely on a 2nd class citizen R interface which would -- even in the best case scenario -- cover only a subset of all that SLiM can do... and which would most likely be less pleasant to use than SLiM itself (i.e., the users wouldn't be able to use the SLiM manual as it is).

So, I'm considering all options, but on the short-term horizon, the only extension I envision is the one that expands the possibilities for customizing genomic architectures and custom mutations--again, something that (2) would cover perfectly. We will see about the rest!

Thanks for the clarification, @bodkan.

Re: script generation and your philosophical note, yes, I agree completely. The goal is not to try to wrap everything SLiM can do in slendr! I just meant that you could solve this problem you have with a bit of simple dynamic script generation. If the user supplies a genetic initialization snippet, then emit a back-end script that contains that; if they don't, then emit a back-end script that instead contains the standard boilerplate genetic set-up. Then this could work with SLiM 3, and there would be no need for initializeChromosome() at all; you have it within your power to generate a working script for both cases, right? Honestly, this would be my fairly strong preference for how to handle this issue. I could add initializeChromosome() now to fix this issue, but really that function is envisioned as being part of a much larger scheme that will enable simulation of multiple chromosomes in SLiM, of different types (autosome, X, Y, mitochondrial...). I would strongly prefer not to add it until I have written up a full design spec for that entire feature, so that I don't accidentally write initializeChromosome() in a way that does not fit well into the final design.

In fact, you can probably do it without even using dynamic script generation. If the user's dynamic genetic init snippet is supplied as a command-line parameter that is a string executed with executeLambda() (I suppose that is what you were planning?), then you could do something like:

if (exists(USER_SNIPPET)) { executeLambda(USER_SNIPPET); } else { initializeMutationType(...); initializeGenomicElementType(...); initializeMutationRate(...); ... }

Again, that should work in SLiM 3.7, whereas relying on the initializeChromosome() feature will make you SLiM 4 only. Upon reflection, I'd really inclined to say that initializeChromosome() will not be added in SLiM 4, for the reasons explained above. I will add it if there is some reason that these other approaches simply won't work for you; but I think I'm going to push back hard first. :-> Rule #1 of software engineering: don't start writing code until you have a design spec. Otherwise, you end up tied in knots, and find that you are forced to break backward compatibility to achieve the design you want.

Hello @bhaller!

Re: script generation and your philosophical note, yes, I agree completely. The goal is not to try to wrap everything SLiM can do in slendr! I just meant that you could solve this problem you have with a bit of simple dynamic script generation. If the user supplies a genetic initialization snippet, then emit a back-end script that contains that; if they don't, then emit a back-end script that instead contains the standard boilerplate genetic set-up. Then this could work with SLiM 3, and there would be no need for initializeChromosome() at all; you have it within your power to generate a working script for both cases, right?

[...]

In fact, you can probably do it without even using dynamic script generation. [...]

if (exists(USER_SNIPPET)) { executeLambda(USER_SNIPPET); } else { initializeMutationType(...); initializeGenomicElementType(...); initializeMutationRate(...); ... }

Again, that should work in SLiM 3.7 [...]

Again, as many times in the past, you have shown to have much better insight into many aspects of what I'm trying to do with slendr than myself. :) I appreciate the suggestion. This really seems like something that would unlock the potential for non-neutral genomic architectures very nicely without requiring any other dramatic changes (and without having to wait for SLiM 4). What I had in mind as a workaround was much uglier, which is what originally motivated me to reach out to you.

I would strongly prefer not to add it until I have written up a full design spec for that entire feature, so that I don't accidentally write initializeChromosome() in a way that does not fit well into the final design.

This sounds entirely reasonable and I agree with your perspective. "Design spec first" is a methodology I have learned to follow myself (the hard way). One reason why I have not yet delved into the raster-based spatial simulations. Still trying to figure out what is it that this thing should do and how, before writing a single line of code.

So, I guess this all settles my question. I will try to plug in the executeLambda-based extension mechanism and start from there. I don't expect any issues with this, but I will ping you in case I run into some.

Thank you again! (I think this issue can be closed now.)

MesserLab / SLiM

Generalizing the interface for specifying genomic architectures / mutation types #309