Temporarily store individuals associated with tree-sequence nodes, culling as necessary during simplification

hyanwong commented 6 years ago

At the moment ~~the tree-sequence simplification process removes individuals from the tree sequence unless they are currently alive, or have been flagged by treeSeqRememberIndividuals()~~ individuals are only kept in the tree sequence if they are explicitly Remembered, or are present at the end of the simulation (see https://github.com/MesserLab/SLiM/issues/10). But there is a use-case for retaining in the final tree sequence those individuals containing a coalescence point involving the current samples (i.e. all individuals that have an associated node in the tree sequence). There is discussion of this at https://groups.google.com/d/msgid/slim-discuss/629168f4-4bb6-463a-b0d3-e3281787dceb%40googlegroups.com

hyanwong commented 6 years ago

Of tangential relevance: I want access to individuals (not nodes) because of a slight logical oddity in the current tskit tables format, in that some properties belong to nodes (e.g. subpopulation & time) whereas quite similar ones (e.g. spatial 2D location) belong to individuals. I suspect that this is because backwards coalescent simulations can create nodes with subpopulation & time identifiers, but such simulations don't have any notion of individuals. Since 2D location is only really used in forward simulations, and usually only inspected in the currently alive individuals, location has been allocated to the individuals table rather than the nodes table, so as to avoid storing the same location value twice, once for each node in the individual. Of course, the subpopulation & time values may be duplicated like this, but we're prepared to accept that, since they are small values.

petrelharp commented 6 years ago

This is not actually the problem. Simplify retains individuals; the problem is that SLiM only puts those individuals in the individual-table-that-will-be-kept-around-long-term if we ask to Remember them. The details are here: https://github.com/MesserLab/SLiM/blob/master/treerec/implementation.md

Note that the current generation is only added to the individual table on output. So what we want is to add an option, like TreeSeqRememberNodeIndividuals(), say, that goes ahead and adds all individuals to the individual table. That is in principle straightforward, since unneeded ones should get removed by simplify(), but there will be a bunch of annoying bookkeeping.

Perhaps you can edit the title of this issue to reflect this?

petrelharp commented 6 years ago

> some properties belong to nodes (e.g. subpopulation & time) whereas quite similar ones (e.g. spatial 2D location) belong to individuals. I suspect that this is because backwards coalescent simulations can create nodes with subpopulation & time identifiers, but such simulations don't have any notion of individuals

That's exactly right; and discussed here: https://msprime.readthedocs.io/en/stable/interchange.html#sec-nodes-or-individuals

petrelharp commented 6 years ago

Also, I agree, this would be a great feature to have.

hyanwong commented 6 years ago

> Simplify retains individuals; the problem is that SLiM only puts those individuals in the individual-table-that-will-be-kept-around-long-term if we ask to Remember them.

Ah, I think I see. So actually we would need to store 2 sorts of individuals in the "keep-individuals-around" buffer: purgeable ones, which are there because they are associated with an active node, and non-purgeable ones which we have specifically asked to keep (and the non-purgeable flag takes priority in the case of conflict). Then when we simplify(), we check whether any purgeable individuals have lost their nodes, and delete them from the buffer. Is that right?

hyanwong commented 6 years ago

> The details are here: https://github.com/MesserLab/SLiM/blob/master/treerec/implementation.md

This is extremely useful, thanks.

Another point to make (and perhaps this is another issue) is that I can see some people wanting to have access to a specific individual, but only if that individual has left some genetic ancestry in the samples. For example, you might want to look at past individuals in which existing (forward-time) mutations occurred, but without having to keep track of those mutant individuals whose mutation went extinct. I guess this is applying another sort of flag to a potentially retained individual, but this is even more tricky, because once we simplify(), we are likely to remove the node that points to this individual, and so we can't tell if any tree lineages pass through that individual or not. Thinking about it, the solution might involve a modification to the simplification process, such that once an individual is flagged, its nodes are retained even after simplification, as long as they are nodes that are intermediate in one or more trees in the final TS, and not simply nodes whose descendants have all vanished.

petrelharp commented 6 years ago

> the solution might involve a modification to the simplification process, such that once an individual is flagged, its nodes are retained even after simplification, as long as they are nodes that are intermediate in one or more trees in the final TS, and not simply nodes whose descendants have all vanished.

I see - you want to retain nodes/individuals through simplify, but only if they are ancestral to the samples. That's a feature request for tskit before thinking about it in SLiM, I think, but I'd first want to see a good demonstration that it makes a difference.

petrelharp commented 6 years ago

> So actually we would need to store 2 sorts of individuals in the "keep-individuals-around" buffer: purgeable ones, which are there because they are associated with an active node, and non-purgeable ones...

Sort of: we don't need to actually do the managing of purging-or-not of them explicitly: simplify will discard individuals that are not referenced by a retained node; and since we are already ensuring that all the individuals we don't want to purge are referenced by the samples passed to simplify, if we put more individuals in the individual table, everything should Just Work.
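
To make the "not referenced by a retained node" rule concrete, here is a toy sketch in plain Python. It is not tskit's implementation: the dictionary stands in for the node table's individual column, and `filter_individuals` and the example IDs are made up for illustration.

```python
# Toy model of the rule described above; not tskit or SLiM code.
# node_individual maps each node ID to the individual it belongs to
# (None for a node with no individual). All names here are hypothetical.

def filter_individuals(node_individual, retained_nodes):
    """Return the individuals referenced by at least one retained node."""
    kept = set()
    for node in retained_nodes:
        individual = node_individual[node]
        if individual is not None:
            kept.add(individual)
    return sorted(kept)


# Three diploid individuals (two nodes each) plus one node without an individual.
node_individual = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C", 6: None}

# Suppose simplification retains only nodes 0, 4 and 6.
retained_nodes = {0, 4, 6}

surviving = filter_individuals(node_individual, retained_nodes)
print(surviving)  # ['A', 'C']: B is dropped because neither of its nodes survives
```

Under this model, putting extra individuals in the table costs nothing in the long run: any whose nodes all disappear are dropped at the next simplification.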

The most tricky thing is going to be making sure we don't stick the same individual in the table twice; and updating the individual if it's already in there. We're already doing stuff like that for RememberIndividuals, but it was a bit painful to get working right.

hyanwong commented 6 years ago

> I see - you want to retain nodes/individuals through simplify, but only if they are ancestral to the samples. That's a feature request for tskit before thinking about it in SLiM, I think, but I'd first want to see a good demonstration that it makes a difference.

Yes, I agree. This is a different issue.

hyanwong commented 6 years ago

> simplify will discard individuals that are not referenced by a retained node;

How convenient!

> The most tricky thing is going to be making sure we don't stick the same individual in the table twice; and updating the individual if it's already in there. We're already doing stuff like that for RememberIndividuals, but it was a bit painful to get working right.

OK, thanks for the pointers. I won't have time to do this in the next few weeks, but I might know someone who might be able to put some time into it.

hyanwong commented 6 years ago

> we don't need to actually do the managing of purging-or-not of them explicitly: simplify will discard individuals that are not referenced by a retained node; and since we are already ensuring that all the individuals we don't want to purge are referenced by the samples passed to simplify, if we put more individuals in the individual table, everything should Just Work.

Hang on, so you mean that in e.g. a WF model, I can just mark every individual, apart from those in the last generation, with treeSeqRememberIndividuals(), and I'm done? If I do this, then call simplify(), I won't find that I'm left with (N * generations) individuals in the tree sequence table, but instead something smaller? If so, I feel this issue is basically done for me. I'm not sure what you mean by book-keeping in this case (see below).

> The most tricky thing is going to be making sure we don't stick the same individual in the table twice; and updating the individual if it's already in there.

Presumably this is not a problem for a model with discrete non-overlapping generations, though? And even for other models, we just need to keep a set of the individual IDs (assuming they are unique), and not add if an individual is already in the set. If we don't update, then the location and other metadata will be associated with the birth place of the individual, rather than the place of death (or reproduction), but that seems acceptable to me.
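
A minimal sketch of that idea, assuming each individual carries a unique ID. Everything here (`IndividualBuffer`, `add`, the `update` switch, the row layout) is hypothetical and only illustrates the bookkeeping; it is not how SLiM stores individuals.

```python
# Toy add-or-skip buffer keyed by a unique individual ID.
# Hypothetical names throughout; this is not SLiM's internal data structure.

class IndividualBuffer:
    def __init__(self):
        self.rows = []    # stand-in for the individual table
        self.row_of = {}  # unique individual ID -> index into self.rows

    def add(self, ind_id, location, update=False):
        """Add an individual once; optionally refresh its row if already present."""
        if ind_id in self.row_of:
            if update:
                self.rows[self.row_of[ind_id]]["location"] = location
            return self.row_of[ind_id]
        self.row_of[ind_id] = len(self.rows)
        self.rows.append({"id": ind_id, "location": location})
        return self.row_of[ind_id]


buf = IndividualBuffer()
buf.add(101, (0.0, 0.0))               # recorded at birth
buf.add(101, (3.5, 1.2))               # seen again later: no second row, birth location kept
buf.add(102, (1.0, 1.0))
buf.add(102, (2.0, 2.0), update=True)  # with update=True the row is overwritten instead
```

Without `update`, the stored location is the one at first insertion (the birthplace); with it, the row reflects the most recent state.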

petrelharp commented 6 years ago

> I can just mark every individual, apart from those in the last generation, with treeSeqRememberIndividuals(), and I'm done?

Nope: individuals who you Remember will be kept around forever in the tables, so you would indeed have N * num_generations entries at the end.

> The most tricky thing is going to be making sure we don't stick the same individual in the table twice; and updating the individual if it's already in there.
>
> Presumably this is not a problem for a model with discrete non-overlapping generations, though?

Well, even in a WF model you will need to check if someone has already been Remembered. But as I said, we're already doing something similar.

bhaller commented 6 years ago

> Nope: individuals who you Remember will be kept around forever in the tables, so you would indeed have N * num_generations entries at the end.

@petrelharp But if he then simplifies (after unmarking everything as a sample except the final generation and perhaps the first generation if he wants it), that would then get him what he wants, right? All the intermediate cruft would get simplified away unless it was involved in an ancestral coalescence event, but the individuals as well as the nodes associated with ancestral coalescence events would be kept, right? If he is OK with paying the memory/speed penalty associated with doing all that remembering, it seems like that would work, wouldn't it? Or am I misunderstanding?

petrelharp commented 6 years ago

> But if he then simplifies (after unmarking everything as a sample except the final generation and perhaps the first generation if he wants it), that would then get him what he wants, right?

Yes, that's right; I was explaining that it's not currently possible to do this with the treeSeqRememberIndividuals() method. One can postprocess the output to remove the extras; but you can't do it, currently, within SLiM as the simulation goes along.

hyanwong commented 6 years ago

I feel I'm missing something here - apologies about the long thread. You said "simplify will discard individuals that are not referenced by a retained node". To me that implies that simplify (which is presumably the method called regularly when creating tree-sequences during a SLiM run) will not discard individuals that are referenced by a retained node. So at the end of the simulation, we still have access to individuals that are referenced by nodes in the TS. Is that right? Am I perhaps wrong in assuming that simplify() on a tree-seq, and the simplification process during SLiM tree-seq simulations are the same?

petrelharp commented 6 years ago

Does this note (edited from above) clear things up?

> Simplify retains individuals; the problem is that SLiM only puts those individuals in the individual table if we ask to Remember them. The details are here: https://github.com/MesserLab/SLiM/blob/master/treerec/implementation.md
>
> Note that the current generation is only added to the individual table on output. So what we want is to add an option, like TreeSeqRememberNodeIndividuals(), say, that goes ahead and adds all individuals to the individual table. That is in principle straightforward, since unneeded ones should get removed by simplify(), but there will be a bunch of annoying bookkeeping.

So, most individuals never get put in the individual table at all. Only their corresponding genomes, as nodes.

hyanwong commented 6 years ago

Ah, that makes sense. So:

1. Most individuals never get put into the individuals table.
2. If an individual is put into the individuals table using treeSeqRememberIndividuals(), it never gets deleted, even if the node that references it disappears from the TS (is this right, or is it that simplification never gets rid of the nodes of a Remembered individual?)
3. I want a way of adding an individual to the individuals table without explicitly Remembering it (since according to (2) an explicit Remember would cause it to persist indefinitely).
4. As long as I can do (3), the task of removing individuals with no referencing nodes is already in the codebase (this seems a little strange to me - I don't understand when it would ever be used, since I understood that the only way currently to add individuals to the table is to Remember them, and according to (2) that means they should never be removed).

I'll adjust this list as and when I'm corrected on this. As you see I'm still a little confused. Sorry!

petrelharp commented 5 years ago

Edited:

1. Most individuals never get put into the individuals table.
2. Individuals are added to the Individual Table only if they are (a) part of the initial generation, (b) Remembered, or (c) part of the final generation.
3. If an individual is put into the Individual Table using any of these methods, its nodes are included in the list of samples passed to simplify(), and so its nodes are never removed, and neither is the individual.
4. You want a way of adding all individuals to the Individual Table without adding their associated nodes to the list passed to simplify(), so that they might be removed.

hyanwong commented 5 years ago

I'm picking up this thread again because of conversations with @gtsambos. On re-reading, this sounds like a rather easy change to make to SLiM - I just need an extra option to treeSeqRememberIndividuals() (e.g. ~~purge_if_unreferenced~~ is_sample=False - see below) that flags up this individual as one whose nodes should not be passed to simplify() during auto-simplification.

Would there be any objection to me trying to implement this? @petrelharp mentions annoying book-keeping, but I can't imagine it would be very tedious.

As a relevant aside, a simple change like this will automatically keep individuals associated with coalescence nodes. If I also want to keep individuals associated with a unary node (e.g. a unary node with a mutation above it), I either need to keep all unary nodes, or flag up that particular node for keeping during simplify() unless it is no longer referenced in the TS - this turns out to be a new feature of the tskit simplify() implementation that @gtsambos is working on right now.

bhaller commented 5 years ago

Well, as to the change itself, @petrelharp understands this stuff, and what you're trying to achieve, far better than I do. I would suggest that you would want to work closely with him on this to ensure that your changes make sense and are compatible with all the other options and switches involved with tree-sequence recording. I wouldn't want to accept this pull request unless it worked with cross-checking, with coalescence checking, with outputTreeSeq() and readFromPopulationFile(), etc., and I wouldn't want to accept it unless @petrelharp gave it a thumbs-up after a detailed code review. But that said, if he is on board then I'm on board.

If this goes forward, make a new pull request branch and make me a contributor on it, and I'll be happy to put in the necessary code to give you a new parameter on initializeTreeSeq(); adding new Eidos API is a little bit technical so it might be simpler for me to do that for you.

hyanwong commented 5 years ago

Thanks @bhaller . I will definitely do this. @petrelharp - I have been looking through https://github.com/MesserLab/SLiM/blob/master/treerec/implementation.md which is extremely helpful. Perhaps if I make changes to this documentation on a fork, you can see if my suggested approach is the right one?

The way I see it, there are some individuals in intermediate generations that you might want to Remember because they represent samples from the past - i.e. you have the full genome. Other times you might want to remember individuals because you need to keep information about the individual itself, but it is not explicitly thought of as a sample that persists independent of subsequent demographic events. Thus I suggest that RememberIndividuals() has the parameter is_sample, which by default is True. If set to False, this individual is only remembered if it contains a node that is present in the final tree sequence.
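
As a rough illustration of the intended behaviour, here is a toy model in plain Python. It is only a sketch of the proposal: `is_sample` is the suggested flag rather than an existing parameter, `individuals_after_simplify` and the example data are invented, and the set of ancestral nodes that simplification would retain is supplied by hand rather than computed.

```python
# Toy model of the proposed is_sample flag; not SLiM or tskit code.
# Hypothetical inputs:
#   individuals     : individual name -> is_sample flag
#   nodes_of        : individual name -> list of its node IDs
#   ancestral_nodes : nodes assumed to be retained as ancestors of the samples
#                     (e.g. coalescence nodes), given here by hand

def individuals_after_simplify(individuals, nodes_of, ancestral_nodes):
    """Return (samples passed to simplify, individuals that survive it)."""
    # Only individuals flagged as samples contribute nodes to the sample list.
    samples = [n for ind, is_sample in individuals.items() if is_sample
               for n in nodes_of[ind]]
    retained = set(samples) | set(ancestral_nodes)
    # An individual survives if at least one of its nodes is retained.
    kept = [ind for ind in individuals
            if any(n in retained for n in nodes_of[ind])]
    return samples, kept


individuals = {
    "founder": True,        # initial generation
    "remembered": True,     # Remembered as an ancient sample
    "mid_coal": False,      # is_sample=False, carries a coalescence node
    "mid_dead_end": False,  # is_sample=False, left no ancestry
    "final": True,          # final generation
}
nodes_of = {
    "founder": [0, 1], "remembered": [2, 3], "mid_coal": [4, 5],
    "mid_dead_end": [6, 7], "final": [8, 9],
}

samples, kept = individuals_after_simplify(individuals, nodes_of, ancestral_nodes={4})
print(samples)  # [0, 1, 2, 3, 8, 9]
print(kept)     # ['founder', 'remembered', 'mid_coal', 'final']
```

In this model an individual added with `is_sample=False` persists only while one of its nodes is still in the tree sequence, which is the behaviour being asked for.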

Happy to continue the discussion either on this thread or via any other suggested means.

bhaller commented 4 years ago

Is this issue still live?

petrelharp commented 4 years ago

Well, yes, it's open and would be nice, but thinking on it more (sorry for the delay! although in #40 we did say we'd talk about it in October, but didn't...) - it's a can of worms. I think what this is requesting, really, is the ability to add all individuals (or, maybe just a subset, but might as well be everyone) to the Individual Table, but without marking their nodes as samples, so that maybe they get retained through simplification and maybe not. This would be very nice! But a bunch of SLiM's bookkeeping is tied up in exactly who is in the Individual Table at the moment, so it'd take a reworking of the details here (maybe just adding a new flag would suffice, but as I recall there's a lot of trickiness having to do with the order individuals are stored, etcetera...).

So: I'm happy to advise on this can of worms, but not take it on.

bhaller commented 4 years ago

OK. I'll leave it open and we'll see what happens; I'm just reviewing issues before releasing SLiM 3.4. Thanks.

bhaller commented 3 years ago

This issue is done, right? Not sure why it didn't get auto-closed when the PR was accepted. I'm going to close it now; please reopen it if I'm confused.

bhaller commented 3 years ago

Realized I didn't delete the associated branch when this was closed. I will delete it now, but if that is a mistake it can of course be resurrected. :->

MesserLab/SLiM issue #25