Should write our pedigree data to .trees files

bhaller commented 3 years ago

Now that https://github.com/tskit-dev/tskit/issues/852 has been fixed, I suppose SLiM ought to convert the pedigree info it has into the corresponding tskit info. I have the impression that pyslim does this already, though (?). Does it make sense to move responsibility for this into SLiM in some way, or can pyslim construct the info in the tskit table from the pedigree info that SLiM already provides? I suppose the less that is done in SLiM the faster forward simulation goes; doing this additional bookkeeping during forward simulation would only be additional overhead. Thoughts @petrelharp?

petrelharp commented 3 years ago

Those are my thoughts exactly! It'd be nice to do, but if it slows things down substantially it's not be worth it.

The state of things is: we can reconstruct individual parentage exactly whenever both parent and child are remembered. If we moved it in to SLiM then we'd get to know a bit more: individual parentage between Retained individuals for whom we don't have entire genomes. But I'm not actually sure if there's any workflows this would help out: to have parent-child individual relationships for sure, we need to be Retaining everyone, and in that case you know that node parent-child relationships imply individual parent-child relationships anyhow.

I think the biggest advantage is that it would eliminate the nastiest code in pyslim. And, I suppose if it were optional, then maybe it wouldn't slow things down noticably when it's not in use? But I suppose if it were optional then we'd still need to keep the pyslim method.

(Note: as we just learned, if A selfs to produce B, then A and B produce C, but then A is remembered and B is not, then the result is indistinguishable to pyslim from A selfing to produce C. But... that might actually be indistinguishable given the individual parent column as well.)

Also note that we can't actually use this yet, as we need at minimum simplify to update the individual parent column, which currently it doesn't.

bhaller commented 3 years ago

Those are my thoughts exactly! It'd be nice to do, but if it slows things down substantially it's not be worth it. ... Also note that we can't actually use this yet, as we need at minimum simplify to update the individual parent column, which currently it doesn't.

Hmm. Well, getting rid of nasty code in pyslim is a noble aim, and SLiM does have the information. I am tempted to say that, rather than keeping this information in the parent column as we go along, having it update across simplify() correctly, etc., which does sound like it might have a performance impact, maybe the right design would be to fill it all in at save time, on demand. Saving would construct a tree sequence, simplify, and then fill in the parent info in the table immediately before writing to disk. However, I guess by that point SLiM no longer has all the info it needs; the Individual objects have ceased to exist for most of the tree. So... hmm.

One solution would be for SLiM to put the individual pedigree info – pedigree IDs for the parents of each individual in the run, keyed by the offspring pedigree ID – in a separate table that never gets purged. Then, after simplifying, we could just look up the pedigree ID for each individual in that table and get the pedigree IDs of its parents, and construct the parent column from that. But that table would grow linearly with the generation number, since it wouldn't be purged by simplification; even with just three pieces of info per individual, it might get too big. Probably not the right design.

OK, then. The other approach would be to record parent pedigree IDs as part of the metadata in the individuals table. Then the info goes away for all individuals that get simplified away; but since it's SLiM pedigree IDs, not tskit ids, there is no need for simplify() to update the ids. The only cost would be adding the two ints to the individual metadata, plus a little processing overhead at save time. Not bad maybe.

Or we could play it by the book, and record the tsk ids into the parent column as we go, and let simplify() fix those ids once it has been upgraded to do so. That's certainly the simplest for SLiM; my concern is that fixing all the ids every time simplify() runs might be significant overhead. The whole thing could be optional, as you say; but I kinda hate the proliferation of optional-functionality flags, so I'd rather find a solution that is low enough overhead that it can just be turned on all the time.

I think my vote is: wait for simplify() to be able to do it for us, and then do some profiling to see whether it adds a noticeable amount of overhead. If it's just, say, 1% of total simplification time, then I'd say let's just do it that way, without a flag.

petrelharp commented 3 years ago

Hm, I hadn't been thinking properly about this: recall that individuals aren't like nodes - they're recorded when we ask, not at birth, when we don't necessarily have the information about their parents.

If we keep track of individual parents, then there will be zero effect on runtime tree sequence stuff, because we've got no individuals in the individual table unless we Remember them. So for most recipes, we'd only need to deal with this when we save out the tree sequence. But, is that information actually available at that time? We'd have to have all individuals know who the IDs of their parents, which I thought was something that you tried enabling by default, but discovered it had a large effect on runtime?

It sounds like you're suggesting something like an alternate mode of tree sequence recording, where all individuals are Retained, at birth, when we have their parent information. Then simplify would indeed just take care of everything.

Or, maybe the individual parent information is only filled out if it's available, i.e., if pedigree tracking has been turned on in SLiM. Hey, wouldn't this make the most sense?

bhaller commented 3 years ago

Hm, I hadn't been thinking properly about this: recall that individuals aren't like nodes - they're recorded when we ask, not at birth, when we don't necessarily have the information about their parents.

No, we do. TrackParentage() gets calls for every individual in tree-seq models (individual.h:134). It records the pedigree ids of the parents (and grandparents).

If we keep track of individual parents, then there will be zero effect on runtime tree sequence stuff, because we've got no individuals in the individual table unless we Remember them. So for most recipes, we'd only need to deal with this when we save out the tree sequence.

Good point re: it applying only to remembered (or retained) individuals, and first-gen / extant individuals too I suppose. Maybe that means we have all the information we need at save time, without needing any redesign?

But, is that information actually available at that time? We'd have to have all individuals know who the IDs of their parents, which I thought was something that you tried enabling by default, but discovered it had a large effect on runtime?

That was turning on pedigree tracking for all models, so that we could just get rid of the pedigrees_enabled_ flag altogether, which would simplify the code a bit. But pedigree tracking is still on whenever tree-seq recording is enabled, and that includes recording parent pedigree ids (because I didn't want to have another flag in the code).

It sounds like you're suggesting something like an alternate mode of tree sequence recording, where all individuals are Retained, at birth, when we have their parent information. Then simplify would indeed just take care of everything.

No, we don't want to go there. Users who want that can do it themselves; but getting parent info from the individuals table shouldn't need that.

Or, maybe the individual parent information is only filled out if it's available, i.e., if pedigree tracking has been turned on in SLiM. Hey, wouldn't this make the most sense?

Indeed! Which is always the case for tree-seq models! :->

petrelharp commented 3 years ago

Or, maybe the individual parent information is only filled out if it's available, i.e., if pedigree tracking has been turned on in SLiM. Hey, wouldn't this make the most sense?

Indeed! Which is always the case for tree-seq models! :->

Heh. Ok then. =) Once simplify deals with individual parents, this should be quite easy.

bhaller commented 3 years ago

Or, maybe the individual parent information is only filled out if it's available, i.e., if pedigree tracking has been turned on in SLiM. Hey, wouldn't this make the most sense?

Indeed! Which is always the case for tree-seq models! :->

Heh. Ok then. =) Once simplify deals with individual parents, this should be quite easy.

So... what's your plan, then? I'm not clear where we've ended up.

petrelharp commented 3 years ago

Oh! I think that we just put the information in to the individual tables, and let simplify deal with it, once it can do so: https://github.com/tskit-dev/tskit/issues/1136 . I don't think we even worry about it slowing things down, since it will only slow things if there are lots of individuals being remembered, and it shouldn't slow things much at all.

bhaller commented 3 years ago

OK, let's give it a go. Just putting the extra info into the individual tables in the first place might have a noticeable performance impact; we'll see. In that case we might need a pref.

bhaller commented 3 years ago

OK, things are getting wrapped up. :-> Is this bug making it into SLiM 3.6? If I understand correctly, it would entail rolling tskit's code into SLiM again, right? Is the appropriate tskit version tagged as a C release, or would we be waiting for other loose ends to get tied up before that could happen?

petrelharp commented 3 years ago

If you want to get this working in SLiM then it'd be easy to copy in the tskit code. And I guess I could write the pyslim stuff also. It is not completely trivial: recall that we'll have pedigree IDs of parents, but not their IDs in the individual table (or even whether they're in the table or not). So to do this, I guess we'll need to keep track of a "pedigree id -> individual table" mapping. Let me know what you think.

bhaller commented 3 years ago

Hmm. OK, on reflection let's push it. Sounds a bit complicated; nobody has asked for it yet; it'd be good to have time for it to shake out, and this release is just about ready to go; and I'd prefer to keep the tskit code in SLiM synced to formal C API releases to the extent possible, so that we're not getting pre-release stuff from them that might change. So, pended indefinitely. :-> Thanks.

petrelharp commented 3 years ago

It would be nice to do it soon, since it'd be almost free and it'd put that capability to work, but yeah, I don't think we need to hurry.

bhaller commented 3 years ago

@petrelharp we can work on this whenever you like, but I'll need your help on it. Looks like the bug on the tskit side has been fixed, so it should be ready to go. Not sure whether that bug-fix made it into the version of the tskit code that is now in SLiM, or if we'd need to do a new pull.

bhaller commented 3 years ago

If I understand the discussion above, this depends upon simplify() handling individual parents correctly, right? Has that now happened (in the version of the tskit code now in SLiM, which is pretty recent)?

petrelharp commented 3 years ago

It has - that's what https://github.com/tskit-dev/tskit/issues/1136 was about, and it's closed.

petrelharp commented 3 years ago

Let's see - to do this we need to

figure out how to get the individual's parent ID into the tables here
copy over individual parents that are already in the tables here and here and here

In the first place, where we're inserting individuals into the tables, we know the pedigree ID of the individual's parents, right? Then we could either

maintain a vector of pedigree IDs of the individuals currently in the tables; we'd need to update this vector whenever we messed with the tables; or
look through the metadata in the tables on the fly. I propose doing the first thing; so we'd have a global variable like individual_table_pedigree_ids_ and we'd append to it when we append to the table, and then we'd have a function that rebuilds it by iterating over the table and pulling the pedigree ID from metadata whenever we reorder the table (at the other places listed above).

bhaller commented 3 years ago

OK great. I can take a stab at implementing this. If I understand correctly, the strategy would basically be (I see you typed similar stuff, we crossed in transit, we can compare strategies :->):

When we're going to build the individuals table for saving, first build up a hash table containing one entry for each individual, mapping SLiM individual pedigree ID to tskit's row in the individuals table (maybe we make this already)
Then, as we add each individual to the table, get its parent pedigree IDs from SLiM's records, and use the has table to look up their corresponding row #.
If that lookup failed, the focal individual's parent is not in the table; record -1 (?) for that parent. If the lookup succeeded, record the row number for that parent.

bhaller commented 3 years ago

Ah, we don't build the individuals table at save time, that's the population table I'm thinking of. We maintain it by adding remembered individuals, etc., as we go. Then yes, the hash table should be kept maintained in parallel with the individuals table itself. It looks to me like we're on the same page. I'll give it a try tomorrow.

bhaller commented 3 years ago

(Might be simpler to just build the hash table at save time, though, no? Is there any drawback to that really?)

petrelharp commented 3 years ago

(Might be simpler to just build the hash table at save time, though, no? Is there any drawback to that really?)

You mean not fill out individual parents until the tree sequence is written out? But, we don't have the parent's ids any more at that time? You can't stick in pedigreeIDs into the parent column temporarily, either, since simplify will try to remap them...

bhaller commented 3 years ago

(Might be simpler to just build the hash table at save time, though, no? Is there any drawback to that really?)

You mean not fill out individual parents until the tree sequence is written out? But, we don't have the parent's ids any more at that time? You can't stick in pedigreeIDs into the parent column temporarily, either, since simplify will try to remap them...

Hmm. The parent IDs are in their metadata, aren't they? So when it comes time to save, we can build the hash table by scanning through the individuals table as it presently exists, and mapping SLiM pedigree IDs (found in the metadata) to row numbers. Then create the parent column and fill in the appropriate values, at that time. Does that not work for some reason?

petrelharp commented 3 years ago

Hmm. The parent IDs are in their metadata, aren't they?

No? Just the pedigreeIDs, not the parent's IDs? - see here for instance.

bhaller commented 3 years ago

Oh, I see what you mean. Yeah, right. OK. I see. So for this scheme to work, none of the info can be built at save time; we need to keep the hash table of pedigree ID -> tskit row up to date at all times, and keep the parents column in the individuals table correct at all times. That way simplify will correctly translate the parent row numbers across simplifications for us, which is information we would not be able to reconstruct ourselves. So whenever we record a new individual in the table, we need to put the correct parent column info in right at the time. Yes? Or am I still confused?

bhaller commented 3 years ago

Dealing with the reorderings of the individuals table will be rather a pain.

bhaller commented 3 years ago

Going to/from ASCII shouldn't require any intelligence should it? Because that doesn't involve a change of order. Going to ASCII would just convert the parent field values to ASCII, and vice versa from coming from ASCII. So it's only ReorderIndividualTable() that is really a pain in the butt. But tractable, just a pain. OK, I think I understand how to proceed, thanks for the advice. I'll give it a whirl and see if I get stuck. Not sure how I'm going to test it, apart from "does it crash?" :->

petrelharp commented 3 years ago

I don't think that it's that much of a pain, because we just reconstruct the hash table again after it gets reordered, like

for j in seqLen(nrow(individual_table)):
  hash[j] = individual_table[i].metadata['pedigreeID']

petrelharp commented 3 years ago

And, whoops - by "save time" I thought you meant WriteTreeSeq; did you mean RememberIndividuals? On second thought, that might be an even better strategy - just build the hash table at the start of RememberIndividuals and use it within the function.

bhaller commented 3 years ago

And, whoops - by "save time" I thought you meant WriteTreeSeq; did you mean RememberIndividuals? On second thought, that might be an even better strategy - just build the hash table at the start of RememberIndividuals and use it within the function.

Hmm, I think it might be good to zoom about this to arrive at the correct design.

bhaller commented 3 years ago

OK, from zoom. In ReorderIndividualTable() we need to remap the parent column indexes, which we probably already have a map we can use to do; there is no way to just rebuild the parent column without this remapping, since we no longer have parent information for dead individuals stored anywhere else besides the existing parent column. Outside of that method, we will need to rebuild the hash table (from pedigree ID to row number) after simplify, from scratch, which we can do from the pedigree IDs stored in the metadata of the individuals; simplify will take care of remapping the parent column row numbers for us. Otherwise, we will keep the hash table up to date as we add new individuals to the individuals table so we don't have to rebuild the hash table (which is potentially expensive); and we will put correct parent information into the individuals table whenever we add an individual to it. Search for tsk_individual_table to find relevant stuff. Do timing tests to determine the performance impact of all this. Do it as a PR so @petrelharp can review it and test it prior to commit. That's the plan. Did I get that right, @petrelharp?

bhaller commented 3 years ago

@petrelharp wrote in Slack: "The parent array is a ragged array, so we can put in any number of parents. =) I guess I vote to put no parents when there is no parent (eg for initial parents and one parent for clonal models); and you should record the same parent twice in the case of selfing. This way, NULL (ie -1) is reserved for "there was a parent here but it got simplified away"."

bhaller commented 3 years ago

Working on this now. It is possible for an individual to be recorded that has parents, but the parents are not in the individuals table; in that case the parent column should have TSK_NULL, right? So TSK_NULL isn't really reserved for "there was a parent here but it got simplified away", right? More like "a parent existed but we don't know who it was"?

Also: you can record an individual whose parents are not recorded, and then later you record the parents, so the parents actually are now in the table, but the child has TSK_NULL for its parents since they were not recorded at the time it was recorded, right? We don't want to make any attempt to fix that automatically, right? What about if the script then remembers the child again – we update its spatial location and metadata, so we should probably update its parent info too, right?

Right now, AddIndividualsToTable() constructs a new, temporary hash table (tabled_individuals_lookup) that maps from pedigree ID to tskit individuals table row. That is exactly the hash table we're now proposing to add permanently. This makes me realize that we actually have a choice:

We could use tabled_individuals_lookup as it exists now, and not keep a permanent hash table, couldn't we? The benefits of this would be (1) perhaps lower memory usage since the hash table would be constructed on demand (but the high-water mark might be the same), (2) no overhead for re-creating the hash table after simplification. In ReorderIndividualTable() we would construct tabled_individuals_lookup, on demand, then reorder the individuals table, patch up the parent entries, and throw tabled_individuals_lookup away – just as we already do in AddIndividualsToTable(). I feel like you're going to tell me that we don't have the necessary info to do that, which is why we were talking about making a permanent hash table – but don't we have all the info?
We could remove tabled_individuals_lookup in favor of the permanent hash table we've been talking about. This has the benefit that we don't have to create the hash table on demand, over and over each time we need it. However, we need it only when (1) loading, (2) saving, or (3) remembering/retaining individuals, right? Those are all uncommon actions for most models, so I'm not sure we ought to be optimizing for them; keeping simplification as fast as possible seems more important, so avoiding having to rebuild the hash table after each simplification seems like a bigger concern.

Thoughts?

bhaller commented 3 years ago

(On the other hand: (1) the existing build-on-demand code makes AddIndividualsToTable() rather ugly; it would be very nice if AddIndividualsToTable() could just assume the existence of that tables, and it would probably make a substantial positive speed difference for models that do frequently remember individuals, which maybe I am too quick to dismiss as a fringe problem. Snd (2) rebuilding the hash after simplify should be very fast if the individuals table contains nothing but first-gen individuals, so I'm probably worrying about that too much; and if it contains lots of other individuals that makes rebuilding the table slower, that means we're doing a lot of remembering, so the speed benefit of keeping the permanent table should outweigh the speed penalty of rebuilding it after simplification. So, hmm., I'm talking myself into the second option above, especially since it is definitely the cleaner and more aesthetic design...)

bhaller commented 3 years ago

OK, so I decided to go forward with design 2 – the permanent hash table that we've been talking about. It is now being maintained properly as individuals get added, and across simplification, including updating the parents entries when an individual gets re-remembered. The only thing I haven't dealt with yet is ReorderIndividualTable(), since it gets called only on .trees write/read, which I can avoid in my tests. :-> (Also TreeSequenceDataFromAscii() and TreeSequenceDataToAscii() have to be revised to handle the parents column.)

I've got benchmark data from four test models. I'll go through it in detail here, since I think it will be of interest to @petrelharp and @hyanwong. All times are in seconds. Model details are adjusted to emphasize different things while keeping overall runtimes in the range of 60-90 seconds.

Benchmark 1:

initialize() {
    setSeed(0);
    initializeTreeSeq(simplificationRatio=INF);
    initializeMutationRate(0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeRecombinationRate(1e-8);
}
1 {
    sim.addSubpop("p1", 10000);
}
late() {
    if (sim.generation % 100 == 0)
        catn(sim.generation + "... ");
}
30000 late() { catn(); }

Times (before): 66.15, 65.11, 66.31 Times (after): 64.29, 68.92, 63.88

This benchmark is a tree-seq model with no remembering and no simplifying. Since individuals aren't getting added to the individuals table, the expected impact of my changes is nil, and that's what we see. Basically a control.

Benchmark 2:

initialize() {
    setSeed(0);
    initializeTreeSeq(simplificationRatio=INF);
    initializeMutationRate(0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeRecombinationRate(1e-8);
}
1 {
    sim.addSubpop("p1", 10000);
}
late() {
    sim.treeSeqRememberIndividuals(p1.individuals);
    if (sim.generation % 100 == 0)
        catn(sim.generation + "... ");
}
500 late() { catn(); }

Times (before): 64.48, 64.84, 64.95 Times (after): 4.39, 4.47, 4.52

This benchmark is very similar to 1, but now we remember every individual in every generation, making the individuals table (a) large and (b) constantly changing. The times above show that the new design is a huge win for models that do a lot of remembering. The old code would rebuild the individuals hash table every generation, inside treeSeqRememberIndividuals(), and the bigger the individuals table gets the longer that takes. Now it is kept up-to-date, which saves all of that work.

Benchmark 3:

initialize() {
    setSeed(0);
    initializeTreeSeq(simplificationInterval=1);
    initializeMutationRate(0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeRecombinationRate(1e-8);
}
1 {
    sim.addSubpop("p1", 5000);
}
late() {
    if (sim.generation % 100 == 0)
        catn(sim.generation + "... ");
}
20000 late() { catn(); }

Times (before): 87.41, 85.22, 86.80 Times (after): 87.41, 87.57, 86.21

This benchmark is another control: frequent simplification (every generation), but without any remembering. My changes should have little effect here, and that's what we see. This illustrates that the impact of these changes for treeseq models that don't do much remembering ought to be negligible.

Benchmark 4:

initialize() {
    setSeed(0);
    initializeTreeSeq(simplificationInterval=1);
    initializeMutationRate(0);
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeRecombinationRate(1e-8);
}
1 {
    sim.addSubpop("p1", 5000);
}
late() {
    sim.treeSeqRememberIndividuals(p1.individuals);
    if (sim.generation % 10 == 0)
        catn(sim.generation + "... ");
}
200 late() { catn(); }

Times (before): 75.56, 76.05, 75.33 Times (after): 79.67, 79.48, 79.38

This benchmark remembers all individuals in each generation, and simplifies in every generation. The times are consistently a little slower "after" than "before" here. Before and after both entail rebuilding the individuals hash table once per generation – for treeSeqRememberIndividuals() in "before", and for simplify in "after". This is why we don't see a speedup like in benchmark 2. So the slowdown of ~4s here seems likely to reflect the actual cost of the new parent recording – of maintaining the hash table when we add new individuals to the table, and of looking up each remembered individual's parents when we record it, and of dealing with the parents column inside simplify itself. This slowdown, which is approximately 5%, is probably a worst-case scenario; we're simplifying every generation so we're not getting any benefits from the new design, only drawbacks.

So, my conclusion is that the new overhead of parent recording is quite small, and will almost always be more than compensated for by speedups due to the new hash table design. So, no need to make this an optional feature, it can be on all the time. I should have it finished fairly soon, and I'll make a PR for review when I do.

bhaller commented 3 years ago

Earlier I wrote:

Also: you can record an individual whose parents are not recorded, and then later you record the parents, so the parents actually are now in the table, but the child has TSK_NULL for its parents since they were not recorded at the time it was recorded, right?

So, I've got crosscheck working with the parents column now (checking that SLiM's record of the parents of an individual match tskit's record). The above issue seems to be easy to trigger in crosscheck, such as with this model:

initialize() {
    initializeSLiMModelType("nonWF");
    initializeTreeSeq(simplificationRatio=INF, runCrosschecks=T);
    defineConstant("K", 500);
    initializeMutationType("m1", 0.5, "f", 0.0);
    m1.convertToSubstitution = T;
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeMutationRate(1e-7);
    initializeRecombinationRate(1e-8);
}
reproduction() {
    subpop.addCrossed(individual, subpop.sampleIndividuals(1));
}
1 early() {
    sim.addSubpop("p1", 10);
}
early() {
    p1.fitnessScaling = K / p1.individualCount;
}
10:15 late() {
    memorable = sample(p1.individuals, 100);
    sim.treeSeqRememberIndividuals(memorable);
}
20 late() {
}

This is just an overlapping-gens nonWF model that chooses a random sample of individuals and remembers them. Just by bad luck, it will often happen to remember an individual A first, and then later remember an individual B that is A's parent but was not previously remembered. Now A's record is incorrect; B is A's parent, and B is in the individuals table, but B is not recorded in the parents column as being the parent of A.

We make no attempt to fix this when it happens, and it would be hard to fix efficiently, since when B is remembered we'd have to go scan through the whole individuals table looking for individuals that have B as a parent but didn't record it because B was not yet remembered. Indeed, this kind of patching up would not even be possible in general, because by the time B gets remembered, its child A might not even be alive any more, and so we might not have SLiM's info indicating that B was actually A's parent. That information has been lost, potentially, by the time B gets remembered.

So... unfortunate. All I can see to do about it is to say "Not a bug; if you want B to be recorded as the parent of A, you have to remember B first, or re-remember A after you remember B." And then relax crosscheck enough that this situation no longer triggers a crosscheck error – unfortunate, since that means crosscheck will also fail to catch some other errors too, but that's life.

But I'm hoping I'm just missing something obvious here...?

ADDENDUM: After pondering this for a while, I realized there is one potential fix that is not unreasonable, and maybe even good. Don't use the parents column at all until save time; instead, put parent pedigree ids into the individuals table metadata column along with the pedigree ids of the individuals themselves. With that information in the table, we could construct the correct parents column info at save time, with a bit of elbow grease. It would take about the same amount of memory (just in metadata instead of parents), it would actually run faster (because simplify wouldn't see the parent information, and wouldn't need to do anything with it at all), and it would fix the bug.

I almost think this is a sign that the design of the parents column in tskit is not optimal. It's capable of representing what we want to represent, but it's not good for recording because of this problem. The fundamental problem is that it uses row numbers to reference the parents of an individual, which means that when you record an individual in the table you can only record its parents if they already have row numbers assigned to them. It might be better if it used more abstract indices for individuals, like SLiM's pedigree IDs, that uniquely exist for every individual whether it has been recorded in the table or not. I don't think tskit has anything like that right now, though, and it might be a rather large paradigm shift. Still, I wonder whether we ought to bring it up with them? How set in stone is the current design of the parent stuff?

Anyway, we don't need tskit to change its design if they don't want to; we can just record parent pedigree ids in the metadata and create the parent column from that info at save time, as outlined above. And although it's annoying, because I'm just about finished with all of this work, I think maybe that would be a better design, and I ought to start over.

petrelharp commented 3 years ago

It is possible for an individual to be recorded that has parents, but the parents are not in the individuals table; in that case the parent column should have TSK_NULL, right? So TSK_NULL isn't really reserved for "there was a parent here but it got simplified away", right? More like "a parent existed but we don't know who it was"?

I agree with this!

We don't want to make any attempt to fix that automatically, right? What about if the script then remembers the child again – we update its spatial location and metadata, so we should probably update its parent info too, right?

And, I agree with this. (... reads more ...) Ah, I see.

Anyway, we don't need tskit to change its design if they don't want to; we can just record parent pedigree ids in the metadata and create the parent column from that info at save time, as outlined above. And although it's annoying, because I'm just about finished with all of this work, I think maybe that would be a better design, and I ought to start over.

That change would be a major change to tskit in general, and could have major speed implications (since to go find the parent you'd have to do a lookup instead of just an index). It's a great point, and maybe means that the optimal internal representation of all this in SLiM is not exactly the tables. But your proposal to stick the parent's pedigreeID in metadata sounds just right. It'd require having two metadata structs - one for internal use, and one for writing out (that removes the parent's pedigree ID... unless you propose to keep that in there permanently?). But all that might be no worse than figuring out how to relax crosscheck, and is cerntainly nicer. I do still think that saying 'if the parent wasn't there when you remembered, then it won't be recorded" is acceptable, but not for any good conceptual reason, so fixing it is probably the way to go. Sigh. Sorry I didn't see that coming.

bhaller commented 3 years ago

I agree with this! And, I agree with this. (... reads more ...) Ah, I see.

OK, great.

...But your proposal to stick the parent's pedigreeID in metadata sounds just right. It'd require having two metadata structs - one for internal use, and one for writing out (that removes the parent's pedigree ID... unless you propose to keep that in there permanently?)... I do still think that saying 'if the parent wasn't there when you remembered, then it won't be recorded" is acceptable, but not for any good conceptual reason, so fixing it is probably the way to go. Sigh. Sorry I didn't see that coming.

Yeah, it wouldn't be the worst thing, but it would surprise people and doubtless cause us heartache down the road. Better to fix it now. No worries, I didn't see it coming either – this sort of thing is hard to anticipate! :->

So the big question is whether to keep this info in metadata only internally, or to add it to the public metadata. I was rather thinking the latter – actually, the former didn't even occur to me :->. It does make the metadata that much larger, but most models remember few or no individuals; the overhead would only affect models that remember most/all individuals. And it could be useful information to have in there; it would be a record of pedigree/parentage that would be more informative than tskit's record in some ways, if simplification gets rid of most individuals. For example, (a) the user could log out information on individuals to a file on the side, using pedigree IDs, and then look up information about the parents of individuals that are retained, using the parental pedigree IDs, or (b) the user could tell "these two individuals shared the same parent" by comparing parental pedigree IDs. These things could be done even if the parents themselves had been simplified away.

Those benefits are pretty marginal, though, I guess. What do you think? You've been doing a lot of "retain every individual" models; would this increase in the metadata size matter to you? Would the additional information be useful to you?

bhaller commented 3 years ago

From a chat: we are agreed that we should add the parent pedigree IDs to the individuals table metadata permanently (i.e., it gets saved to disk), rather than temporarily (i.e., adding it just internally to SLiM and removing it prior to save). It won't take much extra space, might be useful to someone, and this implementation is simpler than modifying the metadata before saving.

bhaller commented 3 years ago

Fixed in https://github.com/MesserLab/SLiM/commit/e0abc33db1a4442ee8dd6bce6d0b66bb53a62eb1 rather than in the PR, due to git headaches. The PR has discussion, though.

MesserLab / SLiM

Should write our pedigree data to .trees files #152