add node pedigree ID check

bhaller commented 8 months ago

Related to #416 and #419, this PR adds a check, on tree sequence load, for nodes with pedigree IDs that (once divided by two to get their corresponding individual pedigree ID) are higher than any individual pedigree ID (meaning that the next pedigree ID used needs to be higher). At present this PR checks for this condition, emits a warning if it sees it, and fixes the problem by setting the next pedigree ID to be sufficiently high. We might want to change this to be an error condition instead, if we can't think of any case when it would legitimately occur.

bhaller commented 8 months ago

OK @petrelharp, it is very easy to make this happen with a pure SLiM model:

initialize() {
    initializeSLiMModelType("nonWF");
    initializeTreeSeq();
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeMutationRate(1e-4);
    initializeRecombinationRate(1e-4);
}
1 late() {
    sim.addSubpop("p1", 1);
}
reproduction() {
    subpop.addCloned(p1.individuals);
}
2 late() {
    // The goal is to end up with genomes in the tree sequence that are more
    // recent (with a larger pedigree ID) than any extant individual.  So we
    // just kill the offspring; its genomes are still in the tree sequence
    // since we don't simplify on save, and they are younger than the extant
    // individual.
    offspring = p1.subsetIndividuals(maxAge=0);
    sim.killIndividuals(offspring);

    sim.treeSeqOutput("~/Desktop/test_ts.trees", simplify=F);
}

You don't even need to do rememberIndividuals(). The only thing needed is to kill the offspring generation, and save with simplify=F. You then load it back in like so:

initialize() {
    initializeSLiMModelType("nonWF");
    initializeTreeSeq();
    initializeMutationType("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 99999);
    initializeMutationRate(1e-4);
    initializeRecombinationRate(1e-4);
}
2 late() {
    sim.readFromPopulationFile("~/Desktop/test_ts.trees");
}

With the code in this PR that does the ID check, you then get:

#WARNING (Species::__CheckNodePedigreeIDs): in reading the tree sequence, a node was encountered with a genome pedigree ID that was (after division by 2) greater than or equal to the next pedigree ID; this is not expected to happen.

So it's not an error condition. At worst it's a warning; I'm not even sure we ought to do that, really. The argument for giving the warning would be that this is expected to be rare, and so it is more likely to be a problem than to be legitimate. I guess using simplify=F is pretty rare, so I guess I'll keep it as a warning? What do you think? I'm tempted to just say "There is nothing wrong, so it shouldn't warn." Anyhow, with the code in this PR we now fix things so the next pedigree ID used by the simulation is correct, so if this is the only "problem" with the file (which isn't truly a problem at all), things will now work correctly in this case.

bhaller commented 8 months ago

Added the check for duplicate genome pedigree IDs.

Changed the warning message for the genome-id-unexpectedly-large case to read: "#WARNING (Species::__CheckNodePedigreeIDs): in reading the tree sequence, a node was encountered with a genome pedigree ID that was (after division by 2) greater than the largest individual pedigree ID in the tree sequence. This is not necessarily an error, but it is highly unusual, and could indicate that the tree sequence file is corrupted. It typically only happens if an unsimplified tree sequence is being loaded, and even then only under certain conditions."

This PR is done, except for decided whether we want that to be a warning at all. Oh, and the duplicate pedigree ID case is completely untested. @petrelharp please review, thanks; and perhaps, in this PR or a new one, you could add a test for the duplicate individual pedigree ID and the duplicate genome pedigree ID cases, just to make sure they both work? I guess you'll need to construct those pathological tree sequence files in Python, they should be impossible to make in SLiM. :->

petrelharp commented 8 months ago

A warning seems fine. How about:

in reading the tree sequence, a node was encountered with a genome pedigree ID that was (after division by 2) greater than the largest individual pedigree ID in the tree sequence.  This is not necessarily an error, but it is highly unusual, and could indicate that the tree sequence file is corrupted.  It may happen due to external manipulations of a tree sequence, or  perhaps if an unsimplified tree sequence produced by SLiM is being loaded.

petrelharp commented 8 months ago

I guess you'll need to construct those pathological tree sequence files in Python, they should be impossible to make in SLiM.

How do you want these? (python code or .trees file?)

bhaller commented 8 months ago

I guess you'll need to construct those pathological tree sequence files in Python, they should be impossible to make in SLiM.

How do you want these? (python code or .trees file?)

I was thinking they'd go into the tests that you maintain, not into my unit tests, no? My tests don't have any way to incorporate either Python code or a .trees file.

MesserLab / SLiM

add node pedigree ID check #420