genome id mismatch error with SLiM 4.1

MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.

https://messerlab.org/slim/

GNU General Public License v3.0

160 stars 30 forks source link

genome id mismatch error with SLiM 4.1 #419

Closed ShyamieG closed 8 months ago

ShyamieG commented 9 months ago

EDIT: It turns out this error was caused by my script, which re-labels nodes within the tree sequence to eliminate duplicate slim ids. In SLiM 4.1, a mismatch between node slim id and individual pedigree id throws this error.

Hi again,

I'm running into an issue with at a previously error-free step after updating to SLiM 4.1 (specifically, the version that was just patched to fix this issue)

A couple of my steps (i.e. 2 out of >400) now throw this error:

ERROR (Species::__TabulateSubpopulationsFromTreeSequence): genome id mismatch; this file cannot be read.

I've included the input files and code to reproduce the error:

slim -l 0 -seed 15567 -d "IN='H4152'" -d "infile='inf-270_V5490_to_H4152_on_day_9_simplified_annotated.ts'" -d "inf_idx='283'" -d "out=c('V1815')" -d "sampling_events=c(42)" -d "outpath='.'" -d "mu='3.00E-10'" -d "positions_file='PvP01_genome_positions'" -d "N_transmitted='20'" -d "N_reactivated='1'" -d "N_sampled='100'" -d "t_liver_stage='7'" -d "N_liver_final='2.00E+04'" -d "t_to_99_blood_max='7'" -d "t_RBC_period='2'" -d "N_blood_max='1.00E+05'" -d "N_RBC_clones='16'" -d "t_gameto_lifespan='5'" -d "p_RBC_to_gameto='0.01'" -d "t_start_clearance='40'" -d "t_inf_half_life='50'" hhar.slim

In case it is helpful, here is another example that runs without an error:

slim -l 0 -seed 79255 -d "IN='H7005'" -d "infile='inf-9876_V6888_to_H7005_on_day_11_simplified_annotated.ts'" -d "inf_idx='9902'" -d "out=c('V5480')" -d "sampling_events=c(116)" -d "outpath='.'" -d "mu='3.00E-10'" -d "positions_file='PvP01_genome_positions'" -d "N_transmitted='20'" -d "N_reactivated='1'" -d "N_sampled='100'" -d "t_liver_stage='7'" -d "N_liver_final='2.00E+04'" -d "t_to_99_blood_max='7'" -d "t_RBC_period='2'" -d "N_blood_max='1.00E+05'" -d "N_RBC_clones='16'" -d "t_gameto_lifespan='5'" -d "p_RBC_to_gameto='0.01'" -d "t_start_clearance='40'" -d "t_inf_half_life='50'" hhar.slim

Any idea why this might be happening? Thanks!

bhaller commented 9 months ago

Hi @ShyamieG! Thanks for the report. I'm going to ping @petrelharp here, since this one does look like a tree-sequence recording issue that I might need help with. :->

It looks to me like the error message is correct. There is a situation in the input tree-sequence file where the node metadata has a genome ID of 9071947, but the corresponding individual's metadata has a pedigree ID of 4532118. The nodes of an individual are supposed to have genome IDs that are 2n and 2n+1, where n is the individual's pedigree ID. That relationship between these IDs is set up by SLiM when the nodes and the individual are created, and they should always be true. 2*4532118 is 9064236, so that relationship is not preserved in your input tree sequence file.

So, the question goes down a level, to: where did this tree-sequence file come from, and how was it produced? It appears to be corrupted, in the sense that its metadata is not correct.

petrelharp commented 9 months ago

I agree - that's the question. I had a look at the tree sequence in question and didn't spot anything useful or suspicious other than that.

bhaller commented 9 months ago

I agree - that's the question. I had a look at the tree sequence in question and didn't spot anything useful or suspicious other than that.

Can you confirm the metadata mismatch directly, outside of SLiM's reading code? I.e., can you extract node and individual metadata that ought to correspond, in Python, and confirm that the input file is indeed corrupted? Because another possibility is that the file is fine but SLiM is somehow getting confused and screwing up the IDs.

petrelharp commented 9 months ago

Can you confirm the metadata mismatch directly, outside of SLiM's reading code?

Yes, I did that - and, there's only the one individual that's messed up, oddly.

ShyamieG commented 9 months ago

So, the question goes down a level, to: where did this tree-sequence file come from, and how was it produced? It appears to be corrupted, in the sense that its metadata is not correct.

Below is the code to reproduce the entire sequence up to and including the error:

# Seed step
start=$SECONDS
slim -l 0 -seed 6176 -d "IN='Hseed'" -d "infile='NULL'" -d "inf_idx='H-seed'" -d "out=c('V1204','V6043')" -d "sampling_events=c(17,34)" -d "outpath='.'" \
                                  -d "N_founders=20" \
                                  -d "mu='3.00E-10'" \
                                  -d "positions_file='PvP01_genome_positions'" \
                                  -d "N_transmitted='20'" \
                                  -d "N_reactivated='1'" \
                                  -d "N_sampled='100'" \
                                  -d "t_liver_stage='7'" \
                                  -d "N_liver_final='2.00E+04'" \
                                  -d "t_to_99_blood_max='7'" \
                                  -d "t_RBC_period='2'" \
                                  -d "N_blood_max='1.00E+05'" \
                                  -d "N_RBC_clones='16'" \
                                  -d "t_gameto_lifespan='5'" \
                                  -d "p_RBC_to_gameto='0.01'" \
                                  -d "t_start_clearance='40'" \
                                  -d "t_inf_half_life='50'" \
                                  hhar.slim
python simplify_trees.py inf-H-seed_Hseed_to_V1204_on_day_17.ts inf-H-seed_Hseed_to_V6043_on_day_34.ts
end=$SECONDS
time_elapsed=`echo $end - $start | bc`
echo "hhar run took $time_elapsed seconds"

# inf_61
start=$SECONDS
python annotate_nodes.py --file "inf-H-seed_Hseed_to_V1204_on_day_17_simplified.ts" --inf_id "H-seed" --simulation_tag "15"
slim -l 0 -seed 13926 -d "IN='V1204'" -d "infile='inf-H-seed_Hseed_to_V1204_on_day_17_simplified_annotated.ts'" -d "inf_idx='61'" -d "out=c('H7058','H7295')" -d "sampling_events=c(13,35)" -d "outpath='.'" \
                                  -d "mu='3.00E-10'" \
                                  -d "positions_file='PvP01_genome_positions'" \
                                  -d "rates_file='vivax_genome_rates'" \
                                  -d "N_transmitted='20'" \
                                  -d "p_males='0.25'" \
                                  -d "p_zygotes='0.1'" \
                                  -d "N_sporozoites='4000'" \
                                  -d "p_gut_saliva='0.2'" \
                                  -d "N_sampled='100'" \
                                  mdsr.slim
python simplify_trees.py inf-61_V1204_to_H7058_on_day_13.ts inf-61_V1204_to_H7295_on_day_35.ts
end=$SECONDS
time_elapsed=`echo $end - $start | bc`
echo "mdsr run took $time_elapsed seconds"

# inf_135
start=$SECONDS
python annotate_nodes.py --file "inf-61_V1204_to_H7295_on_day_35_simplified.ts" --inf_id "61" --simulation_tag "15"
slim -l 0 -seed 15464 -d "IN='H7295'" -d "infile='inf-61_V1204_to_H7295_on_day_35_simplified_annotated.ts'" -d "inf_idx='135'" -d "out=c('V9260','V5490')" -d "sampling_events=c(21,111)" -d "outpath='.'" \
                                  -d "mu='3.00E-10'" \
                                  -d "positions_file='PvP01_genome_positions'" \
                                  -d "N_transmitted='20'" \
                                  -d "N_reactivated='1'" \
                                  -d "N_sampled='100'" \
                                  -d "t_liver_stage='7'" \
                                  -d "N_liver_final='2.00E+04'" \
                                  -d "t_to_99_blood_max='7'" \
                                  -d "t_RBC_period='2'" \
                                  -d "N_blood_max='1.00E+05'" \
                                  -d "N_RBC_clones='16'" \
                                  -d "t_gameto_lifespan='5'" \
                                  -d "p_RBC_to_gameto='0.01'" \
                                  -d "t_start_clearance='40'" \
                                  -d "t_inf_half_life='50'" \
                                  hhar.slim
python simplify_trees.py inf-135_H7295_to_V9260_on_day_21.ts inf-135_H7295_to_V5490_on_day_111.ts
end=$SECONDS
time_elapsed=`echo $end - $start | bc`
echo "hhar run took $time_elapsed seconds"

# inf_270
start=$SECONDS
python annotate_nodes.py --file "inf-135_H7295_to_V5490_on_day_111_simplified.ts" --inf_id "135" --simulation_tag "15"
slim -l 0 -seed 13684 -d "IN='V5490'" -d "infile='inf-135_H7295_to_V5490_on_day_111_simplified_annotated.ts'" -d "inf_idx='270'" -d "out=c('H4152')" -d "sampling_events=c(9)" -d "outpath='.'" \
                  -d "mu='3.00E-10'" \
                  -d "positions_file='PvP01_genome_positions'" \
                  -d "rates_file='vivax_genome_rates'" \
                  -d "N_transmitted='20'" \
                  -d "p_males='0.25'" \
                  -d "p_zygotes='0.1'" \
                  -d "N_sporozoites='4000'" \
                  -d "p_gut_saliva='0.2'" \
                  -d "N_sampled='100'" \
                  mdsr.slim
python simplify_trees.py inf-270_V5490_to_H4152_on_day_9.ts
end=$SECONDS
time_elapsed=`echo $end - $start | bc`
echo "mdsr run took $time_elapsed seconds"

# inf_283
python annotate_nodes.py --file "inf-270_V5490_to_H4152_on_day_9_simplified.ts" --inf_id "270" --simulation_tag "15"
slim -l 0 -seed 15567 -d "IN='H4152'" -d "infile='inf-270_V5490_to_H4152_on_day_9_simplified_annotated.ts'" -d "inf_idx='283'" -d "out=c('V1815')" -d "sampling_events=c(42)" -d "outpath='.'" \
          -d "mu='3.00E-10'" \
          -d "positions_file='PvP01_genome_positions'" \
          -d "N_transmitted='20'" \
          -d "N_reactivated='1'" \
          -d "N_sampled='100'" \
          -d "t_liver_stage='7'" \
          -d "N_liver_final='2.00E+04'" \
          -d "t_to_99_blood_max='7'" \
          -d "t_RBC_period='2'" \
          -d "N_blood_max='1.00E+05'" \
          -d "N_RBC_clones='16'" \
          -d "t_gameto_lifespan='5'" \
          -d "p_RBC_to_gameto='0.01'" \
          -d "t_start_clearance='40'" \
          -d "t_inf_half_life='50'" \
          hhar.slim
python simplify_trees.py inf-283_H4152_to_V1815_on_day_42.ts

This represents 5 SLiMulations that are linked by passing along an evolving tree sequence from one to the next. Within 3 of the 5 SLiM steps, the infection (pathogen population) splits into two lineages. I wrote the annotate_nodes.py and simplify_trees.py scripts to keep track of all the information I would need to be able to merge all the tree sequences back together at the end of the entire set of runs - this sequence follows just one of the paths.

I have already had a lot of help from Jerome Kelleher and @petrelharp to figure out how to go about doing this - see here, here and here for just a few examples. Briefly, the annotation step creates/maintains a dictionary that uniquely maps each node's SLiM id to the specific step in which it was created. Given that SLiM ids can be duplicated across parallel runs, this allows me create an accurate node map needed for tskit's merge() function. I created the 'simplification' step (a bit unfortunately named) because I needed a way to keep track of nodes that might get simplified out by SLiM because they are unnecessary to represent a particular tree sequence, but that ARE necessary to correctly represent its relationship to other related tree sequences in the full simulation (for example, when multiple infections branch off from a common host). I don't want to get too in the weeds here, but did want to give you some context for what the python scripts are doing in case those are messing something up.

I hope this is helpful, and I'm happy to provide any other info you might need. Thanks also for your response to #416 - I will work through your suggestions and get back to you on all of that.

EDIT - Forgot to attach the files!

petrelharp commented 9 months ago

Ah - perhaps you could check after each step to see when exactly the bad individuals come in? Here's some code to find such individuals:

bad = []

for i in ts.individuals():
    for ni in i.nodes:
        n = ts.node(ni)
        if n.metadata['slim_id'] // 2 != i.metadata['pedigree_id']:
            bad.append((i, n))

# # to check for an error::
# assert len(bad) == 0

for i, n in bad:
    print("---------------")
    print(i)
    print(n)

ShyamieG commented 9 months ago

Thanks for the suggestion @petrelharp - that was very helpful! The problem is coming in at the 'simplify_trees.py' step right before the SLiM run that throws the error. So it is indeed my python script messing things up. I will take a closer look tomorrow.

Out of curiosity, any ideas about why I wasn't getting this error with SLiM 4.0.1?

bhaller commented 9 months ago

Out of curiosity, any ideas about why I wasn't getting this error with SLiM 4.0.1?

Well, hmm. I don't remember making the error checking during tree-seq loading stricter, but from 4.0.1 to 4.1 is more than a year of accumulated work, so perhaps I did and I have forgotten. So that's one possibility. If you updated msprime/pyslim/tskit at the same time that you updated to SLiM 4.1, then perhaps a change in those packages is responsible. I don't know. Looks like the check that is raising has been there since 2018.

ShyamieG commented 8 months ago

Ahhh, okay. My simplify_trees.py script does re-label nodes if there are duplicate SLiM ids in the tree sequence, so that is certainly what is causing this issue. I didn't realize that the node metadata slim_id and the individual pedigree_id had this relationship and that it was important. It seems that some error checking step was added between 4.0.1 and 4.1 to catch the discrepancy.

In the problematic tree sequence, I have two sets of nodes with the same SLiM ids. One set is associated with an individual, while the other is not:

>>> print(tree.tables.nodes[822,823,9284,9285])
╔══╤═════╤══════════╤══════════╤════╤════════════════════════════════════════╗
║id│flags│population│individual│time│metadata                                ║
╠══╪═════╪══════════╪══════════╪════╪════════════════════════════════════════╣
║0 │    1│         6│        21│   7│{'slim_id': 9064236, 'is_null': False...║
║1 │    1│         6│        21│   7│{'slim_id': 9064237, 'is_null': True,...║
║2 │    0│         0│        -1│ 117│{'slim_id': 9064237, 'is_null': False...║
║3 │    0│         0│        -1│ 117│{'slim_id': 9064236, 'is_null': True,...║
╚══╧═════╧══════════╧══════════╧════╧════════════════════════════════════════╝

If I am understanding everything correctly, it seems that the pedigree id was recycled at some time after the original individual it was assigned to was dropped? If that is the case, is there any way to avoid this to ensure that all SLiM ids within a tree sequence are unique?

Thanks as always for your insight - this has been very helpful.

bhaller commented 8 months ago

If I am understanding everything correctly, it seems that the pedigree id was recycled at some time after the original individual it was assigned to was dropped? If that is the case, is there any way to avoid this to ensure that all SLiM ids within a tree sequence are unique?

Pedigree IDs will be unique within any one simulation in SLiM. If you are joining simulations together, or tricky things like that, then you do need to ensure that they use unique IDs (and that the node/individual relationship is preserved). I think pyslim has an API for finding out the last used pedigree ID, so that you can resume at that point, but I'm not seeing it in the pyslim doc at https://tskit.dev/pyslim/docs/stable/python_api.html; @petrelharp sorry, can you tell me this again?

Once this issue is resolved, please make it clear on #416 whether that issue is also resolved by this issue. Thanks!

petrelharp commented 8 months ago

There's no such API - we do have the function that tells us the largest mutation ID, not individual (pedigree) ID.

I don't have the big picture of what's going on in your script to advise on how to join things together correctly? (I know that maybe I knew before, but not any more...)

bhaller commented 8 months ago

There's no such API - we do have the function that tells us the largest mutation ID, not individual (pedigree) ID.

Oh, right. Is that in the pyslim docs that I linked above? I don't see it there. If I had seen it in the docs, it probably would've jogged my memory on this...

ShyamieG commented 8 months ago

There's no such API - we do have the function that tells us the largest mutation ID, not individual (pedigree) ID.

Oh, right. Is that in the pyslim docs that I linked above? I don't see it there. If I had seen it in the docs, it probably would've jogged my memory on this...

this? https://tskit.dev/pyslim/docs/latest/python_api.html#pyslim.next_slim_mutation_id

ShyamieG commented 8 months ago

There's no such API - we do have the function that tells us the largest mutation ID, not individual (pedigree) ID.

I don't have the big picture of what's going on in your script to advise on how to join things together correctly? (I know that maybe I knew before, but not any more...)

In the example posted, I'm just chaining together a series of SLiM simulations - except for the first SLiMulation, each step reads in the tree sequence generated by the step before it and proceeds from there.

The issue I'm seeing is that sometimes there are duplicate SLiM ids within the same tree sequence. In the examples I've looked at, there is always one set of SLiM ids associated with an individual while the other set is not:

╔══╤═════╤══════════╤══════════╤════╤════════════════════════════════════════╗
║id│flags│population│individual│time│metadata                                ║
╠══╪═════╪══════════╪══════════╪════╪════════════════════════════════════════╣
║0 │    1│         7│      3660│  79│{'slim_id': 51734614, 'is_null': Fals...║
║1 │    1│         7│      3660│  79│{'slim_id': 51734615, 'is_null': True...║
║2 │    1│         7│      3661│  79│{'slim_id': 51734862, 'is_null': Fals...║
║3 │    1│         7│      3661│  79│{'slim_id': 51734863, 'is_null': True...║
║4 │    1│         0│        -1│  61│{'slim_id': 51734862, 'is_null': Fals...║
║5 │    1│         0│        -1│  61│{'slim_id': 51734863, 'is_null': True...║
║6 │    1│         0│        -1│  62│{'slim_id': 51734614, 'is_null': Fals...║
║7 │    1│         0│        -1│  62│{'slim_id': 51734615, 'is_null': True...║
╚══╧═════╧══════════╧══════════╧════╧════════════════════════════════════════╝

This is not a problem for me, per se, but I would have to update my workflow to work around it as currently it assumes that SLiM ids within a given tree sequence are unique. This is why I previously had a step to re-label duplicate nodes (which ended up corrupting my ts file). However, if you can think of some way to ensure that SLiM ids within a tree sequence are unique, I'd be open to trying that first.

In the example above, I don't understand how the second set of nodes (with times of 61/62) could have been assigned the same slim ids as those born before them (with time 79). The corresponding pedigree ids were already assigned to individuals 3660 and 3661:

╔══╤══════╤═════════════╤══════════╤════════════════════════════════════════╗
║id│flags │location     │parents   │metadata                                ║
╠══╪══════╪═════════════╪══════════╪════════════════════════════════════════╣
║0 │393216│0.0, 0.0, 0.0│3672, 3672│{'pedigree_id': 25867307, 'pedigree_p...║
║1 │393216│0.0, 0.0, 0.0│3751, 3751│{'pedigree_id': 25867431, 'pedigree_p...║
╚══╧══════╧═════════════╧══════════╧════════════════════════════════════════╝

Those individuals still exist in the tree sequence, and yet their SLiM ids were re-used for nodes that are not associated with an individual at all...

bhaller commented 8 months ago

There's no such API - we do have the function that tells us the largest mutation ID, not individual (pedigree) ID.

Oh, right. Is that in the pyslim docs that I linked above? I don't see it there. If I had seen it in the docs, it probably would've jogged my memory on this...

this? https://tskit.dev/pyslim/docs/latest/python_api.html#pyslim.next_slim_mutation_id

Ah, there it is. I scrolled through the whole document looking for it, but I guess I scrolled right over it. I also searched for "last" but it starts with "next", lol. @petrelharp is there any particular reason that you don't put all of the APIs in the API summary at the top of the page, but only a subset of them? I looked in that table, too, and didn't see the method there. :-O

As for the rest of it, let's see what Peter has to say about it; I'm not sure what's going on.

bhaller commented 8 months ago

Ah, wait, I have an idea. OK, how about this, @petrelharp? She's loading in a previous .trees file, which has some ancestral genomes that have a genome pedigree ID but no corresponding individual in the individuals table, I guess. The code that reads in a tree sequence keeps track of the largest individual pedigree ID seen (at species.cpp:7754), so that we start using new pedigree IDs after that number. But if there is a genome pedigree ID that is larger (after being divided by two and rounding down) than the largest individual pedigree ID in the file, we don't notice that. We'd start at one plus the largest individual pedigree ID, and end up re-using the old genome pedigree ID eventually.

I'm surprised this has never bitten us before; this bug has been around from the very start, I think. For some reason, I think we assumed that the largest individual ID would be >= the largest corresponding genome ID, but I guess that is not necessarily true? (Pondering it, I bet we assumed that when we implemented tree-sequence recording for WF models, and then didn't revisit the assumption when we implemented it for nonWF models?) When exactly does this circumstance arise? Anyhow, this seems very likely to be the bug, right? Happily, it's an easy fix. I'll work on it.

petrelharp commented 8 months ago

Um, this sounds right but I don't actually see how it could happen - for this to be the problem, we need an ancestral individual with individual ID n such that:

the oldest individual alive at the end of the simulation has ID less than n, and
there are individuals alive at the end of the simulation that are descended from individual n.

Point (1) ensures that when we restart, individual ID n is available for re-use, but point (2) ensures that the individual's genomes are still present (or at least might be, if one is a coalescent node).

However, I don't think that's possible, since any descendants of n must have ID greater than n? I feel like I'm missing something?

The check you propose sounds like an important check to make - tree sequences provided from other sources wouldn't necessarily have this property - but I'm not convinced yet it's what's going on here.

bhaller commented 8 months ago

Yes, I'm having trouble visualizing exactly when this condition would occur, but perhaps it does somehow. No harm in adding the code, I guess. Certainly it seems like something is going wrong with the "next pedigree ID" mechanism, and this is the only thing I can think of...

bhaller commented 8 months ago

OK, in PR #420 I have now added a check for node pedigree IDs that are higher than expected. At this point, @ShyamieG, can you switch to the branch for the PR (the branch is named check_node_ids) and re-run your process from the start, using SLiM build from that branch? I'd like to know whether SLiM ever, in fact, notices this condition occurring (in which case it will emit a #WARNING line to its output, #WARNING (Species::__CheckNodePedigreeIDs): in reading the tree sequence, a node was encountered with a genome pedigree ID that was (after division by 2) greater than or equal to the next pedigree ID; this is not expected to happen.), and whether the condition is triggered simply by feeding saved tree sequences back into SLiM, or is only triggered when your tree-sequence modification code is being used. If it only occurs when your script is being used to modify the tree sequence, then I think I will make this condition be an error, rather than a warning, since @petrelharp and I can't think of how it would "naturally" occur. Thanks for your patience as we work through these things one by one; I guess your approach here is poking SLiM in a variety of ways that it is not usually poked!

bhaller commented 8 months ago

Um, this sounds right but I don't actually see how it could happen - for this to be the problem, we need an ancestral individual with individual ID n such that:

the oldest individual alive at the end of the simulation has ID less than n, and

there are individuals alive at the end of the simulation that are descended from individual n.

Point (1) ensures that when we restart, individual ID n is available for re-use, but point (2) ensures that the individual's genomes are still present (or at least might be, if one is a coalescent node).

However, I don't think that's possible, since any descendants of n must have ID greater than n? I feel like I'm missing something?

Well, it is possible to call treeSeqOutput() requesting no simplification, right? Perhaps that changes the picture here, since the putative individual would then not be simplified away? I'm not sure that possibility is relevant in this particular case (I haven't gone back to look at @ShyamieG's scripts to see if she's using that flag, but she hasn't said that she is, I don't think?). But it's relevant in the big picture in any case.

The bigger point being: for the final disposition of the PR I have started, we need to decide whether this condition ought to be a warning, with a fix to make the "next pedigree ID" large enough to avoid a collision, or ought to be an error because it always indicates that the tree sequence has been messed with in some way that has potentially corrupted it (as I guess @ShyamieG's script is doing). What do you think?

It would be nice to have a more complete check for corruption of the tree-sequence data – in particular, checking that no genome ID is shared by more than one genome, and no individual ID is shared by more than one individual – but that check would be pretty time-consuming. I could do it for DEBUG builds only, if you think it is worthwhile. Not sure how many hoops we ought to jump through to check for corrupted tree-sequence data, though – there are many, many ways that that data could be corrupted, I suppose, and checking for all those possibilities is not really SLiM's problem. (Maybe a new utility function in tskit would be useful, and SLiM could call that function when in DEBUG mode? Is that worth a new issue on tskit, do you think?)

The check you propose sounds like an important check to make - tree sequences provided from other sources wouldn't necessarily have this property - but I'm not convinced yet it's what's going on here.

Yes, I agree. It sounds like maybe the tree sequence being loaded has been corrupted by stuff done on the Python side. What I'm wondering now is what the best way to react to that fact is (if it is a fact). :->

ShyamieG commented 8 months ago

Well, it is possible to call treeSeqOutput() requesting no simplification, right? Perhaps that changes the picture here, since the putative individual would then not be simplified away? I'm not sure that possibility is relevant in this particular case (I haven't gone back to look at @ShyamieG's scripts to see if she's using that flag, but she hasn't said that she is, I don't think?). But it's relevant in the big picture in any case.

Just chiming in to say that I do explicitly set the simplify flag in treeSeqOutput() to 'T', but I think that is just the default behaviour. Not sure if/how that changes things.

bhaller commented 8 months ago

Well, it is possible to call treeSeqOutput() requesting no simplification, right? Perhaps that changes the picture here, since the putative individual would then not be simplified away? I'm not sure that possibility is relevant in this particular case (I haven't gone back to look at @ShyamieG's scripts to see if she's using that flag, but she hasn't said that she is, I don't think?). But it's relevant in the big picture in any case.

Just chiming in to say that I do explicitly set the simplify flag in treeSeqOutput() to 'T', but I think that is just the default behaviour. Not sure if/how that changes things.

Thanks. Yes, it would be if you set it to F – not simplifying on save – that maybe there would be more of a possibility for this pedigree ID weirdness to happen naturally. With the right calls to killIndividuals() and such, maybe. But yes, your model isn't doing that.

petrelharp commented 8 months ago

I don't think that simplification or not affects my argument above that such a situation should not ever be produced by SLiM. Once you start merging tree sequence files in python, however, it could totally happen.

Whether or not to make it an error: two questions:

would it lead to errors in SLiM?
would it tend to lead to errors in downstream analysis?

I don't think that (1) would happen, since SLiM does nothing with the genomes that are not in individuals (and does nothing with genomes not in alive individuals). So, that makes the argument to make it a Warning only. However, I think the possibility of (2) is fairly serious, as people will generally assume that the genome IDs are unique. A Warning seems like it would be sufficient, but an Error is safer. (It also protects us against deciding to use genome IDs in the future.)

This would be easier to have an opinion on if we knew exactly how the duplication is getting in there.

petrelharp commented 8 months ago

A "tree sequence validation for SLiM" method is also a good idea: https://github.com/tskit-dev/pyslim/issues/62

ShyamieG commented 8 months ago

I don't think that (1) would happen, since SLiM does nothing with the genomes that are not in individuals (and does nothing with genomes not in alive individuals). So, that makes the argument to make it a Warning only. However, I think the possibility of (2) is fairly serious, as people will generally assume that the genome IDs are unique. A Warning seems like it would be sufficient, but an Error is safer. (It also protects us against deciding to use genome IDs in the future.)

Chiming in again to clarify some things - may or may not be relevant to your discussion.

At this point, I'm not doing any tree sequence merging (I have also asked you a lot about that, but I'm not at that stage here). One tree sequence file is being passed from one SLiM script to another via readFromPopulationFile() and treeSeqOutput(). I think all the issues I've been having stem from what is simplify_trees.py is doing in between SLiM steps.

petrelharp commented 8 months ago

I'm trying to replicate this locally, and am having trouble - running the code above to produce the tree sequences, and am not being able to - but, I might not have the correct versions of the scripts? (It's a bit confusing.) Could you advise on how to replicate with the updated SLiM?

ShyamieG commented 8 months ago

Sorry, the code above is only a snippet of the entire sequence and only goes as far as needed to replicate the original issue (which is now resolved).

Attached is a longer snippet (plus the updated scripts) that gets to the point where the duplicate slim ids crop up. It takes a while to run, so I will attach the output files as well once I've finished running it.

EDIT: Included the wrong version of simplify_trees.py - these are the correct scripts.

EDIT again: Just kidding, it's these

ShyamieG commented 8 months ago

Here are all the output files that the code above generates:

output1.zip output2.zip output3.zip output4.zip output5.zip

bhaller commented 8 months ago

Hi folks. OK, I have merged the PR. @ShyamieG, I recommend that you now work on the head of the master branch. It has all of the fixes resulting from these issues, including the segfault fix, handling the case of a genome ID (divided by 2) being greater than any individual ID in the tree sequence (which was due to a problem with your Python script, I guess, but which it turns out can also occur naturally), and checking for duplicate genome pedigree IDs (which perhaps would have given you a better error message than the out-of-memory error you were getting, as a result of the other problems). I'm going to close this issue now. Thanks for the very useful report, and please let us know if you hit any further problems!