MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0

error adding individuals to table #416

Closed ShyamieG closed 8 months ago

ShyamieG commented 9 months ago

I'm running into a strange issue running this script (input files are attached here):

slim -l 0 -seed 14081 -d "IN='H5834'" -d "infile='inf-1852_V8157_to_H5834_on_day_23_simplified_annotated.ts'" -d "inf_idx='1893'" -d "out=c('V1945','V105','sample')" -d "sampling_events=c(21,28,61)" -d "outpath='.'" -d "mu='3.00E-10'" -d "positions_file='PvP01_genome_positions'" -d "N_transmitted='20'" -d "N_reactivated='1'" -d "N_sampled='100'" -d "t_liver_stage='7'" -d "N_liver_final='2.00E+04'" -d "t_to_99_blood_max='7'" -d "t_RBC_period='2'" -d "N_blood_max='1.00E+05'" -d "N_RBC_clones='16'" -d "t_gameto_lifespan='5'" -d "p_RBC_to_gameto='0.01'" -d "t_start_clearance='40'" -d "t_inf_half_life='50'" hhar.slim

When running this on my institution's cluster, I get a tskit out of memory error:

Error:tsk_individual_table_add_row: Out of memory. (TSK_ERR_NO_MEMORY)

However, when I moved the input files to my local machine to better troubleshoot, I saw a different error:

Assertion failed: (p_tables->nodes.individual[ind->genome1_->tsk_node_id_] == (tsk_id_t)tsk_individual), function AddIndividualsToTable, file /Users/bhaller/Desktop/SLiM/core/species.cpp, line 4970.
zsh: abort      slim -l 0 -seed 14081 -d "IN='H5834'" -d  -d "inf_idx='1893'" -d  -d  -d  -d

I sought help from my institution's IT department to troubleshoot the error on the cluster, but wasn't able to solve it (increasing the memory request did not help). Have you encountered this kind of issue before?

Here is an example that uses the same underlying SLiM script (but a different input file) and runs without issue:

slim -l 0 -seed 215 -d "IN='H265'" -d "infile='inf-1602_V8865_to_H265_on_day_49_simplified_annotated.ts'" -d "inf_idx='1686'" -d "out=c('V4533','V4861','sample')" -d "sampling_events=c(39,162,168)" -d "outpath='.'" -d "mu='3.00E-10'" -d "positions_file='PvP01_genome_positions'" -d "N_transmitted='20'" -d "N_reactivated='1'" -d "N_sampled='100'" -d "t_liver_stage='7'" -d "N_liver_final='2.00E+04'" -d "t_to_99_blood_max='7'" -d "t_RBC_period='2'" -d "N_blood_max='1.00E+05'" -d "N_RBC_clones='16'" -d "t_gameto_lifespan='5'" -d "p_RBC_to_gameto='0.01'" -d "t_start_clearance='40'" -d "t_inf_half_life='50'" hhar.slim

For context, I'm using SLiM/tskit to chain together a series of simulations where the output of one step becomes the input of another, reproducing an infection spreading through a host and vector population. I wasn't sure if I should tag the tskit devs here, but happy to if that would be helpful.
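(For the record, the chaining itself is driven by a small wrapper that rewrites the `-d` defines at each step. Below is a simplified Python sketch of that idea – the script name, parameter names, and file names are purely illustrative, not my actual pipeline.)

```python
import shlex

def slim_command(script, defines, seed):
    """Build a slim invocation, passing model parameters via -d defines.

    String values are wrapped in single quotes so Eidos receives them
    as strings; everything else is passed through unquoted.
    """
    args = ["slim", "-l", "0", "-seed", str(seed)]
    for name, value in defines.items():
        rhs = f"'{value}'" if isinstance(value, str) else str(value)
        args += ["-d", f"{name}={rhs}"]
    args.append(script)
    return args

# Each step of the chain reads the .ts written by the previous step
# (file names here are made up for illustration):
cmd = slim_command("hhar.slim", {"infile": "step1.ts", "mu": "3.00E-10"}, seed=14081)
print(" ".join(shlex.quote(a) for a in cmd))
```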

bhaller commented 9 months ago

Hi @ShyamieG,

My guess is that the cluster error is accurate: you're running out of memory. It's hard to tell how big your simulation is, off-hand, given the complexity of the script – how large of a genome, how large of a population, how high of a mutation rate, how often tree-sequence simplification is occurring, etc. Perhaps you raised the memory allocated, but it still simply runs out? Another strong possibility is that you're being bitten by a memory leak bug in SLiM. There is this bug, for example: https://github.com/MesserLab/SLiM/issues/404. A fix for that will be released in SLiM 4.1, which is hopefully being released later today (!). You don't say what version of SLiM you're using on your cluster and your local machine, but I'd recommend updating both of them to 4.1 once it is released, and hopefully the problem will then go away.

My guess is that the error you're seeing on your local machine is a symptom of a memory error that was not caught, and that then led to other errors downstream. I doubt that it indicates the true nature of the problem.

So let's keep this issue open for now. Please update to SLiM 4.1 when it is released, and if it fixes the problem, we can close this; otherwise, we'll delve more deeply.

ShyamieG commented 9 months ago

Sorry! I meant to include that I'm using SLiM 4.0.1 on both cluster and local machine. I will update to SLiM 4.1 and check again.

To some of your other points, my script uses 6-7 GB of RAM - I did try requesting up to 500 GB, and that did not solve the issue. I'm simulating malaria parasites proliferating within a human host - the genome is 24 Mb and the population size can get up to 1e5. The mutation rate is 3e-10. Tree-sequence simplification happens at every time step.
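(As a back-of-envelope sanity check on those numbers – treating the parasite genomes as haploid, which is my own simplifying assumption here – the expected mutation load is tiny, so the memory pressure shouldn't be coming from mutation count:)

```python
# Back-of-envelope mutation load, using the numbers above:
mu = 3e-10            # per-site mutation rate, per generation
L = 24e6              # genome length: 24 Mb
N = 1e5               # peak population size

per_genome = mu * L          # expected new mutations per genome per generation
per_gen_total = per_genome * N  # population-wide expectation per generation

print(per_genome, per_gen_total)  # roughly 0.0072 and 720
```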

bhaller commented 9 months ago

> Sorry! I meant to include that I'm using SLiM 4.0.1 on both cluster and local machine. I will update to SLiM 4.1 and check again.

> To some of your other points, my script uses 6-7 GB of RAM - I did try requesting up to 500 GB, and that did not solve the issue. I'm simulating malaria parasites proliferating within a human host - the genome is 24 Mb and the population size can get up to 1e5. The mutation rate is 3e-10. Tree-sequence simplification happens at every time step.

OK, thanks for the additional info. I'll tag @petrelharp here, but let's wait to see how SLiM 4.1 fares with this.

bhaller commented 9 months ago

SLiM 4.1 is now released: https://groups.google.com/g/slim-discuss/c/9P3KoPEYpDg.

ShyamieG commented 9 months ago

Hi @bhaller!

Your update fixed the segmentation fault issue, so I have been able to check back on this. Unfortunately, I found that updating to SLiM 4.1 didn't fix this issue, so I might need more help to figure out what is going wrong.

I requested 500 GB of RAM to run the code above, and it still throws the TSK_ERR_NO_MEMORY out-of-memory error. I'm fairly certain this step doesn't actually need that much memory to run - seff reports a peak memory utilization of ~4 MB. My second example above, which does work, starts from a slightly larger input file and peaks at ~360 MB of memory usage.

(I'm also running into a new bug that has surfaced at a different step 😞)

Any help would be appreciated!

bhaller commented 9 months ago

OK, so, segfault fixed but original issue not fixed. See #419 for my comment on it. It is also a tree-sequence problem, and the tree-sequence file you're trying to load seems, from #419, to be corrupted, so maybe the problems are related. Maybe in some cases the input file is corrupted in a way that SLiM's error-checking doesn't catch, but that causes tskit to run out of memory (or think erroneously that it has run out of memory) further on?

The script that you provided for this issue runs fine on my local machine (after the segfault was fixed); I ran it a bunch of times to be sure. I don't know why the problem would be platform-specific; that's a bit weird. You might make sure that you have current versions of everything installed, if you haven't already – msprime, tskit, pyslim, kastore, and whatever else is involved on the Python side, plus the GitHub head version of SLiM that fixes the segfault. It might also be worth checking that the compiler that is being used on your cluster is not horrendously out of date; if it's a really ancient compiler version, or something else is wrong with the toolchain setup, it could conceivably cause problems. I doubt it, but I'm trying to think of reasons that the model would run fine locally but error out on your cluster.

But mostly let's work on this question of where the tree-sequence file came from and why it seems to be corrupted; that seems more likely to be the problem. If the input tree-sequence file was generated by an older version of SLiM, please re-generate it with current versions of everything, to eliminate cross-version glitches. What else might be going on?

@petrelharp, if you run her script now (with the current GitHub head version of SLiM that fixes the segfault problem), are you able to reproduce the tskit error? It would be useful to catch this out-of-memory error in the debugger. Presumably, some call into tskit is asking it to do something crazy – add an enormous number of rows, or allocate an enormous metadata buffer, or something like that – which is triggering the out-of-memory error. It might be a signed -1 getting cast to a (very large) unsigned integer somewhere. If we could figure out what code path is leading to that, maybe we could at least improve the error message by catching the problem further upstream – saying "Your input file appears to be corrupted, because " or something of that sort.
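To illustrate that cast hypothesis in isolation (a plain-Python stand-in for the C behavior, not SLiM's actual code):

```python
import ctypes

# A signed -1 reinterpreted as a 64-bit unsigned size becomes the
# largest representable value, so a "-1 rows" request can look like
# an astronomically large allocation and fail with an out-of-memory
# error even though plenty of memory is free.
as_unsigned = ctypes.c_uint64(-1).value
print(as_unsigned)  # 18446744073709551615
```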

I had the paragraph below about checking for a regression with a specific SLiM commit, but now I realize that that doesn't make sense. You were getting these out-of-memory errors with SLiM 4.0.1, so whatever is going on here, it is an old problem, not a new regression. I must say I suspect this corrupted tree sequence file is the culprit; that is very fishy.

Hmm, signed -1 being cast to unsigned integer, that rings a bell. @ShyamieG, can you try running commit d1acf85abda4d36852bc30ef01371f2842bcf7f1, from November 14th, to see how it does with your model (with an input tree-sequence file that seems to work up until the out of memory error)? Right after that commit, I made commits that involved some code cleanup changes that were recommended by a code-checking tool called clang-tidy. Some of those changes involved modifications to how signed versus unsigned integers are handled. SLiM passed all of its tests after those changes, but it is nevertheless possible that a bug crept in. Sorry for the hassle, but this information would be quite useful. (If you're not sure how to check out a specific commit on your cluster and build from that, Peter can help you with that, he's the Git master. :->) (Looking at the diffs from the clang-tidy fixes, though, they do not look incorrect to me.)

ShyamieG commented 8 months ago

> But mostly let's work on this question of where the tree-sequence file came from and why it seems to be corrupted; that seems more likely to be the problem. If the input tree-sequence file was generated by an older version of SLiM, please re-generate it with current versions of everything, to eliminate cross-version glitches. What else might be going on?

After I changed my scripts to not re-label SLiM nodes, #419 is fixed, but this issue persists. Everything was re-run using the latest version of SLiM 4.1 built from source to eliminate cross-version glitches.

> The script that you provided for this issue runs fine on my local machine (after the segfault was fixed); I ran it a bunch of times to be sure. I don't know why the problem would be platform-specific; that's a bit weird. You might make sure that you have current versions of everything installed, if you haven't already – msprime, tskit, pyslim, kastore, and whatever else is involved on the Python side, plus the GitHub head version of SLiM that fixes the segfault. It might also be worth checking that the compiler that is being used on your cluster is not horrendously out of date; if it's a really ancient compiler version, or something else is wrong with the toolchain setup, it could conceivably cause problems. I doubt it, but I'm trying to think of reasons that the model would run fine locally but error out on your cluster.

Very interesting that you were able to run the example on your local machine. I did a fresh conda environment install with the latest versions of all my packages and re-ran everything again, but the error persists. I will go back to IT to let them know you were able to run the code and see what they suggest.

Thanks for your help!

bhaller commented 8 months ago

> But mostly let's work on this question of where the tree-sequence file came from and why it seems to be corrupted; that seems more likely to be the problem. If the input tree-sequence file was generated by an older version of SLiM, please re-generate it with current versions of everything, to eliminate cross-version glitches. What else might be going on?

> After I changed my scripts to not re-label SLiM nodes, #419 is fixed, but this issue persists.

Perhaps I misunderstood the chat in #419. I thought that you were saying, over there, that simply by running a sequence of simulations that each loaded the tree sequence resulting from the previous step, you ended up with colliding node pedigree IDs. Are you saying here that that problem was actually a result of your modification of the tree sequence data, rather than a natural consequence of SLiM itself? I.e., there is no SLiM bug in #419 after all, just a bug in your script? Perhaps you could post a clarifying statement over on #419; apologies if I missed this detail in what you wrote before.

> Everything was re-run using the latest version of SLiM 4.1 built from source to eliminate cross-version glitches.

Great. :->

> The script that you provided for this issue runs fine on my local machine (after the segfault was fixed); I ran it a bunch of times to be sure. I don't know why the problem would be platform-specific; that's a bit weird. You might make sure that you have current versions of everything installed, if you haven't already – msprime, tskit, pyslim, kastore, and whatever else is involved on the Python side, plus the GitHub head version of SLiM that fixes the segfault. It might also be worth checking that the compiler that is being used on your cluster is not horrendously out of date; if it's a really ancient compiler version, or something else is wrong with the toolchain setup, it could conceivably cause problems. I doubt it, but I'm trying to think of reasons that the model would run fine locally but error out on your cluster.

> Very interesting that you were able to run the example on your local machine. I did a fresh conda environment install with the latest versions of all my packages and re-ran everything again, but the error persists. I will go back to IT to let them know you were able to run the code and see what they suggest.

Perhaps I have again misunderstood, but I thought you were saying that you saw this error yourself only on your cluster, not on your local machine. That's why I've been trying to brainstorm reasons that it might occur only on the cluster. Do you see this out-of-memory error also on your local machine?

I'd say let's resolve the loose ends here before you talk to your cluster folks; I'd like to understand what's going on a bit better, if possible.

ShyamieG commented 8 months ago

> Perhaps I misunderstood the chat in #419. I thought that you were saying, over there, that simply by running a sequence of simulations that each loaded the tree sequence resulting from the previous step, you ended up with colliding node pedigree IDs. Are you saying here that that problem was actually a result of your modification of the tree sequence data, rather than a natural consequence of SLiM itself? I.e., there is no SLiM bug in #419 after all, just a bug in your script? Perhaps you could post a clarifying statement over on #419; apologies if I missed this detail in what you wrote before.

The original error I was having prevented SLiM from loading the tree sequence file, and gave me a 'genome id mismatch' error message. After I changed my code to stop it from re-labelling the node metadata, I could load the tree sequences just fine. However, I am still seeing duplicate slim ids in the tree sequence file. It seems likely that my script is causing this issue as well, because it removes certain nodes and individuals from the tree sequence file before returning it to SLiM. I will get back to you about whether it happens even when I don't run my scripts (I suspect not).

> Perhaps I have again misunderstood, but I thought you were saying that you saw this error yourself only on your cluster, not on your local machine. That's why I've been trying to brainstorm reasons that it might occur only on the cluster. Do you see this out-of-memory error also on your local machine?

On my local machine, I was seeing:

Assertion failed: (p_tables->nodes.individual[ind->genome1_->tsk_node_id_] == (tsk_id_t)tsk_individual), function AddIndividualsToTable, file /Users/bhaller/Desktop/SLiM/core/species.cpp, line 4970.
zsh: abort      slim -l 0 -seed 14081 -d "IN='H5834'" -d  -d "inf_idx='1893'" -d  -d  -d  -d
bhaller commented 8 months ago

> Perhaps I misunderstood the chat in #419. I thought that you were saying, over there, that simply by running a sequence of simulations that each loaded the tree sequence resulting from the previous step, you ended up with colliding node pedigree IDs. Are you saying here that that problem was actually a result of your modification of the tree sequence data, rather than a natural consequence of SLiM itself? I.e., there is no SLiM bug in #419 after all, just a bug in your script? Perhaps you could post a clarifying statement over on #419; apologies if I missed this detail in what you wrote before.

> The original error I was having prevented SLiM from loading the tree sequence file, and gave me a 'genome id mismatch' error message. After I changed my code to stop it from re-labelling the node metadata, I could load the tree sequences just fine. However, I am still seeing duplicate slim ids in the tree sequence file. It seems likely that my script is causing this issue as well, because it removes certain nodes and individuals from the tree sequence file before returning it to SLiM. I will get back to you about whether it happens even when I don't run my scripts (I suspect not).

Aha, I see. OK. Yes, please try the version of SLiM that is in the PR branch I mentioned, both (a) with your script making those changes, and (b) without your script. I would like to know whether the change that I made in the PR causes that #WARNING to be issued, in both cases. If your script is the culprit as you suspect, then I expect to see the warning in case (a) and not in case (b). And the PR's changes should also gracefully handle the discrepancy, so in case (a) it should not only warn, but also gracefully handle the discrepancy, such that you should no longer get duplicate IDs in case (a), because it should start new pedigree IDs higher, avoiding the collision. My hope is that that will also fix the other issues (out of memory, and your local error), which I am hoping are side effects of the tree sequence data being in a bad state. (It is also possible that the way in which you are modifying the tree sequence data is itself incorrect/inconsistent, and needs to be fixed; @petrelharp can work with you on that question. If that is the case, then the PR changes would not fix the whole problem, but would at least give a warning indicating that something fishy is going on.)
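Schematically, the graceful handling in the PR amounts to a policy like this (a pure-Python sketch of the idea, not SLiM's actual implementation):

```python
def next_pedigree_id(existing_ids):
    """Start newly assigned pedigree IDs above anything already present
    in the loaded tables, so freshly created individuals cannot collide
    with rows that came in from the (possibly hand-edited) input
    tree sequence."""
    return max(existing_ids, default=-1) + 1

# Hypothetical IDs found in a loaded input tree sequence:
print(next_pedigree_id([0, 1, 5, 12]))  # 13
print(next_pedigree_id([]))             # 0
```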

> Perhaps I have again misunderstood, but I thought you were saying that you saw this error yourself only on your cluster, not on your local machine. That's why I've been trying to brainstorm reasons that it might occur only on the cluster. Do you see this out-of-memory error also on your local machine?

> On my local machine, I was seeing:
>
> Assertion failed: (p_tables->nodes.individual[ind->genome1_->tsk_node_id_] == (tsk_id_t)tsk_individual), function AddIndividualsToTable, file /Users/bhaller/Desktop/SLiM/core/species.cpp, line 4970.
> zsh: abort      slim -l 0 -seed 14081 -d "IN='H5834'" -d  -d "inf_idx='1893'" -d  -d  -d  -d

Oh! I don't recall hearing this before. Interesting. Well, this does suggest that the inconsistent internal state of the tree sequence data is causing problems, I guess. Interesting that your local machine catches this, while the cluster (I hypothesize) fails to catch it, merrily continues down the path, and ends up with a memory error as a result. Anyhow, I am hoping that both problems will be fixed by the PR. I await your update, thanks!

ShyamieG commented 8 months ago

> Aha, I see. OK. Yes, please try the version of SLiM that is in the PR branch I mentioned, both (a) with your script making those changes, and (b) without your script. I would like to know whether the change that I made in the PR causes that #WARNING to be issued, in both cases.

Okay, I rebuilt SLiM and re-ran. When my sequence hits the step I originally posted here, I see the warning you just added AND the out-of-memory error goes away - YAY!

Now I'm going to run the sequence without that simplification step to address (b). It might take me a while to report back on that because it takes much longer to run without that step.

As I have been mulling over all this, I think I realized how to do what I want almost entirely within SLiM. The only part I'm not sure about is the stuff in the annotation step, which modifies the user_metadata but not the tree sequence itself. Do you think that python step would be okay to keep?

It seems like most of my woes have been caused by my messing with the ts files without fully understanding the consequences for SLiM. It might be helpful for users if the SLiM/tskit documentation had some general warnings about doing weird stuff like this. Or maybe even a ts check-up function in tskit that could look for common types of discrepancies that would trip up SLiM? Including these issues, I've stumbled upon quite a few different ways of modifying a ts that SLiM does not like 🥲
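For example, the kind of check-up I'm imagining could start with something as simple as scanning for duplicate pedigree IDs – sketched here over plain lists of hypothetical IDs; a real version would pull the IDs out of the tskit individual table's metadata:

```python
from collections import Counter

def duplicate_pedigree_ids(pedigree_ids):
    """Return pedigree IDs that occur more than once - a state that
    SLiM cannot load safely."""
    return sorted(pid for pid, n in Counter(pedigree_ids).items() if n > 1)

# Hypothetical IDs extracted from individual metadata:
print(duplicate_pedigree_ids([3, 7, 7, 9, 3]))  # [3, 7]
print(duplicate_pedigree_ids([1, 2, 3]))        # []
```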

bhaller commented 8 months ago

> Aha, I see. OK. Yes, please try the version of SLiM that is in the PR branch I mentioned, both (a) with your script making those changes, and (b) without your script. I would like to know whether the change that I made in the PR causes that #WARNING to be issued, in both cases.

> Okay, I rebuilt SLiM and re-ran. When my sequence hits the step I originally posted here, I see the warning you just added AND the out-of-memory error goes away - YAY!

OK, great. I now understand the problem, I think, although I'm still not sure what the right fix is. Your input tree sequence is probably still bad, @ShyamieG; the fix in the PR is just masking the problem (and warning about it). I'm not sure where to take that next.

> Now I'm going to run the sequence without that simplification step to address (b). It might take me a while to report back on that because it takes much longer to run without that step.

> As I have been mulling over all this, I think I realized how to do what I want almost entirely within SLiM. The only part I'm not sure about is the stuff in the annotation step, which modifies the user_metadata but not the tree sequence itself. Do you think that python step would be okay to keep?

I'll let @petrelharp speak to that, I'm not sure.

> It seems like most of my woes have been caused by my messing with the ts files without fully understanding the consequences for SLiM. It might be helpful for users if the SLiM/tskit documentation had some general warnings about doing weird stuff like this. Or maybe even a ts check-up function in tskit that could look for common types of discrepancies that would trip up SLiM? Including these issues, I've stumbled upon quite a few different ways of modifying a ts that SLiM does not like 🥲

Yes, this is the big remaining question: what to do about all this. The tree sequence data format is quite complex, with many requirements and dependencies; some of those are imposed by SLiM, but many of them are imposed by tskit and the more fundamental nature of the data structure. There are many ways that it could be incorrect/corrupted, and it's hard to guess which of those would be more likely/common. And checking for all of these different possible errors is itself complex, which slows down SLiM's load time, which matters a lot to some folks. I'm not sure what to do.

To some extent, I think that if somebody goes in and modifies the data of the .trees file, they're off-piste and we can't really try to make that safe in general. Just like if somebody hand-edits the binary data inside a Microsoft Word document or something – Microsoft is not even going to try to make that a safe/easy thing to do. :-> Maybe there are ways we could document things better, check for correctness better, etc., but I'm not sure how far down that road we want to go, and I'm not sure we want to really encourage it by smoothing the way for it.

If there are use cases where it is really needed, it might be better for us to support those use cases with new calls in SLiM/tskit/pyslim, so that users can do those things without needing to do it this way. @petrelharp, maybe you and I ought to have a zoom about this?

petrelharp commented 8 months ago

Let me answer this question:

> The only part I'm not sure about is the stuff in the annotation step, which modifies the user_metadata but not the tree sequence itself. Do you think that python step would be okay to keep?

Well, modifying user_metadata should be very safe - SLiM doesn't use that for anything.

bhaller commented 8 months ago

Hi folks. OK, I have merged the PR. @ShyamieG, I recommend that you now work on the head of the master branch. It has all of the fixes resulting from these issues, including the segfault fix, handling the case of a genome ID (divided by 2) being greater than any individual ID in the tree sequence (which was due to a problem with your Python script, I guess, but which it turns out can also occur naturally), and checking for duplicate genome pedigree IDs (which perhaps would have given you a better error message than the out-of-memory error you were getting, as a result of the other problems). I'm going to close this issue now. Thanks for the very useful report, and please let us know if you hit any further problems!