MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0

Invalid child values in large simulations #62

Closed molpopgen closed 4 years ago

molpopgen commented 4 years ago

I'm basically moving tskit-dev/tskit#402 over here. I submitted #61 to trigger the bug.

The code in #61 gives the output shown below when run with the script that follows:

terminate called after throwing an instance of 'std::runtime_error'
  what():  child index out of range: -704, 214748263

The two numbers are a child index and the length of the node table. The script is:

initialize() {
        // initializeSLiMModelType("nonWF");
        initializeTreeSeq(simplificationInterval=500);
        // defineConstant("Np", 50);
        // defineConstant("K", 1000000);        // total carrying capacity
        // defineConstant("Kp", asInteger(K / Np));

        initializeMutationType("m1", 0.5, "f", 0.0);
        // m1.convertToSubstitution = T;

        initializeGenomicElementType("g1", m1, 1.0);
        initializeGenomicElement(g1, 0, 600);
        initializeMutationRate(0);
        initializeRecombinationRate(0);
}
// reproduction() {
//      subpop.addCrossed(individual, subpop.sampleIndividuals(1));
// }
1 early() {
        print(time());
        sim.addSubpop("p1", 1000000);
        // for (i in 1:Np)
        //      sim.addSubpop(i, Kp);
}
// early() {
        // for (subpop in sim.subpopulations)
        //      print(subpop.individualCount);
        // for (subpop in sim.subpopulations)
        //      subpop.fitnessScaling = Kp / subpop.individualCount;
// }
500 late() {
        print(time());
        sim.outputFixedMutations();
}

It seems that there are far fewer nodes than expected? Also, it really isn't clear how the negative child index is happening.
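
(For context, the check added in #61 is essentially a bounds check on the edge table's child column against the node table length. A minimal sketch of that kind of check, written against the tskit C table structs; this is an illustration, not the literal #61 diff:)

// Sketch only: roughly the kind of check #61 adds, not its exact code.
// Walks the edge table and verifies that every child ID is a valid node index.
#include <stdexcept>
#include <string>
#include <tskit.h>   // tskit C API; the include path depends on how tskit is vendored

static void CheckEdgeChildIndices(const tsk_table_collection_t &tables)
{
    tsk_id_t num_nodes = (tsk_id_t)tables.nodes.num_rows;

    for (tsk_size_t i = 0; i < tables.edges.num_rows; ++i)
    {
        tsk_id_t child = tables.edges.child[i];

        if ((child < 0) || (child >= num_nodes))
            throw std::runtime_error("child index out of range: " + std::to_string(child) + ", " + std::to_string(tables.nodes.num_rows));
    }
}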

bhaller commented 4 years ago

OK, so I guess I'm continuing to be dense, but: 1e9 + 2e6 is 1002000000. That is larger than 214748263, the reported number of rows in the node table. So why do you say that "slim is adding far too many nodes during a simulation somehow"? If anything, isn't the number of nodes much smaller than expected? (Which also seems odd.) I'm also puzzled by the negative child index of -704; do you think we've overflowed the range of int32_t here? That would seem quite surprising, and even more surprising that we didn't crash a long time ago if we really did that, since we must have been writing values to illegal memory locations for quite a long time! Can you spell out for me what exactly you think this debug output shows? Sorry, the tree-seq stuff is not my area of expertise; perhaps @petrelharp should be in the loop on this since he's the main architect of it in SLiM.
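
(For readers without tskit-dev/tskit#402 open: the 1e9 + 2e6 figure presumably comes from counting two genome nodes per individual per generation in this model with no simplification, plus the founding generation. A quick tally under that assumption:)

// Back-of-the-envelope node count for the model above, assuming no simplification
// ever runs and two genome nodes are recorded per individual per generation.
// (This reconstructs where 1e9 + 2e6 presumably comes from; see tskit-dev/tskit#402.)
#include <cstdio>

int main(void)
{
    long long nodes_per_gen = 2LL * 1000000;                    // sim.addSubpop("p1", 1000000)
    long long expected = nodes_per_gen * 500 + nodes_per_gen;   // 500 generations plus the founders
    std::printf("%lld\n", expected);                            // 1002000000
    return 0;
}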

bhaller commented 4 years ago

For what it's worth, when I run the model with the debug code I get the same error, "child index out of range: -704, 214748263". But I don't know how to interpret what I'm seeing.

molpopgen commented 4 years ago

That is larger than 214748263, the reported number of rows in the node table. So why do you say that "slim is adding far too many nodes during a simulation somehow"?

oops--I was off when I wrote that.

molpopgen commented 4 years ago

For what it's worth, when I run the model with the debug code I get the same error, "child index out of range: -704, 214748263". But I don't know how to interpret what I'm seeing.

This is explained above, and it should be clear from the diff in #61 where the values come from. Yes, the negative number is the problem, though. I agree: if this problem were happening all the time, then you'd see crashes all the time. Indeed, if anything the node table is too small.

molpopgen commented 4 years ago

Can you spell out for me what exactly you think this debug output shows?

It seems to show that a negative child index is being recorded (see the diff in #61). I am assuming that we are entering the tskit code for the first time, but it is hard for me to guarantee that, as there seem to be multiple ways to get there (sort and simplify are called on at least two different lines in slim_core.cpp).

bhaller commented 4 years ago

Yes, I understand the literal meaning of the debug output, it's the interpretation – what is really happening to lead to that output – that I'm having trouble with. @petrelharp, thoughts? This is bizarre.

molpopgen commented 4 years ago

Yeah, I have no idea about the interpretation, other than that the issue happens before hitting any tskit code, meaning it is generated somewhere within slim. I've been thinking about this because I've been working on understanding whether it really is an issue in tskit, and all evidence so far points to "no". So I dug into the slim code a bit and did #61. It seems that more aggressive debug code will be helpful during the sims, to see if you can catch an invalid child ID in action.

bhaller commented 4 years ago

Yes, if this code is doing what it intends to do I'll probably add it to the DEBUG build of SLiM. The more DEBUG checks the better. :->

molpopgen commented 4 years ago

@petrelharp can say more, and it looks to me like it could/should be a function that gets called at every place right before the tsk_foo call that sorts tables, in order to guarantee that all execution paths hit it.

I did it via exceptions b/c that was easiest for me. I'm guessing you have some way of handling these things so that your GUI doesn't explode upon error?
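
(For illustration, one way to arrange that: a single helper invoked immediately before every tsk_table_collection_sort() call. This is a sketch; the helper names are made up, and the throw stands in for whatever error handling SLiM/SLiMgui actually uses.)

// Sketch: run the integrity check on every code path that is about to sort/simplify.
// CheckEdgeChildIndices() is the hypothetical helper sketched earlier in this thread;
// the throw would be replaced by SLiM's own error-reporting mechanism.
#include <stdexcept>
#include <tskit.h>

void SortTablesChecked(tsk_table_collection_t *tables)
{
    CheckEdgeChildIndices(*tables);   // catch a bad child ID before tskit ever sees it

    int ret = tsk_table_collection_sort(tables, /* start */ NULL, /* options */ 0);
    if (ret != 0)
        throw std::runtime_error(tsk_strerror(ret));
}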

molpopgen commented 4 years ago

For the record, fwdpy11 is also missing a check like this. I added one in order to generate a table collection of the same size for testing, but haven't committed it. I need to decide where and how best to put it in.

molpopgen commented 4 years ago

In the meantime, I just added some printf statements into the tskit code to see if sort/simplify ever do get hit there. It takes a couple hours to run, so I'll get back to you tomorrow.

It is such a bizarre number of nodes that it seems to me that simplify may have been run, even though the simplification interval is set to the simulation length.

molpopgen commented 4 years ago

No printf output.

jeromekelleher commented 4 years ago

Just catching up on this now - strange stuff! But, looks like we can probably close tskit-dev/tskit#402?

molpopgen commented 4 years ago

Yes we can close the issue in tskit. I still think @petrelharp should take a look at slim, though, to sort out what's going on?

bhaller commented 4 years ago

I just put the debugging code from #61 into SLiM, ran the SLiM model given above, and did not get any debugging output. The debug code did get executed at the end of the run – I had a breakpoint on it – and the number of nodes in the node table was 214748263 as before. But it generated no debug logs. Not sure why, but unimportant.

I then extended the check to also look for a negative value for either index, moved the debug code into _RunOneGenerationWF() so that it runs in every generation, changed it to log to std::cout rather than throwing, and had it log the number of edges and nodes in each generation, since that seemed odd (see discussion above). Here's the resulting output:

1: tables_.edges.num_rows == 2000000, tables_.nodes.num_rows == 4000000
2: tables_.edges.num_rows == 4000000, tables_.nodes.num_rows == 6000000
3: tables_.edges.num_rows == 6000000, tables_.nodes.num_rows == 8000000
4: tables_.edges.num_rows == 8000000, tables_.nodes.num_rows == 10000000
5: tables_.edges.num_rows == 10000000, tables_.nodes.num_rows == 12000000
...
104: tables_.edges.num_rows == 208000000, tables_.nodes.num_rows == 210000000
105: tables_.edges.num_rows == 210000000, tables_.nodes.num_rows == 212000000
106: tables_.edges.num_rows == 212000000, tables_.nodes.num_rows == 214000000
107: tables_.edges.num_rows == 214000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
108: tables_.edges.num_rows == 216000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
109: tables_.edges.num_rows == 218000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
110: tables_.edges.num_rows == 220000000, tables_.nodes.num_rows == 214748263
child index out of range: -704, 214748263
...

So, something clearly went wrong in generation 107; it looks like the node table probably overflowed, or something like that, although 214748263 is not a round power of two (its log2 is 27.678, which doesn't suggest anything to me).

It seems like the upshot here is:

(1) A sufficiently large model can overflow the tree-sequence recording tables;

(2) This overflow is not being caught by the tskit code, at least not by the version we're presently using in SLiM (which is a bit out of date, I think); that seems like a bug unless we are somehow causing the overflow in SLiM without going through a tskit API to do it;

(3) This debug code would perhaps be a useful thing to add somewhere in SLiM. Once we understand the bug better, though, a more precise check for the actual overflow condition could be added instead, or in addition. Scanning every entry of the edge table in every generation is probably a bit time-consuming, but in a DEBUG build, at least, it might be reasonable. Maybe there are a bunch of other similar integrity checks we ought to be doing as well; if they help catch a bug like this, they're worth it. :->

So, (2) seems like where we need to focus for now. @petrelharp, do you have a suggestion as to where in the code I ought to look to pin this down, or why it might be happening, or the significance of 214748263?

bhaller commented 4 years ago

@petrelharp, you have been intending to merge new tskit code into SLiM for a while now. Maybe now would be a good time to do that, so that I'm not debugging this issue against stale tskit code? It'd be nice to be sure that this bug still actually exists before spending a bunch of time tracking it down. :-> When do you think you'd have cycles free to do that merge?

bhaller commented 4 years ago

@molpopgen @jeromekelleher any insights before I delve deeper into this?

molpopgen commented 4 years ago

Afraid not. You are still only 1/10th of the way to overflow by the time the error happens, which is why this is so odd.

In [1]: import numpy as np

In [2]: x = 214000000

In [3]: np.iinfo(np.int32).max
Out[3]: 2147483647

In [4]: np.iinfo(np.int32).max/x
Out[4]: 10.034970313084113

bhaller commented 4 years ago

Interesting that the factor is almost exactly ten. I wonder whether perhaps a metadata buffer is overflowing, with 10 bytes per entry, or something like that.

petrelharp commented 4 years ago

If it is overflow, then I think you're right; we should definitely merge in the new tskit code. I'm a bit overwhelmed at the moment, but this is pretty important: this is a big simulation, but not that big.

But, thinking more: if it really is overflow, then we shouldn't really be waiting for tskit to catch it, I think? The things inserted into the edge tables are node IDs, and the node IDs are assigned by SLiM and stored as properties of each individual (well, IIRC the indiv has ID n and the genomes have IDs 2n and 2n+1)? So at least for debugging and maybe always, we should be checking - say, when we add each new individual to the node table - whether their ID is negative?

I'm not convinced it's overflow, though: if it was overflow, why would the first one be -704? And as Kevin says, we shouldn't have gotten to overflow yet?

petrelharp commented 4 years ago

Interesting that the factor is almost exactly ten. I wonder whether perhaps a metadata buffer is overflowing, with 10 bytes per entry, or something like that.

Oh, it's probably the metadata, you're right. Never mind my previous comment about individual IDs.

petrelharp commented 4 years ago

So, is this caught in tskit now? Well, here's something representative of inserting metadata, which asserts that one thing isn't bigger than another; and in fact there is a check that the second thing doesn't get too big, which is not present in SLiM's copy of the code. Which is embarrassingly old.

So, once I merge in the new tskit, we should appropriately get an error here, if that's really the problem?
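
(Roughly the kind of guard being referred to; this is a sketch of the idea only, not tskit's literal internals. The point is that newer tskit refuses to grow a ragged column such as node metadata past its size limit and hands back a negative error code instead:)

// Sketch of the idea (not tskit's actual code): before growing the metadata column,
// check that the new total length still fits under the column size limit, and
// return a negative tskit error code if it would not.
#include <stdint.h>
#include <tskit.h>

static int ExpandMetadataChecked(tsk_size_t current_length, tsk_size_t additional_length)
{
    const uint64_t max_column_length = ((uint64_t)1) << 31;   // "cannot be more than 2**31 bytes"

    if ((uint64_t)current_length + (uint64_t)additional_length > max_column_length)
        return TSK_ERR_COLUMN_OVERFLOW;   // a negative error code

    // ... otherwise grow the column as before ...
    return 0;
}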

molpopgen commented 4 years ago

I guess I'm confused how metadata can lead to a negative node value in the edge table, but I never tried to track down how and where you are adding rows.

petrelharp commented 4 years ago

I was guessing it happens by overflowing the metadata, which could be next to the edge table in memory? But then why would it always be the same problem?

bhaller commented 4 years ago

OK. I set a breakpoint in tsk_node_table_add_row_internal() when self->num_rows == 214748262 (one before the point at which things seem to go south). When that breakpoint is hit, self->metadata_length is 2147482620, which is 1028 less than 2^31.

The next time tsk_node_table_add_row() is called, tsk_node_table_expand_metadata() is called with additional_length == 10 (so, 10 bytes per metadata entry, as suspected). In that call, expand_column() does get called to expand the metadata, and self->max_metadata_length_increment is 1024, so the size of the metadata buffer goes over 2^31 (or very nearly; 4 bytes short?). The expansion seems to proceed without drama, however, and tsk_size_t is uint32_t so it should be able to go up to 2^32, perhaps.

I have not yet seen where things actually go south; I'm continuing to trace through the code. But it is certainly clear that it is the result of the metadata buffer growing beyond 2^31 bytes.
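
(The numbers in that trace line up exactly with the ten-bytes-per-row guess above; a quick check of the arithmetic, nothing SLiM- or tskit-specific:)

// Arithmetic check of the trace above.
#include <cstdio>

int main(void)
{
    long long metadata_length = 214748262LL * 10;      // 2147482620, at num_rows == 214748262
    long long after_expand = metadata_length + 1024;   // max_metadata_length_increment == 1024
    long long two_to_31 = 1LL << 31;                   // 2147483648

    std::printf("%lld\n", two_to_31 - metadata_length);   // 1028, the "1028 less than 2^31"
    std::printf("%lld\n", two_to_31 - after_expand);      // 4, the "4 bytes short"
    return 0;
}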

bhaller commented 4 years ago

OK, the next time that tsk_node_table_add_row() is called, tsk_node_table_expand_metadata() detects an overflow and returns TSK_ERR_COLUMN_OVERFLOW. Guess what the value of that constant is? -704. tsk_node_table_add_row() sees that ret != 0 and returns the -704 to the caller. In SLiMSim::RecordNewGenome(), that return value goes into offspringTSKID, with no check that it is positive – no error check at all. So, there's the bug.

petrelharp commented 4 years ago

ACK!!! We need to wrap that in

ret = tsk_node_table_add_row( ... );
if (ret < 0) {
    // handle the tskit error
}

petrelharp commented 4 years ago

Looks like we've got the same error here and here and here and here and here maybe, although there's an error handler later... and a few more places...

... but deal with it correctly here.

bhaller commented 4 years ago

Yep. A quick search shows that we're calling tskit add_row() functions in various spots and not checking the return value we get back for being < 0. I think we never reviewed that aspect of our code after we got things working, @petrelharp! So I'll add a check not only there, but in several other spots too. OK, now the test model halts in SLiMgui and shows this error: "tsk_node_table_add_row: Table column too large; cannot be more than 2**31 bytes." Seems good. I just need to decide what debugging code to leave behind, and where to put it...
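
(One shape such a check could take, as a sketch: a small helper wrapped around every tsk_*_add_row() call. The helper name is invented, and the throw stands in for SLiM's real error handling.)

// Sketch of a centralized return-value check (hypothetical helper, not SLiM's actual
// code). tskit's add_row functions return the new row's ID on success and a negative
// error code on failure, so anything < 0 has to be treated as an error.
#include <stdexcept>
#include <string>
#include <tskit.h>

static tsk_id_t CheckTskReturn(tsk_id_t ret, const char *where)
{
    if (ret < 0)
        throw std::runtime_error(std::string(where) + ": " + tsk_strerror((int)ret));
    return ret;
}

// usage, e.g. when recording a new genome:
//     offspringTSKID = CheckTskReturn(tsk_node_table_add_row( ... ), "RecordNewGenome");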

bhaller commented 4 years ago

I think I'm going to remove the debug check now that the bug is found, upon reflection. It's very specific to this scenario; one could write similar code to sanity-check all kinds of things about the state of the tables, which is probably a good idea, but (1) there's no reason to focus on checking this specific thing without checking all the others, (2) checking all the others is not something I'm going to do right now, and it really would exceed my level of knowledge of the inner workings of tskit, and (3) to that point, this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables. Which would be great. I'll log a new issue to that effect.

petrelharp commented 4 years ago

I think we're good - tskit is generally very good at checking inputs; the problem here is that we weren't paying attention to the error it gave us! Thanks for tracking that down!!

petrelharp commented 4 years ago

this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables

Maybe you're looking for this function?

bhaller commented 4 years ago

this check function would really make more sense in tskit itself, as a utility function that all clients could call to sanity-check their tables

Maybe you're looking for this function?

Indeed. See https://github.com/tskit-dev/tskit/issues/592 which I just closed, but maybe should not have.
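
(If the function meant here is tskit's tsk_table_collection_check_integrity(), a usage sketch looks like this; the options flags and the error handling are left to the caller:)

// Usage sketch: ask tskit to sanity-check an entire table collection.
// Passing 0 for the options runs the basic checks; other flags make the checks deeper.
#include <stdexcept>
#include <tskit.h>

void SanityCheckTables(tsk_table_collection_t *tables)
{
    int ret = (int)tsk_table_collection_check_integrity(tables, 0);
    if (ret < 0)
        throw std::runtime_error(tsk_strerror(ret));
}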