MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0

Slim 4.1 core dumping on computing cluster #425

Closed · evromero-uw closed this issue 5 months ago

evromero-uw commented 7 months ago

Hello, I am trying to run simulations in SLiM 4.1 via snakemake. I noticed that every 20 or so jobs, one of the simulations will run out of allocated memory (even when I allocate 50Gb). However, if I run the exact same pipeline using SLiM 3.7, I do not run into this problem even with a smaller memory allocation (10Gb). I am including an example of the slim script and a screenshot of the error I am getting.

Thank you for your help!

Screenshot 2024-02-02 at 10 51 05 AM

slim_script.txt

bhaller commented 7 months ago

Hi! It looks like I need a file, hxb2.txt, to proceed.

evromero-uw commented 7 months ago

Apologies, here is the file. It's just the HXB2 reference sequence for HIV, sliced to the length we're simulating: hxb2.txt

bhaller commented 7 months ago

I'm running it now. It's a very small model; it shouldn't require anywhere NEAR 50Gb. That is crazy. It looks like it uses less than 1 Gb, in fact. So far I'm unable to reproduce the problem. So, a few questions come to mind.

Basically I need something to hang my hat on here. I've run it a bunch of times now, while typing this, and haven't gotten a crash. I suspect it is something specific to the machine you're running on – a very old compiler version that is producing bad code, or a library version mismatch on the machine, or who knows what. There are a million things like that that can be wrong with a machine's configuration, and it's nearly impossible to debug. Is there simply a different machine that you can try your workflow on, to see whether it reproduces there also? Ideally a fairly different machine, not just an identically configured server in the same computing cluster or whatever.

evromero-uw commented 7 months ago

Yeah, it seems very tricky, especially because the cluster version of 4.1 is on a new Ubuntu build (the cluster is converting from CentOS, so they just rebuilt the SLiM 4.1 module). I will get back soon with more details.

Tried so far: I've contacted the IT people, and they are looking through some of the host logs for failed vs. successful jobs, but they say the core dump won't be useful: "Without debugging symbols in the software, the core dump unfortunately won't be much use because it will lack the mapping to functions and line numbers in the source code. It should be possible to do that, though in many cases, without a good knowledge of the application, using a debugger isn't very fruitful."

I also added some print statements, and it looks like several generations run properly before the failure occurs (the simulation gets through 4 generations, and then the crash happens before reproduction is complete).

bhaller commented 7 months ago

Well, of course you can tell them that you have someone with a "good knowledge of the application" on the hook – I wrote it. I'm quite used to interpreting backtraces, and they are often extremely useful. If debugging symbols aren't present (I'm slightly surprised by that), you can probably build SLiM 4.1 from sources with debugging symbols pretty easily. I'm not sure of the exact steps, because they will depend on the platform/compiler/linker, so your IT folks might be of more help there.
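
For reference, a rough sketch of the kind of from-source build I mean, on Linux with git and cmake available (chapter 2 of the SLiM manual covers this properly; paths and module setup will differ on your cluster):

    # fetch the SLiM sources and do an out-of-tree build
    git clone https://github.com/MesserLab/SLiM.git
    mkdir SLiM_build
    cd SLiM_build

    # a Debug (or RelWithDebInfo) build keeps the symbols that make backtraces readable
    cmake -D CMAKE_BUILD_TYPE=Debug ../SLiM
    make slim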

evromero-uw commented 7 months ago

I've done a bit more recon.

  • First, I checked both of the host builds (our cluster is currently migrating from CentOS to Ubuntu). It looks like on both builds, SLiM 3.7 is working but SLiM 4.1 is experiencing the bug.
  • The cluster admin checked the logs for some of the hosts executing the jobs that failed: "I took a look at the logs for the exec host around the time the jobs ended and didn't see any signs of the jobs or exec host running low on memory."
  • I reproduced the error while enabling core dumps and was able to get this backtrace (rough commands sketched below).

Thank you for all of your help! snakejob.run_slim.81.sh.e301574291.txt
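
For completeness, roughly the commands involved in capturing the core dump and backtrace; this is just a sketch, since core-file naming and the slim binary path depend on the cluster setup:

    # allow a core file to be written by this shell / job
    ulimit -c unlimited

    # re-run the failing job; after a crash, a core file is left behind
    # (its name and location are governed by the system's core_pattern setting)
    slim slim_script.txt

    # print a backtrace from the core non-interactively
    # (most useful when the slim binary was built with debugging symbols)
    gdb --batch -ex "bt" /path/to/slim /path/to/core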

bhaller commented 7 months ago

I've done a bit more recon.

  • First, I checked both of the host builds (our cluster is currently migrating from CentOS to Ubuntu). It looks like on both builds, SLiM 3.7 is working but SLiM 4.1 is experiencing the bug.
  • The cluster admin checked the logs for some of the hosts executing the jobs that failed: "I took a look at the logs for the exec host around the time the jobs ended and didn't see any signs of the jobs or exec host running low on memory."
  • I reproduced the error while enabling core dumps and was able to get this backtrace.

Thank you for all of your help! snakejob.run_slim.81.sh.e301574291.txt

Hi! OK, that backtrace is helpful, thanks!

It looks like perhaps a memory allocation is being asked for that is suddenly much too large, and fails. I've added a debugging log at that point, which might perhaps be helpful. That is committed on GitHub, so it would be helpful now if you could build SLiM from the GitHub sources (the head of the master branch). Chapter 2 of the SLiM manual discusses how to do that, if you're not familiar with it. It should error out a bit earlier, at least.

But it looks possible that the problem occurs upstream of this failed allocation; it looks like something earlier might be corrupting memory, leading to a bad allocation size that then fails. It is harder to guess what the cause of that might be. I still suspect some problem with the machine configuration, like a library version mismatch or a toolchain problem. You say your cluster is currently migrating to a different Linux flavor. Could the state of things be bad, mid-migration? Could the SLiM 4.1 build have been done on top of libraries that have, themselves, now been replaced or rebuilt? I'd suggest that you ask the cluster admins to rebuild SLiM 4.1 for you, perhaps after they are finished migrating the cluster; or that you simply build your own SLiM from sources, which is not hard, and see how that works for you. Your model is so simple and trivial that it seems unlikely to me that this is really a SLiM problem; if it were, people around the world would be having similar problems. (But one never knows – maybe there is something unique about your model that triggers a bug where other apparently similar models do not trigger the bug.) Can you look into this possibility a bit more, for example by building your own SLiM 4.1 from the release tagged 4.1 on GitHub?
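
If it helps, a sketch of building from a specific point in the repository (reusing the build directory from the sketch above), assuming the 4.1 release is tagged on GitHub; the placeholder below is not a real tag name, so check with git tag:

    cd SLiM
    git fetch --tags
    git tag                   # list release tags and pick the one for the 4.1 release
    git checkout <4.1-tag>    # placeholder; or 'git checkout master' for the current head
    cd ../SLiM_build
    cmake ../SLiM
    make slim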

evromero-uw commented 7 months ago

Here I am attaching the new logs for two separate jobs that failed in different pipeline runs with the new build of SLiM from the master branch. snakejob.run_slim.93.sh.e301578699.txt snakejob.run_slim.155.sh.e301579107.txt

Currently, our cluster is split, with different head nodes serving separate CentOS and Ubuntu clusters. On these, I have tried four builds of SLiM 4.1.

Maybe there is some underlying software that is the same on both the new and old nodes, but the problem seems to persist across multiple systems. The good thing is that I can use SLiM 3.7 as a workaround, so I'm okay if it's not possible to pinpoint the problem.

Thank you for all of your help, Elena

bhaller commented 7 months ago

Interesting. Thanks for the info; it continues to be useful. I have added a debugging check at the point shown by those two new backtraces, too. It would be great if you could do a run against the current GitHub head. I'd like to see what those debug logs print out; I think this is a memory corruption error, which is causing a crazy population size to be used because that field in memory has had a bad value put into it. I'm hoping the debug output might confirm that suspicion. If so, there are then additional angles I could pursue. I'd really like to track this down, in case it is a SLiM bug that might bite other people. Thanks!

evromero-uw commented 7 months ago

Will do, is it the version of slim up to the tweak previous? Or is there a newer one?

bhaller commented 7 months ago

I don't know what "tweak previous" means, lol. But I did the last debug log commit just a couple of minutes ago, so you should get the current GitHub head version.

evromero-uw commented 7 months ago

Oh sorry, "tweak previous" was the most recent comment on GitHub. Anyway, I just cloned SLiM fresh, rebuilt it, and ended up with this backtrace: snakejob.run_slim.187.sh.e301580237.txt

bhaller commented 7 months ago

Hmm. Looks like a memory corruption bug that is happening earlier than the point where the error is flagged in the backtrace. I've done a bunch of runs of your model on my machine with various debugging tools enabled – ASan, UBSan, Guard Malloc, etc. – which would normally catch such a problem, and none of those tools has found anything. You've tried it on a couple of machines with different hardware, but they are all administered by the same admin team and thus perhaps all share the same underlying configuration problem. Have you tried to reproduce the problem on your own local machine, whatever that might be? Are you able to do so?
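
If it's useful on your end, here is a sketch of a sanitizer-enabled build; these are generic GCC/Clang flags passed through CMake, not a SLiM-specific build option, so treat them as an assumption about your toolchain:

    # rebuild slim with AddressSanitizer and UBSan instrumentation
    cmake -D CMAKE_BUILD_TYPE=Debug \
          -D CMAKE_C_FLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer" \
          -D CMAKE_CXX_FLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer" \
          ../SLiM
    make slim

    # a sanitized run aborts with a report at the first bad access,
    # rather than crashing later at an unrelated allocation
    ./slim slim_script.txt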

bhaller commented 7 months ago

I just added a little more debugging code; it'd be great if you could git pull to the current GitHub head, rebuild, and try again. I had an idea for possibly catching the memory smasher, and even found a small bug (which I think turns out to be unrelated). Let's see what happens. Thanks.

evromero-uw commented 7 months ago

I tried running it maybe 30-40 times on my local machine (macOS) and it works just fine, so it does seem specific either to Linux or to something about the cluster setup, but I do not have access to other Linux machines with different configurations. I pulled the current head, rebuilt, and got these logs. snakejob.run_slim.385.sh.e301605232.txt snakejob.run_slim.195.sh.e301605008.txt

bhaller commented 7 months ago

OK, thanks. I think backtraces have revealed as much as they're going to reveal; the memory smash is happening somewhere else, prior to the crash. I will try running the model on a Linux machine I have access to. Good to know that it doesn't happen on your local machine; unfortunate, though, since debugging is easier on a local machine. Stay tuned, I'll let you know what my Linux tests reveal.

evromero-uw commented 7 months ago

Sounds good. It only occurs intermittently, once every 30 or so jobs, so I think there is some stochastic effect (where memory gets allocated, etc.). I only noticed it because the failed job crashes the much larger snakemake pipeline. Thank you for your help, and sorry it seems so tricky.

evromero-uw commented 7 months ago

Also, I figured I should mention that I haven't been running any parameters different from the slim script I posted, so that isn't changing as a variable. I was just running many reps of the exact same simulation simultaneously, with the goal of combining the data for downstream analyses.

bhaller commented 5 months ago

Well, I just got around to trying this on the Linux box I have access to. I did 50 runs, no problems. Since neither of us can reproduce it on macOS either, I think this is pretty clearly a configuration problem with those cluster machines, as I've discussed above. There's no way for me to debug it further, and in any case there's probably nothing I could do about it; most likely it is just a bad library version, or a bad compiler version, or some other toolchain issue on those machines. You could try switching up the toolchain – using a different compiler than whatever those machines default to, for example. Apart from that, it's up to the system administrators to figure out. I'm going to close this issue now. Sorry!
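
In case it helps, a sketch of what switching up the toolchain might look like, assuming the cluster provides an alternative compiler such as clang (loaded via whatever module system is in use; adjust names to your environment):

    # point cmake at a different compiler than the machine's default
    CC=clang CXX=clang++ cmake -D CMAKE_BUILD_TYPE=Release ../SLiM
    make slim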