MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0
160 stars 30 forks source link

segmentation fault after addCloned() or takeMigrants() with haploids #418

Closed ShyamieG closed 9 months ago

ShyamieG commented 9 months ago

Hello @bhaller

I recently updated to SLiM 4.1 to address the issue I was having here. Unfortunately, I wasn't able to see if that issue was resolved by the update because it is failing at an earlier point with a segmentation fault.

My script starts with a tree sequence file and goes through 5 cycles before failing. Please see attached for my input files and slim script, and below for the bash code:

slim -l 0 -seed 13926 -d "IN='V1204'" -d "infile='inf-H-seed_Hseed_to_V1204_on_day_17_simplified_annotated.ts'" -d "inf_idx='61'" -d "out=c('H7058','H7295')" -d "sampling_events=c(13,35)" -d "outpath='.'" \
                                  -d "mu='3.00E-10'" \
                                  -d "positions_file='PvP01_genome_positions'" \
                                  -d "rates_file='vivax_genome_rates'" \
                                  -d "N_transmitted='20'" \
                                  -d "p_males='0.25'" \
                                  -d "p_zygotes='0.1'" \
                                  -d "N_sporozoites='4000'" \
                                  -d "p_gut_saliva='0.2'" \
                                  -d "N_sampled='100'" mdsr.slim

The output:

WITHIN-VECTOR PARAMETER SUMMARY:

* Genome positions read from '/hpc/group/goldberg/ssg36/scripts/P_vivax/misc_scripts/PvP01_genome_positions'
* Recombination rates read from '/hpc/group/goldberg/ssg36/scripts/P_vivax/misc_scripts/vivax_genome_rates'
* Trees to be loaded from /work/ssg36/RM_sim2_rep5_SLiM_sim1_rep5/inf-H-seed_Hseed_to_V1204_on_day_17_simplified_annotated.ts
* 25.0% of infecting gametocytes are male
* Number of zygotes produced is 10.0% the number of infecting gametocytes
* 4000 sporozoites are produced
* 20.0% of sporozoites successfully invade salivary gland, on average
* 20.0 salivary sporozoites to be transmitted to host, on average
#WARNING (Species::RunInitializeCallbacks): with tree-sequence recording enabled and a non-zero mutation rate, a neutral mutation type was defined and used; this is legal, but usually undesirable, since neutral mutations can be overlaid later using the tree-sequence information.

STARTING SIMULATION

transmission/sampling start
transmission/sampling end
made it to day 18
transmission/sampling start
transmission/sampling end
start microgamete formation
end microgamete formation
start microgamete formation
end microgamete formation
start microgamete formation
end microgamete formation
start microgamete formation
end microgamete formation
start microgamete formation
end microgamete formation
made it to day 19
transmission/sampling start
transmission/sampling end
start zygote formation
end zygote formation
made it to day 20
transmission/sampling start
transmission/sampling end
made it to day 21
transmission/sampling start
transmission/sampling end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
sporozoite mitosis start
sporozoite mitosis end
made it to day 22
oocyst ruptures start
oocyst ruptures end
Segmentation fault

This step still runs just fine using slim 4.0.1 and produces the expected output. Any idea what could be going on? Thanks for your help!

petrelharp commented 9 months ago

I can confirm this happens with commit 18e37aea, the last commit before the tskit update (as well as the commit after). gdb suggests it happens in TallyMutationRunReferencesForPopulation.

I can also confirm that there's no segfault with v4.0. In trying to bisect the bug I'm running into problems - for some reason, I can't compile SLiM at a bunch of the intermediate commits (including v4.0.1, oddly).

petrelharp commented 9 months ago

ps. nice job pushing the boundaries of SLiM, @ShyamieG =)

bhaller commented 9 months ago

Hi @petrelharp and @ShyamieG! Sorry for the slow response; I was having a little downtime after getting SLiM 4.1 out the door. :->

Sometimes I break the build for a little while, usually unintentionally – things build on my machine, but have an error on GitHub Actions. (I know, I should always work with PRs and squash and so forth, to avoid this. :->) So, sorry for the difficulties bisecting, @petrelharp – thanks for trying. The bisect would show that the bug came in at the point where I re-architected the way that mutation frequencies are tallied under the hood. That rearchitecture does not look like it introduced a bug, but it exposed a previously dormant bug elsewhere in the code – so even if you got it to bisect, the bug is not actually in those diffs, it's a pre-existing bug probably dating back quite a while in a totally different part of the code. :->

The problem is with SLiM's tracking of which subpopulations contain null genomes (as used in haploids for example, as placeholders for the "missing" genome). In this model, addRecombinant() is used to generate haploids, with null genomes, into p7, and p7 is correctly marked as containing null genomes (which turns off certain optimizations, since things are more complicated with null genomes present – you have to check for them and handle them differently). So far so good. But then one of those haploids is cloned, with addCloned(), into p8, and p8 is not correctly marked as then containing null genomes. The new mutation frequency tallying code uses that flag for some optimizations, and says "aha, no null genomes, here we go!" and then promptly crashes because, hey, null genomes. :-> The nice thing about this class of bugs is that they always produce a hard crash, so they're easy to get in the debugger. Back in SLiM 4.0.1, the mutation tallying code didn't rely on that flag, and so it noticed the null genomes and handled them gracefully – bug masked.

So, I know what the bug is, I just need to fix it. I'll post again here when I've done so. :->

@elissasoroj This bug might bite shadie. If it does, then I will need to do a 4.1.1 release, since we need shadie to work. :-> Please let me know at your earliest convenience – just run a variety of shadie test models for your different mating systems, and if none of them crash, then shadie is fine. The bug would be a 100% reproducible hard crash. Thanks!

(If the bug doesn't bite shadie then it might not bite many models at all – you might be the only one, @ShyamieG. In that case, I might just ask you to build SLiM from sources to get the bug fix, since I don't really want to roll 4.1.1 right now.)

bhaller commented 9 months ago

Ha, it's not addCloned() putting the null genomes in p8, it's takeMigrants(). Sneaky. The bug was also present in addCloned(), but that does not appear to be biting this model, or at least it isn't the first crash we hit. :-> But since the bug is present in addCloned() too, I remain concerned about shadie. Anyhow, with the fixes to both methods in place, the model runs to completion without problems. I want to do some additional testing and checking before I commit this fix, though. I should have the fix in on Monday.

ShyamieG commented 9 months ago

Thank you so much! I'll look out for the official fix next week.

bhaller commented 9 months ago

Hi @ShyamieG! The fix is now pushed to GitHub. Please let me know if you encounter any further problems. Thanks for the very helpful bug report!