elissasoroj / shadie

GNU General Public License v3.0
11 stars 4 forks source link

bug in pyslim.recapitate #15

Closed petrelharp closed 11 months ago

petrelharp commented 1 year ago

Hi - I've found this repository by searching for pyslim.recapitate on github, and thought you should know about a just-fixed bug that might potentially have affected you. Briefly: between pyslim 0.700 and 1.0.1, there was a bug that introduced a bottleneck to diploid size Ne=1 at the time of recapitation in most cases. That's fixed in pyslim>=1.0.2; see the CHANGELOG for some more details: https://github.com/tskit-dev/pyslim/blob/a0f9cf4ffe7366ba46d5fe9b7ebe633a1dbbf997/CHANGELOG.rst I have not scrutinized the code to see how it might affect you.

A great many apologies if this has affected you. Feel free to respond to this issue if you have questions (and email me if I don't notice).

Here's more information from a post to the SLiM email list:


How to fix it: Just install the most recent version of pyslim, e.g., by doing pip install --upgrade pyslim and make sure any requirements.txt files have pyslim>=1.0.2

Description: Before this fix, using the ancestral_Ne parameter to pyslim.recapitate( ) would introduce a bottleneck in each SLiM subpopulation down to diploid size Ne=1 for 1 or 2 generations in most situations. The bug would not occur if either (a) it was a WF simulation, with calls to addSubPop() in first() or early() and treeSeqOutput() in late(), or (b) it was a nonWF simulation, with calls to addSubPop() in first() and treeSeqOutput() in early() or late(). The fix correctly starts the msprime population with effective size ancestral_Ne at the time of the roots, which might be at the value of ts.metadata['SLiM']['tick'], this value minus 1, or this value minus 2. (Previously, the msprime population always started at ts.metadata['SLiM']['tick'] ticks ago, with populations of size 1 for the intervening 0, 1, or 2 ticks.)

Note: Now, recapitate throws an error if any roots of any trees are not at the same time as the others. I cannot think of any use cases this will cause a problem for, but let me know if you encounter one.

How might this affect you? The easy answer is - if you used pyslim.recapitate(..., ancestral_Ne=xxx) on a SLiM simulation fitting the description above, then the result you got is probably not what you intended, and you should re-do it. (Hopefully, you have the pre-recapitated tree sequence around, and can just do recapitation again.) What about published studies? Are they all wrong if they used recapitate? Well, that's a harder question - probably not, and all models are approximations anyhow, but some situations are more affected than others. The answer depends on how important the diversity introduced by recapitate is to the results. Here's some examples:

Example: I ran a SLiM simulation for 5N generations, and then recapitated. How would this affect me? Happily, probably not much. Recapitation is providing any ancestral diversity present at the time; and most of the genome will have coalesced by 5N generations ago. So, the bug will have resulted in less genetic diversity only on the small pieces of genome that had not coalesced by that point. How much less? Skipping the details, something like 50% less diversity on the 8% of the genome that hasn't coalesced; so, reducing genetic diversity at the end by 4%. Of course, if your method is specifically looking for short haplotypes from a long time ago, there is more cause for concern.

Example: I ran a large spatial simulation for 100 generations, and then recapitated. How would this affect me? Here I'd be more worried: the strong bottleneck 100 generations ago would reduce diversity by something like 50%, and specifically in very long haplotypes. As a result, the simulation would resemble a recently established population from a single migrant.

elissasoroj commented 11 months ago

Hi Peter,

Thanks for letting me know! I'll make sure to update the dependencies in shadie.

~Elissa