NBISweden / pgip-tools

Population genomics in practice utility and exercise package
0 stars 0 forks source link

bug in pyslim.recapitate #1

Open petrelharp opened 1 year ago

petrelharp commented 1 year ago

Hi, Per - I've found this repository by searching for pyslim.recapitate on github, and thought you should know about a just-fixed bug that might potentially have affected you. Briefly: between pyslim 0.700 and 1.0.1, there was a bug that introduced a bottleneck to diploid size Ne=1 at the time of recapitation in most cases. That's fixed in pyslim>=1.0.2; see the CHANGELOG for some more details: https://github.com/tskit-dev/pyslim/blob/a0f9cf4ffe7366ba46d5fe9b7ebe633a1dbbf997/CHANGELOG.rst I have not scrutinized the code to see how it might affect you.

A great many apologies if this has affected you. Feel free to respond to this issue if you have questions (and email me if I don't notice).

Here's more information from a post to the SLiM email list:


How to fix it: Just install the most recent version of pyslim, e.g., by doing pip install --upgrade pyslim and make sure any requirements.txt files have pyslim>=1.0.2

Description: Before this fix, using the ancestral_Ne parameter to pyslim.recapitate( ) would introduce a bottleneck in each SLiM subpopulation down to diploid size Ne=1 for 1 or 2 generations in most situations. The bug would not occur if either (a) it was a WF simulation, with calls to addSubPop() in first() or early() and treeSeqOutput() in late(), or (b) it was a nonWF simulation, with calls to addSubPop() in first() and treeSeqOutput() in early() or late(). The fix correctly starts the msprime population with effective size ancestral_Ne at the time of the roots, which might be at the value of ts.metadata['SLiM']['tick'], this value minus 1, or this value minus 2. (Previously, the msprime population always started at ts.metadata['SLiM']['tick'] ticks ago, with populations of size 1 for the intervening 0, 1, or 2 ticks.)

Note: Now, recapitate throws an error if any roots of any trees are not at the same time as the others. I cannot think of any use cases this will cause a problem for, but let me know if you encounter one.

How might this affect you? The easy answer is - if you used pyslim.recapitate(..., ancestral_Ne=xxx) on a SLiM simulation fitting the description above, then the result you got is probably not what you intended, and you should re-do it. (Hopefully, you have the pre-recapitated tree sequence around, and can just do recapitation again.) What about published studies? Are they all wrong if they used recapitate? Well, that's a harder question - probably not, and all models are approximations anyhow, but some situations are more affected than others. The answer depends on how important the diversity introduced by recapitate is to the results. Here's some examples:

Example: I ran a SLiM simulation for 5N generations, and then recapitated. How would this affect me? Happily, probably not much. Recapitation is providing any ancestral diversity present at the time; and most of the genome will have coalesced by 5N generations ago. So, the bug will have resulted in less genetic diversity only on the small pieces of genome that had not coalesced by that point. How much less? Skipping the details, something like 50% less diversity on the 8% of the genome that hasn't coalesced; so, reducing genetic diversity at the end by 4%. Of course, if your method is specifically looking for short haplotypes from a long time ago, there is more cause for concern.

Example: I ran a large spatial simulation for 100 generations, and then recapitated. How would this affect me? Here I'd be more worried: the strong bottleneck 100 generations ago would reduce diversity by something like 50%, and specifically in very long haplotypes. As a result, the simulation would resemble a recently established population from a single migrant.

percyfal commented 1 year ago

Hi Peter, thanks for the heads up! I have so far only done one (fairly large) simulation in preparation of course material. The end result looked as expected, so I don't think this was a big problem for me. With that said, I will rerun the simulations anyway and pin to pyslim>=1.02.

/P

petrelharp commented 1 year ago

Ah, good! But... you'll need >=1.0.3, my bugfix had a bug (sorry).