Open sdtemple opened 3 months ago
Thanks for the question. The
On the one hand, (1) if you want very accurate IBD segments (when gene conversion is not simulated = all breaks are due to sexual recombination), you can set <mincm>
to a smaller number so these smaller pieces can be written to the output. For point (2), I would avoid setting
On the other hand, of course, using a large
Thanks! This is a great and helpful response. I did realize that I had to simulate without gene conversion to the get the true IBD segments in the Browning-esque defintion.
In this pre-print (here), Supplementary Figure 9, I got the tskibd "true" IBD segments to give similar results to hap-ibd inferred IBD segments (for selection coefficient estimation). On the other hand, I would have expected (Table 1 and Supplementary Table 2 and other unpublished results) that the tskibd true IBD segments would not have biased selection coefficient estimates for s <= 0.02.
These observations are why I would agree with you: to get exact / highly accurate true IBD segment lengths you would have to have no gene conversion and set the sampling window to 1 or close to it.
Do you have any idea about the runtime and memory trade-offs as the
I am glad my explanation was helpful. I have not spend enough time on characterizing the relationship of run time with sample-window argument. But for small sample size, it won't be too bad. For large sample size, there will be more local trees and it can be slow.
Actually, I have been working on a separate project to implement the same function (calling true IBD) but fully take advantage of the high correlation between adjacent trees: for adjacent local trees, only a few edges are added, removed or rearranged . The new project aims to go through tree by tree (without skipping trees) so it can be extremely accurate but also very fast. The main algorithm has been coded but I will need to write more tests and make it highly parallelizable. I can share the code when I have it tested, maybe within a month.
Great. Thanks for all the responses and I look forward to your updates. I'll be doing some benchmarks on the tskibd runtime and accuracy for a project of mine as well, particularly as I play around with the sample window parameter. For accuracy, I would expect that the msprime + tskibd approach (w/o gene conversion) should get similar numbers of true IBD segments >= Y cM than the function simulate_ibd
in isweep when you set the chromosome to be of size 2 x Y cM and calculate the number of IBD segments overlapping the middle of such chromosomes.
In some preliminary results, I am finding that the msprime + tskibd approach is much faster at getting IBD segments than ARGON.
Does your program output all of the IBD segments? I am curious about the parameter. Is this parameter designed to manage memory and runtime resources? Could using too small of a value for this parameter lead to not every true IBD segment being reported?