bguo068 / tskibd

Calculate IBD segments from tree sequence using tskit C API
MIT License
1 stars 0 forks source link

Exact or approximate number of IBD segments #5

Open sdtemple opened 3 months ago

sdtemple commented 3 months ago

Does your program output all of the IBD segments? I am curious about the parameter. Is this parameter designed to manage memory and runtime resources? Could using too small of a value for this parameter lead to not every true IBD segment being reported?

bguo068 commented 3 months ago

Thanks for the question. The control how frequent local trees along the genome are being checked for breaks of ancestral segments. The smaller is, the more trees are checked, which means your ends (start and end coordinates) of your IBD segments are more precise, and it takes more time to run.

On the one hand, (1) if you want very accurate IBD segments (when gene conversion is not simulated = all breaks are due to sexual recombination), you can set is set 1. (2) However, if gene conversion is simulated, breaks can be caused by not only recombination but also gene conversion. In this case, using a small value of will break down a long true IBD segments into smaller pieces as a result of gene conversion. When these small pieces are shorter than the threshold say 2cM, they won't be reported. You can check whether this is true, by setting <mincm> to a smaller number so these smaller pieces can be written to the output. For point (2), I would avoid setting to a very small number (such as less than gene conversion length).

On the other hand, of course, using a large will ignore breaks that happened in local trees that are between adjacent sampled trees. You might see fewer breaks (thus fewer IBD) than expected.

sdtemple commented 3 months ago

Thanks! This is a great and helpful response. I did realize that I had to simulate without gene conversion to the get the true IBD segments in the Browning-esque defintion.

In this pre-print (here), Supplementary Figure 9, I got the tskibd "true" IBD segments to give similar results to hap-ibd inferred IBD segments (for selection coefficient estimation). On the other hand, I would have expected (Table 1 and Supplementary Table 2 and other unpublished results) that the tskibd true IBD segments would not have biased selection coefficient estimates for s <= 0.02.

These observations are why I would agree with you: to get exact / highly accurate true IBD segment lengths you would have to have no gene conversion and set the sampling window to 1 or close to it.

Do you have any idea about the runtime and memory trade-offs as the decreases?

bguo068 commented 3 months ago

I am glad my explanation was helpful. I have not spend enough time on characterizing the relationship of run time with sample-window argument. But for small sample size, it won't be too bad. For large sample size, there will be more local trees and it can be slow.

Actually, I have been working on a separate project to implement the same function (calling true IBD) but fully take advantage of the high correlation between adjacent trees: for adjacent local trees, only a few edges are added, removed or rearranged . The new project aims to go through tree by tree (without skipping trees) so it can be extremely accurate but also very fast. The main algorithm has been coded but I will need to write more tests and make it highly parallelizable. I can share the code when I have it tested, maybe within a month.

sdtemple commented 3 months ago

Great. Thanks for all the responses and I look forward to your updates. I'll be doing some benchmarks on the tskibd runtime and accuracy for a project of mine as well, particularly as I play around with the sample window parameter. For accuracy, I would expect that the msprime + tskibd approach (w/o gene conversion) should get similar numbers of true IBD segments >= Y cM than the function simulate_ibd in isweep when you set the chromosome to be of size 2 x Y cM and calculate the number of IBD segments overlapping the middle of such chromosomes.

In some preliminary results, I am finding that the msprime + tskibd approach is much faster at getting IBD segments than ARGON.