bguo068 / tskibd

Calculate IBD segments from tree sequence using tskit C API
MIT License
1 stars 0 forks source link

update ibd matrix based on a relevant subset of ancestors #2

Open bguo068 opened 1 year ago

bguo068 commented 1 year ago

Adjacent trees are highly correlated, with the difference being a few deleted/added edges. From one tree to the next, we only need to update the IBD matrix between pairs of leaves of subtrees rooted at different child nodes of the parent node of a deleted edge. From the Kelleher et al 2016 paper, fig4, we can sort the edge table by the 'right' coordinate and obtain which edges are deleted. We collect all 'parent' nodes of the delete edges and only update IBD matrix elements due to the changes of these nodes.

This should also work when we are only compare trees at two adjacent sampling points. In this case, we can multiple local trees between the sampling points, we update the IBD matrix related to parents nodes that have changed at any tree iteration between the two sampling points.

bguo068 commented 1 year ago
import msprime
import numpy as np
import pandas as pd

ts = msprime.sim_ancestry(
    samples=1000, # diploids
    recombination_rate=1e-8,
    sequence_length=100 * 1e6,
    population_size=10_000,
)

# turn edge table to pandas dataframe
d = ts.tables.edges.asdict()
d.pop('metadata')
d.pop("metadata_offset")
d.pop("metadata_schema")
et = pd.DataFrame(d)

# sort edge table according to 'right' (the order of edges being deleted)
# during tree iteration
et1=et.sort_values(['right', 'parent'])

# get a list tree end positions
ends = np.sort(et.right.unique())

# a total number of relevant parent nodes between sampling points
# separated by 30 local trees. This is equivalent to 0.01 cM step size.
sz_lst = []
for i in range(len(ends) - 40):
    sz = et1.loc[(et1.right>=ends[i]) & (et1.right<ends[i+30]) ,'parent'].unique().size
    sz_lst.append(sz)
print(np.unique(sz_lst, return_counts=True))

# 
# (array([39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
#         56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
#         73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]),
#  array([    3,     3,     3,    10,    18,    56,    79,   124,   235,
#           397,   692,  1081,  1572,  2269,  2981,  4115,  5378,  7095,
#          8614, 10861, 12873, 14811, 16791, 18457, 19653, 20746, 20892,
#         20197, 19215, 17390, 15852, 13518, 11430,  9234,  7270,  5298,
#          3999,  2819,  1859,  1197,   713,   436,   226,   121,    44,
#            20,     4,     1]))