Developing methods to identify pedigree structure in tree sequence data created from Whole Genome Sequencing samples. Part of my rotation project with Dr Jerome Kelleher's lab at the Big Data Insitute, University of Oxford.
Tskit has a bunch of built-in methods for looking at the topology of tree sequences. Count_topologies allows us to fix a set of samples of interest, say 0 and 1, and determine the topology of each simplified three-leaved tree formed by adding another of the remaining samples (which we will call x). If 0 and 1 are closely related, we expect the topology in which they are siblings to be the most common across the tree sequence.
I made a big tree sequence and took a random subset of 4 individuals from it:
import tskit
import msprime
ts_big = msprime.sim_ancestry(40, sequence_length=50000, recombination_rate=1e-4, random_seed=3)
individuals_sample = random.sample(range(ts_big.num_individuals-1),4)
ls = []
for m in individuals_sample:
ls.append(ts_big.individual(m).nodes)
nodes_sample = np.concatenate(ls)
ts2 = ts_big.simplify(nodes_sample)
I then used the plotting method from #1 to identify samples that tended to be nearby each other in the tree sequence
I chose samples 5 and 2 to focus on for this example. I wrote a function to count the topologies for a given sample set, along with the genomic region for each count:
def count_topologies(ts, sample_list):
### Loop through the trees in the ts, recording the count of every
### possible topology in each tree into a dataframe
sample_sets = make_sample_sets(ts, sample_list)
all_ranks = [ts.rank() for ts in tskit.all_trees(num_leaves = len(sample_list)+1)]
all_nodes = tuple(range(len(sample_list)+1)) #to extract topologies including all nodes
counts_df = pd.DataFrame(np.zeros((ts.num_trees, len(all_ranks))), columns=all_ranks).astype(int)
for tree, tc in enumerate(ts.count_topologies(sample_sets=sample_sets)):
for rank, count in tc.topologies[all_nodes].items():
counts_df[rank][tree] = count
counts_df['LeftInterval'] = [t.interval[0] for t in ts.trees()]
counts_df['RightInterval'] = [t.interval[1] for t in ts.trees()]
counts_df['Width'] = counts_df['RightInterval'] - counts_df['LeftInterval']
return counts_df
I designed a plot that counts the n most common topologies across the tree sequence and plots their counts in a stacked bar chart. The legend shows what the topologies look like. With a lot of help from AI, I figured it out in this notebook.
As we would expect from the span-wise plot, samples 5 and 2 are mostly in a subtree together except for the last fifth of the genomic region.
Next step: I'm working on writing a self-contained plotting function to make this topologies plot that can work with arbitrary numbers of topologies and sample sets.
Tskit has a bunch of built-in methods for looking at the topology of tree sequences. Count_topologies allows us to fix a set of samples of interest, say 0 and 1, and determine the topology of each simplified three-leaved tree formed by adding another of the remaining samples (which we will call x). If 0 and 1 are closely related, we expect the topology in which they are siblings to be the most common across the tree sequence.
I made a big tree sequence and took a random subset of 4 individuals from it:
I then used the plotting method from #1 to identify samples that tended to be nearby each other in the tree sequence
I chose samples 5 and 2 to focus on for this example. I wrote a function to count the topologies for a given sample set, along with the genomic region for each count:
I designed a plot that counts the n most common topologies across the tree sequence and plots their counts in a stacked bar chart. The legend shows what the topologies look like. With a lot of help from AI, I figured it out in this notebook.
As we would expect from the span-wise plot, samples 5 and 2 are mostly in a subtree together except for the last fifth of the genomic region.
Next step: I'm working on writing a self-contained plotting function to make this topologies plot that can work with arbitrary numbers of topologies and sample sets.