duncanMR / pedigrees_project

Developing methods to identify pedigree structure in tree sequence data created from Whole Genome Sequencing samples. Part of my rotation project with Dr Jerome Kelleher's lab at the Big Data Insitute, University of Oxford.
0 stars 0 forks source link

Counting topologies plot #2

Open duncanMR opened 1 year ago

duncanMR commented 1 year ago

Tskit has a bunch of built-in methods for looking at the topology of tree sequences. Count_topologies allows us to fix a set of samples of interest, say 0 and 1, and determine the topology of each simplified three-leaved tree formed by adding another of the remaining samples (which we will call x). If 0 and 1 are closely related, we expect the topology in which they are siblings to be the most common across the tree sequence.

I made a big tree sequence and took a random subset of 4 individuals from it:

import tskit 
import msprime

ts_big = msprime.sim_ancestry(40, sequence_length=50000, recombination_rate=1e-4, random_seed=3)
individuals_sample = random.sample(range(ts_big.num_individuals-1),4)
ls = []
for m in individuals_sample:
    ls.append(ts_big.individual(m).nodes)
nodes_sample = np.concatenate(ls)
ts2 = ts_big.simplify(nodes_sample)

I then used the plotting method from #1 to identify samples that tended to be nearby each other in the tree sequence

span_df = make_span_df(ts2, 3)
plot_span_df(span_df, "Paired")

image

I chose samples 5 and 2 to focus on for this example. I wrote a function to count the topologies for a given sample set, along with the genomic region for each count:

def count_topologies(ts, sample_list):
    ### Loop through the trees in the ts, recording the count of every
    ### possible topology in each tree into a dataframe

    sample_sets = make_sample_sets(ts, sample_list)
    all_ranks = [ts.rank() for ts in tskit.all_trees(num_leaves = len(sample_list)+1)]
    all_nodes = tuple(range(len(sample_list)+1)) #to extract topologies including all nodes
    counts_df = pd.DataFrame(np.zeros((ts.num_trees, len(all_ranks))), columns=all_ranks).astype(int)

    for tree, tc in enumerate(ts.count_topologies(sample_sets=sample_sets)):
        for rank, count in tc.topologies[all_nodes].items():
            counts_df[rank][tree] = count

    counts_df['LeftInterval'] = [t.interval[0] for t in ts.trees()]
    counts_df['RightInterval'] = [t.interval[1] for t in ts.trees()]
    counts_df['Width'] = counts_df['RightInterval'] - counts_df['LeftInterval']
    return counts_df

I designed a plot that counts the n most common topologies across the tree sequence and plots their counts in a stacked bar chart. The legend shows what the topologies look like. With a lot of help from AI, I figured it out in this notebook. image

As we would expect from the span-wise plot, samples 5 and 2 are mostly in a subtree together except for the last fifth of the genomic region.

Next step: I'm working on writing a self-contained plotting function to make this topologies plot that can work with arbitrary numbers of topologies and sample sets.