Counting topologies plot

Tskit has a bunch of built-in methods for looking at the topology of tree sequences. Count_topologies allows us to fix a set of samples of interest, say 0 and 1, and determine the topology of each simplified three-leaved tree formed by adding another of the remaining samples (which we will call x). If 0 and 1 are closely related, we expect the topology in which they are siblings to be the most common across the tree sequence.

I made a big tree sequence and took a random subset of 4 individuals from it:

import tskit 
import msprime

ts_big = msprime.sim_ancestry(40, sequence_length=50000, recombination_rate=1e-4, random_seed=3)
individuals_sample = random.sample(range(ts_big.num_individuals-1),4)
ls = []
for m in individuals_sample:
    ls.append(ts_big.individual(m).nodes)
nodes_sample = np.concatenate(ls)
ts2 = ts_big.simplify(nodes_sample)

I then used the plotting method from #1 to identify samples that tended to be nearby each other in the tree sequence

span_df = make_span_df(ts2, 3)
plot_span_df(span_df, "Paired")

I chose samples 5 and 2 to focus on for this example. I wrote a function to count the topologies for a given sample set, along with the genomic region for each count:

def count_topologies(ts, sample_list):
    ### Loop through the trees in the ts, recording the count of every
    ### possible topology in each tree into a dataframe

    sample_sets = make_sample_sets(ts, sample_list)
    all_ranks = [ts.rank() for ts in tskit.all_trees(num_leaves = len(sample_list)+1)]
    all_nodes = tuple(range(len(sample_list)+1)) #to extract topologies including all nodes
    counts_df = pd.DataFrame(np.zeros((ts.num_trees, len(all_ranks))), columns=all_ranks).astype(int)

    for tree, tc in enumerate(ts.count_topologies(sample_sets=sample_sets)):
        for rank, count in tc.topologies[all_nodes].items():
            counts_df[rank][tree] = count

    counts_df['LeftInterval'] = [t.interval[0] for t in ts.trees()]
    counts_df['RightInterval'] = [t.interval[1] for t in ts.trees()]
    counts_df['Width'] = counts_df['RightInterval'] - counts_df['LeftInterval']
    return counts_df

I designed a plot that counts the n most common topologies across the tree sequence and plots their counts in a stacked bar chart. The legend shows what the topologies look like. With a lot of help from AI, I figured it out in this notebook.

As we would expect from the span-wise plot, samples 5 and 2 are mostly in a subtree together except for the last fifth of the genomic region.

Next step: I'm working on writing a self-contained plotting function to make this topologies plot that can work with arbitrary numbers of topologies and sample sets.

duncanMR / pedigrees_project

Counting topologies plot #2