Closest sample to another sample, span-wise

hyanwong commented 1 year ago

I just had an interesting idea: If we don't know who is the closest relative of an individual over the whole ARG, we could collect the spans over which each sample node is another sample's "nearest neighbour". Something like this:

import numpy as np
import msprime

ts = msprime.sim_ancestry(10, sequence_length=1e8, recombination_rate=1e-4, random_seed=1)
print(ts.num_trees, "trees in the tree sequence")
sample_span = np.zeros((ts.num_samples, ts.num_samples))  # careful, this is quadratic in sample size

for tree in ts.simplify().trees(sample_lists=True):
    for u in ts.samples():
        for nearest in tree.samples(tree.parent(u)):
            if nearest != u:
                sample_span[u, nearest] += tree.span
sample_span = sample_span / ts.sequence_length

for i, s in enumerate(sample_span):
    sample_node_id = ts.samples()[i]
    closest = {i: span for i, span in enumerate(s)}
    closest = {i[0]: i[1] for i in sorted(closest.items(), key=lambda x: x[1], reverse=True)}
    print(f"Closest samples to sample_node {sample_node_id}: {closest}\n")

hyanwong commented 1 year ago

Here's an interesting plot of that statistic, for that particular example (msprime.sim_ancestry(10, sequence_length=1e8, recombination_rate=1e-4, random_seed=1))

import matplotlib.pyplot as plt

# for simplicity, only plot the top 3 relatives for each sample
top_3 = np.argsort(sample_span, axis=1)[:, -1:-4:-1]
width = 0.5

fig, ax = plt.subplots()
bottom = np.zeros(ts.num_samples)

for indexes in top_3.T:
    vals = [sample_span[i, j] for i, j in enumerate(indexes)]
    p = ax.bar(ts.samples(), vals, width, bottom=bottom)
    bottom += vals

print("13", sample_span[13, :])

I have printed out the value for sample node 13, and indeed it seems like it has the same closest relative over the entire tree sequence (11271 trees!), to the exclusion of any other samples:

13 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]

So even in this case, when we have no simulated sibling structure (and random mating), we have some samples for whom the closest relative is unique and the same over the entire genome.

hyanwong commented 1 year ago

The weirdness of having a single close relative over the entire genome may be a feature of the default ("Moran"-type) continuous timescale in msprime. If nodes 13 and 18 coalesced with each other within a fraction of a "generation", there would be no time for recombination to occur.

It might be that if we ran a wright-fisher model instead, we wouldn't get this.

duncanMR commented 1 year ago

Thanks for the great idea and implementation! I just wrote a function to perform this plot in the 1000 Genomes notebook. I wanted to color the bars by which sample they refer to; I used seaborn because it seemed it would be easier to do (and my matplotlib wizardry skills are poor). I think making the bars the same height makes it easier to read. I look forward to seeing if it works well on data with family relationships.

hyanwong commented 1 year ago

Nice. I wonder if making the bars the same height will, however, obscure the cases where the nearest neighbour is, on average, half the genome (i.e. sibs) versus 1/4, versus 1/8th etc? It probably doesn't matter for a single chromosome, but may if you are integrating across many chromosomes.

duncanMR commented 1 year ago

That's a fair point! I will make the default behaviour to not make them the same height.

duncanMR / pedigrees_project

Closest sample to another sample, span-wise #1