May be some bugs in simulations.

RuBP17 commented 11 months ago

def get_introgressed_tracts(ts, chr_name, src_name, tgt_name, output):
    """
    Description:
        Outputs true introgressed tracts from a tree-sequence into a BED file.

    Arguments:
        ts tskit.TreeSequence: Tree-sequence containing introgressed tracts.
        chr_name int: Name of the chromosome.
        src_name str: Name of the source population.
        tgt_name str: Name of the target population.
        output string: Name of the output file.
    """
    introgressed_tracts = []

    for p in ts.populations():
        source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
        target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]

    for m in ts.migrations():
        if m.dest == source_id: introgressed_tracts.append((int(m.left), int(m.right)))

    introgressed_tracts.sort(key=lambda x:x[0])
    with open(output, 'w') as o:
        for t in introgressed_tracts:
            o.write(f'1\t{t[0]}\t{t[1]}\n')

In this function, all the introgression segments from the "src" are recorded. In fact, not all the segments from the "src" are the introgression segments, some of them may not go into the target segments. And the "tgt_id" is not used. This bug may occur under complexed demographic models like 2 src models.

In 2src simulation (HumanNeanderthalDenisovan)

if wildcards.demog == 'HumanNeanderthalDenisovan':
            graph = demes.load("config/simulation/models/HumanNeanderthalDenisovan_PapuansOutOfAfrica_10J19.yaml")
            demography = msprime.Demography.from_demes(graph)
            ref_id = 'YRI'
            tgt_id = 'Papuan'
            src1_id = 'NeaA'
            src2_id = 'DenA'
            mutation_rate = 1.4e-8
            recombination_rate = 1e-8

            T_DenA = 59682 / 29
            T_NeaA = 75748 / 29

            samples = [
                msprime.SampleSet(int(wildcards.nref), ploidy=2, population=ref_id),
                msprime.SampleSet(int(wildcards.ntgt), ploidy=2, population=tgt_id),
                msprime.SampleSet(1, ploidy=2, population=src1_id, time=T_NeaA),
                msprime.SampleSet(1, ploidy=2, population=src2_id, time=T_DenA),
            ]

The src samples are NeaA and DenA. But in 'get_tracts', the src become "Nea1", "Den1" and "Den2". Maybe there is something wrong.

rule get_tracts:
    input:
        ts = rules.simulation.output.ts,
    output:
        src1_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src1.introgressed.tracts.bed",
        src2_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src2.introgressed.tracts.bed",
    threads: 1,
    resources: time_min=120, mem_mb=5000, cpus=1,
    run:
        ts = tskit.load(input.ts)

        if wildcards.demog == 'HumanNeanderthalDenisovan':
            src1_id = "Nea1"
            src2_id = "Den1"
            src3_id = "Den2"
            tgt_id = "Papuan"

            src3_tracts = output_dir + f'simulated_data/{wildcards.demog}/nref_{wildcards.nref}/ntgt_{wildcards.ntgt}/{wildcards.seed}/sim.src3.introgressed.tracts.bed'
            get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src3_id, tgt_name=tgt_id, output=src3_tracts)

            a = pybedtools.BedTool(output.src2_tracts)
            b = pybedtools.BedTool(src3_tracts)
            a.cat(b).sort().merge().saveas(output.src2_tracts)

xin-huang commented 11 months ago

Hello @RuBP17, thanks for testing the sstar analysis.

Regarding the first question,

not all the segments from the "src" are the introgression segments, some of them may not go into the target segments.

This is correct. However, here we used msprime, which is a backward time simulator, for simulation. As mentioned in the tskit tutorial for introgression, one should be careful in reverse time terminology.

Actually, these codes

for p in ts.populations():
        source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
        target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]

 for m in ts.migrations():
        if m.dest == source_id: introgressed_tracts.append((int(m.left), int(m.right)))

are modified from the get_migrating_tracts function in the above tutorial, replacing neanderthal_id with source_id:

def get_migrating_tracts(ts):
    neanderthal_id = [p.id for p in ts.populations() if p.metadata['name']=='Neanderthal'][0]
    migrating_tracts = []
    # Get all tracts that migrated into the neanderthal population
    for migration in ts.migrations():
        if migration.dest == neanderthal_id:
            migrating_tracts.append((migration.left, migration.right))
    return np.array(migrating_tracts)

As explained in the tutorial, this function finds

the set of tracts that exist in the Eurasian genome because they have come from Neanderthals via admixture at time T_ad (be careful here: in reverse time terminology, we denote the “source” population as Eurasian and the “destination” population as Neanderthals). This is done simply by finding all migration records in which the “destination” population name is Neanderthal.

For the second question, the HumanNeanderthalDenisovan demographic model was taken from stdpopsim. As defined in the Catalog, NeaA is the Altai Neanderthal lineage, and DenA is the Altai Denisovan lineage, which are the sampled genomes, whereas Nea1 and Den1 are the actual populations providing introgression materials.

However, in this model, there are several introgressed populations.

https://github.com/admixVIE/sstar-analysis/blob/1b5628df5e9e89f0a16801081d473fa7060dc98d/config/simulation/models/HumanNeanderthalDenisovan_PapuansOutOfAfrica_10J19.yaml#L85-L91

Therefore, the truth tracts obtained from these codes:

rule get_tracts:
    input:
        ts = rules.simulation.output.ts,
    output:
        src1_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src1.introgressed.tracts.bed",
        src2_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src2.introgressed.tracts.bed",
    threads: 1,
    resources: time_min=120, mem_mb=5000, cpus=1,
    run:
        ts = tskit.load(input.ts)

        if wildcards.demog == 'HumanNeanderthalDenisovan':
            src1_id = "Nea1"
            src2_id = "Den1"
            src3_id = "Den2"
            tgt_id = "Papuan"

            src3_tracts = output_dir + f'simulated_data/{wildcards.demog}/nref_{wildcards.nref}/ntgt_{wildcards.ntgt}/{wildcards.seed}/sim.src3.introgressed.tracts.bed'
            get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src3_id, tgt_name=tgt_id, output=src3_tracts)

            a = pybedtools.BedTool(output.src2_tracts)
            b = pybedtools.BedTool(src3_tracts)
            a.cat(b).sort().merge().saveas(output.src2_tracts)

not only contain the introgressed fragments in Papuans but also in the CHB and Ghost populations. Yes, these may affect the results of the performance comparison (@kuhlwilm).

RuBP17 commented 11 months ago

Maybe we can use the tree to find out where the segments go.

source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]

Testpopulation = ts.get_samples(target_id)
for m in ts.migrations():
    if m.dest == source_id:
        for tree in ts.trees(leaf_lists=True):
                        # find the trees among the (m.left and m.right)
            if m.left > tree.get_interval()[0]:
                continue
            if m.right <= tree.get_interval()[0]:
                break

            for l in tree.leaves(mr.node):
                        # use leave to find out where the segments go
                if l in Testpopulation:
                    de_seg[l].append(tree.get_interval())

I tested this code under some simple models like HumanNeanderthal, It could find the same segments as the original code, because there is only one intro pop.

And I also tested this code under the complexed model HumanDenisovanNeanderthal, the output segments are much fewer than origin codes and the proportion of introgression segments seems more reasonable.

xin-huang commented 11 months ago

Could you please try the following codes and see whether you can get reasonable results?

def _get_true_tracts(ts, tgt_id, src_id, ploidy):
    """
    Description:
        Helper function to obtain ground truth introgressed tracts from tree-sequence.

    Arguments:
        ts tskit.TreeSqueuece: Tree-sequence containing ground truth introgressed tracts.
        tgt_id str: Name of the target population. 
        src_id str: Name of the source population.
        ploidy int: Ploidy of the genomes.
    """
    tracts = {}
    introgression = []

    for p in ts.populations():
        source_id = [p.id for p in ts.populations() if p.metadata['name']==src_id][0]
        target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_id][0]

    for i in range(ts.num_samples):
        node = ts.node(i)
        if node.population == target_id: tracts[node.id] = []

    for m in ts.migrations():
        if m.dest == source_id: introgression.append(m)

    for i in introgression:
        for t in ts.trees():
            # Tree-sequences are sorted by the left ends of the intervals
            # Can skip those tree-sequences are not overlapped with the interval of i.
            if i.left >= t.interval.right: continue
            if i.right <= t.interval.left: break # [l, r)
            for n in tracts.keys():
                left = i.left if i.left > t.interval.left else t.interval.left
                right = i.right if i.right < t.interval.right else t.interval.right
                if t.is_descendant(n, i.node): tracts[n].append([1, int(left), int(right), f'tsk_{ts.node(n).individual}_{int(n%ploidy+1)}'])

    return tracts

RuBP17 commented 11 months ago

yes, I followed your code and get reasonable results. The results were small intervals and I merged it.

def combine_segs(segs_dict):
    combined_segs = {}
    for node, segs in segs_dict.items():
        tgt_id = segs[0][3]
        segs_np = np.array(segs)[:,1:3].astype(int)
        merged = np.empty([0, 2],dtype=np.int64)
        sorted_segs = segs_np[np.argsort(segs_np[:, 1]), :]
        for higher in sorted_segs:
            if len(merged) == 0:
                merged = np.vstack([merged, higher])
            else:
                lower = merged[-1, :]
                if higher[0] <= lower[1]:
                    upper_bound = max(lower[1], higher[1])
                    merged[-1, :] = (lower[0], upper_bound)
                else:
                    merged = np.vstack([merged, higher])
        combined_segs[tgt_id] = merged.tolist()
    return combined_segs

the output is like

{'tsk_10_1': [[4752671, 4822156],
  [5515440, 5577464],
  [5656534, 5782219],
  [5786697, 5912558],
  [9751616, 9776781]],
 'tsk_10_2': [[1081574, 1155350],
  [5515440, 5646920],
  [5787124, 5848448],
  [6489042, 6508794]]}

The results of origin code are

[(3194731, 3215569),
 (990262, 996713),
 (4815364, 4822156),
 (2881951, 2913192),
 (5546817, 5577464),
 (5577464, 5602289),
 (5607978, 5640348),
 (8966597, 8969216),
 (5885349, 5912558),
 (4783723, 4799208),
 (7249153, 7268325),
 (7268325, 7269434),
 (7269434, 7273435),
 (7273435, 7274846),
 (7274846, 7329602),
 (4799208, 4815364),
 (5640348, 5646920),
 (914195, 990262),
 (5536479, 5546817),
 (1382099, 1386707),
 (3122723, 3126090),
 (8966176, 8966597),
 (5515440, 5536479),
 (8969216, 8996656),
 (4881243, 4914243),
 (7085758, 7107298),
 (7199256, 7269434),
 (7269434, 7281135),
 (2653571, 2654761),
 (3836413, 3870191),
 (5604596, 5607978),
 (7435721, 7446997),
 (7482554, 7600984),
 (3812997, 3823362),
 (408670, 513053),
 (618711, 641597),
 (831758, 914195),
 (3800141, 3812997),
 (3782819, 3800141),
 (3823362, 3825790),
 (3825790, 3836413),
 (2878930, 2881951),
 (6489042, 6508794),
 (5656534, 5782219),
 (5602289, 5604596),
 (1212642, 1225440),
 (1225440, 1242849),
 (5786697, 5787124),
 (5787124, 5848448),
 (5848448, 5885349),
 (2263423, 2292123),
 (9751616, 9776781),
 (9999209, 10000000),
 (3193501, 3194731),
 (7239338, 7249153),
 (9111095, 9138914),
 (9138914, 9171257),
 (3119511, 3122723),
 (7329602, 7348552),
 (3079017, 3110135),
 (3870191, 3875974),
 (7137745, 7150696),
 (4752671, 4783723),
 (1081574, 1155350),
 (1242849, 1252345),
 (1371432, 1382099),
 (593168, 616656)]

Total length is 10**7 bp. the proportion of the introgressed segments of origin code is 0.08193445. the proportion of the introgressed segments of new code is 0.0347276.

admixVIE / sstar-analysis

May be some bugs in simulations. #4