Open RuBP17 opened 11 months ago
Hello @RuBP17, thanks for testing the sstar analysis.
Regarding the first question,
not all the segments from the "src" are the introgression segments, some of them may not go into the target segments.
This is correct. However, we used msprime, which is a backward-time simulator, for the simulation. As mentioned in the tskit tutorial for introgression, one should be careful with reverse-time terminology.
Actually, this code
for p in ts.populations():
    source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
    target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]
for m in ts.migrations():
    if m.dest == source_id: introgressed_tracts.append((int(m.left), int(m.right)))
is modified from the get_migrating_tracts function in the above tutorial, replacing neanderthal_id with source_id:
def get_migrating_tracts(ts):
    neanderthal_id = [p.id for p in ts.populations() if p.metadata['name']=='Neanderthal'][0]
    migrating_tracts = []
    # Get all tracts that migrated into the neanderthal population
    for migration in ts.migrations():
        if migration.dest == neanderthal_id:
            migrating_tracts.append((migration.left, migration.right))
    return np.array(migrating_tracts)
As explained in the tutorial, this function finds
the set of tracts that exist in the Eurasian genome because they have come from Neanderthals via admixture at time T_ad (be careful here: in reverse time terminology, we denote the “source” population as Eurasian and the “destination” population as Neanderthals). This is done simply by finding all migration records in which the “destination” population name is Neanderthal.
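As a toy illustration of this reverse-time convention, the sketch below filters migration records by their "destination" population. It uses hypothetical plain-Python records (the `Migration` tuple, the population ids, and `tracts_into` are all made up for illustration) rather than a real tree sequence, so it only mirrors the filtering logic, not the tskit API.

```python
from collections import namedtuple

# Hypothetical stand-in for a migration record; only the fields used in
# this thread are modelled. This is NOT the tskit class.
Migration = namedtuple("Migration", "left right node source dest")

# Hypothetical population ids, chosen for illustration only.
EURASIAN, NEANDERTHAL, OTHER = 0, 1, 2

records = [
    Migration(0, 100, 7, EURASIAN, NEANDERTHAL),
    Migration(100, 250, 8, EURASIAN, NEANDERTHAL),
    Migration(300, 400, 9, EURASIAN, OTHER),
]

def tracts_into(records, dest_id):
    """Collect (left, right) for records whose reverse-time destination is dest_id."""
    return [(m.left, m.right) for m in records if m.dest == dest_id]

# Backward in time the lineage moves Eurasian -> Neanderthal, which forward
# in time corresponds to Neanderthal material entering Eurasians.
neanderthal_tracts = tracts_into(records, NEANDERTHAL)
print(neanderthal_tracts)
```

Note that this filter only looks at the destination of each record; it says nothing about which present-day samples inherit the segment.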
For the second question, the HumanNeanderthalDenisovan demographic model was taken from stdpopsim. As defined in the Catalog, NeaA is the Altai Neanderthal lineage and DenA is the Altai Denisovan lineage, which are the sampled genomes, whereas Nea1 and Den1 are the actual populations providing the introgressed material.
However, in this model, several populations receive introgression.
Therefore, the truth tracts obtained from this code:
rule get_tracts:
    input:
        ts = rules.simulation.output.ts,
    output:
        src1_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src1.introgressed.tracts.bed",
        src2_tracts = output_dir + "simulated_data/{demog}/nref_{nref}/ntgt_{ntgt}/{seed}/sim2src.src2.introgressed.tracts.bed",
    threads: 1,
    resources: time_min=120, mem_mb=5000, cpus=1,
    run:
        ts = tskit.load(input.ts)
        if wildcards.demog == 'HumanNeanderthalDenisovan':
            src1_id = "Nea1"
            src2_id = "Den1"
            src3_id = "Den2"
            tgt_id = "Papuan"
            src3_tracts = output_dir + f'simulated_data/{wildcards.demog}/nref_{wildcards.nref}/ntgt_{wildcards.ntgt}/{wildcards.seed}/sim.src3.introgressed.tracts.bed'
            get_introgressed_tracts(ts, chr_name=1, src_name=src1_id, tgt_name=tgt_id, output=output.src1_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src2_id, tgt_name=tgt_id, output=output.src2_tracts)
            get_introgressed_tracts(ts, chr_name=1, src_name=src3_id, tgt_name=tgt_id, output=src3_tracts)
            a = pybedtools.BedTool(output.src2_tracts)
            b = pybedtools.BedTool(src3_tracts)
            a.cat(b).sort().merge().saveas(output.src2_tracts)
not only contain the introgressed fragments in Papuans but also those in the CHB and Ghost populations. Yes, these may affect the results of the performance comparison (@kuhlwilm).
Maybe we can use the tree to find out where the segments go.
source_id = [p.id for p in ts.populations() if p.metadata['name']==src_name][0]
target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_name][0]
Testpopulation = ts.get_samples(target_id)
# one list of segments per target sample (initialisation not shown in the original snippet)
de_seg = {l: [] for l in Testpopulation}
for m in ts.migrations():
    if m.dest == source_id:
        for tree in ts.trees(leaf_lists=True):
            # find the trees between m.left and m.right
            if m.left > tree.get_interval()[0]:
                continue
            if m.right <= tree.get_interval()[0]:
                break
            for l in tree.leaves(m.node):
                # use the leaves to find out where the segments go
                if l in Testpopulation:
                    de_seg[l].append(tree.get_interval())
I tested this code under a simple model, HumanNeanderthal. It finds the same segments as the original code, because there is only one introgressed population.
I also tested it under the more complex model HumanNeanderthalDenisovan: it outputs far fewer segments than the original code, and the proportion of introgressed segments seems more reasonable.
Could you please try the following code and see whether you get reasonable results?
def _get_true_tracts(ts, tgt_id, src_id, ploidy):
    """
    Description:
        Helper function to obtain ground truth introgressed tracts from tree-sequence.
    Arguments:
        ts tskit.TreeSequence: Tree-sequence containing ground truth introgressed tracts.
        tgt_id str: Name of the target population.
        src_id str: Name of the source population.
        ploidy int: Ploidy of the genomes.
    """
    tracts = {}
    introgression = []
    # Look up the population ids once; the original wrapped these two lines in a redundant loop
    source_id = [p.id for p in ts.populations() if p.metadata['name']==src_id][0]
    target_id = [p.id for p in ts.populations() if p.metadata['name']==tgt_id][0]
    # One (initially empty) tract list per sample node in the target population
    for i in range(ts.num_samples):
        node = ts.node(i)
        if node.population == target_id: tracts[node.id] = []
    # In reverse time, lineages that migrate into the source population carry introgressed material
    for m in ts.migrations():
        if m.dest == source_id: introgression.append(m)
    for i in introgression:
        for t in ts.trees():
            # Trees are sorted by the left ends of their intervals,
            # so trees that do not overlap the interval of i can be skipped.
            if i.left >= t.interval.right: continue
            if i.right <= t.interval.left: break # [l, r)
            for n in tracts.keys():
                # Clip the migration interval to the tree interval
                left = i.left if i.left > t.interval.left else t.interval.left
                right = i.right if i.right < t.interval.right else t.interval.right
                if t.is_descendant(n, i.node): tracts[n].append([1, int(left), int(right), f'tsk_{ts.node(n).individual}_{int(n%ploidy+1)}'])
    return tracts
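The `left`/`right` lines in the function above intersect the migration interval with the tree interval, both treated as half-open `[l, r)`. A minimal standalone sketch of that step follows; `clip` is a hypothetical helper name used only here and is not part of sstar or tskit.

```python
def clip(mig_left, mig_right, tree_left, tree_right):
    """Return the overlap of two half-open intervals, or None if they are disjoint."""
    left = max(mig_left, tree_left)
    right = min(mig_right, tree_right)
    # Intervals that only touch (left == right) share no base pair.
    return (left, right) if left < right else None

# A migration segment [10, 50) seen against three trees:
print(clip(10, 50, 0, 20))   # tree starts before the segment
print(clip(10, 50, 20, 40))  # tree lies entirely inside the segment
print(clip(10, 50, 50, 60))  # tree starts where the segment ends
```

The `None` case corresponds to the `continue`/`break` checks in the function, which skip trees that do not overlap the migration record.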
Yes, I followed your code and got reasonable results. The results were small intervals, so I merged them.
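import numpy as np

def combine_segs(segs_dict):
    combined_segs = {}
    for node, segs in segs_dict.items():
        if not segs:
            continue  # nodes without any tract have no label to read
        tgt_id = segs[0][3]
        # keep only the (left, right) columns
        segs_np = np.array(segs)[:,1:3].astype(int)
        merged = np.empty([0, 2],dtype=np.int64)
        # sort by the left end (column 0); sorting by the right end can drop
        # the start of a segment that contains the previous one
        sorted_segs = segs_np[np.argsort(segs_np[:, 0]), :]
        for higher in sorted_segs:
            if len(merged) == 0:
                merged = np.vstack([merged, higher])
            else:
                lower = merged[-1, :]
                if higher[0] <= lower[1]:
                    # overlapping or abutting: extend the last merged segment
                    upper_bound = max(lower[1], higher[1])
                    merged[-1, :] = (lower[0], upper_bound)
                else:
                    merged = np.vstack([merged, higher])
        combined_segs[tgt_id] = merged.tolist()
    return combined_segs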
The output looks like this:
{'tsk_10_1': [[4752671, 4822156],
[5515440, 5577464],
[5656534, 5782219],
[5786697, 5912558],
[9751616, 9776781]],
'tsk_10_2': [[1081574, 1155350],
[5515440, 5646920],
[5787124, 5848448],
[6489042, 6508794]]}
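For readers without numpy at hand, the merging step can also be sketched in plain Python. `merge_intervals` is a hypothetical helper written for this illustration, and the input uses three abutting intervals that appear in the original-code list in this thread; whether those exact pieces belong to this haplotype is an assumption.

```python
def merge_intervals(intervals):
    """Merge overlapping or abutting half-open [left, right) intervals."""
    merged = []
    # Sorting tuples orders them by left end first, which the merge relies on.
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            # Overlaps or touches the last merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return merged

pieces = [(5546817, 5577464), (5515440, 5536479), (5536479, 5546817)]
print(merge_intervals(pieces))
```

The three pieces collapse into the single interval `[5515440, 5577464]`.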
The results of the original code are:
[(3194731, 3215569),
(990262, 996713),
(4815364, 4822156),
(2881951, 2913192),
(5546817, 5577464),
(5577464, 5602289),
(5607978, 5640348),
(8966597, 8969216),
(5885349, 5912558),
(4783723, 4799208),
(7249153, 7268325),
(7268325, 7269434),
(7269434, 7273435),
(7273435, 7274846),
(7274846, 7329602),
(4799208, 4815364),
(5640348, 5646920),
(914195, 990262),
(5536479, 5546817),
(1382099, 1386707),
(3122723, 3126090),
(8966176, 8966597),
(5515440, 5536479),
(8969216, 8996656),
(4881243, 4914243),
(7085758, 7107298),
(7199256, 7269434),
(7269434, 7281135),
(2653571, 2654761),
(3836413, 3870191),
(5604596, 5607978),
(7435721, 7446997),
(7482554, 7600984),
(3812997, 3823362),
(408670, 513053),
(618711, 641597),
(831758, 914195),
(3800141, 3812997),
(3782819, 3800141),
(3823362, 3825790),
(3825790, 3836413),
(2878930, 2881951),
(6489042, 6508794),
(5656534, 5782219),
(5602289, 5604596),
(1212642, 1225440),
(1225440, 1242849),
(5786697, 5787124),
(5787124, 5848448),
(5848448, 5885349),
(2263423, 2292123),
(9751616, 9776781),
(9999209, 10000000),
(3193501, 3194731),
(7239338, 7249153),
(9111095, 9138914),
(9138914, 9171257),
(3119511, 3122723),
(7329602, 7348552),
(3079017, 3110135),
(3870191, 3875974),
(7137745, 7150696),
(4752671, 4783723),
(1081574, 1155350),
(1242849, 1252345),
(1371432, 1382099),
(593168, 616656)]
The total length is 10**7 bp. The proportion of introgressed segments is 0.08193445 with the original code and 0.0347276 with the new code.
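The figure for the new code can be reproduced from the merged output posted above, assuming it is the mean over the two haplotypes shown. This is only a sketch on the posted numbers; `SEQ_LEN` and the variable names are illustrative.

```python
# Merged tracts per haplotype, copied from the output posted in this thread.
merged = {
    "tsk_10_1": [[4752671, 4822156], [5515440, 5577464], [5656534, 5782219],
                 [5786697, 5912558], [9751616, 9776781]],
    "tsk_10_2": [[1081574, 1155350], [5515440, 5646920], [5787124, 5848448],
                 [6489042, 6508794]],
}
SEQ_LEN = 10**7  # total simulated length in bp, as stated in the thread

# Fraction of the sequence covered by introgressed tracts, per haplotype.
proportions = {
    hap: sum(right - left for left, right in segs) / SEQ_LEN
    for hap, segs in merged.items()
}
# Assumption: the reported figure is the mean over the two haplotypes.
mean_proportion = sum(proportions.values()) / len(proportions)
print(proportions)
print(round(mean_proportion, 7))
```

On these numbers the per-haplotype proportions are 0.040822 and 0.0286332, and their mean is 0.0347276, which matches the value reported for the new code.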
In this function, all the introgression segments from the "src" are recorded. In fact, not all the segments from the "src" are the introgression segments, some of them may not go into the target segments. Also, "tgt_id" is not used. This bug may occur under complex demographic models such as the 2-src models.
In the 2src simulation (HumanNeanderthalDenisovan), the src samples are NeaA and DenA. But in 'get_tracts', the src become "Nea1", "Den1" and "Den2". Maybe there is something wrong.