LauritsSkov / Introgression-detection

MIT License
36 stars 12 forks source link

length + reintroduced alleles #4

Closed SillySabertooth closed 2 months ago

SillySabertooth commented 1 year ago

Hello!

First of all, thanks for the excellent tool and explicit and clear workflow explanation! It was a joy to run this pipeline! I have a few minor comments:

  1. When I input accidentally not phased data and put the -haploid option, it worked well without any exception, the tool found hap1 and hap2. Maybe you could put a simple stop execute exception for the same scattered people as me :moyai:

  2. I see that you have 1Kb gap between Archaic and Human segments, while the length has end-start+1Kb, I guess meaning that the 1Kb space falls into the previous region since you count SNPs here. I think I can feel some logic behind it, but anyway it's confusing - having a length longer that the region itself. I read the workflow some time ago, and I can't recall now whether it was some information about this, so it might be my bad though :upside_down_face: This place is here https://github.com/LauritsSkov/Introgression-detection/blob/f88425ef96e68b7c1af26ff513bca4c285fe8d77/src/hmm_functions.py#L224

    chrom  start     end       length    state    mean_prob  snps
    chr1   0         7233000   7234000   Human    0.9995     287
    chr1   7234000   7246000   13000     Archaic  0.90427    9
    chr1   7247000   21618000  14372000  Human    0.99946    610
    chr1   21619000  21673000  55000     Archaic  0.9697     22
  3. I was comparing my introgression map haplotypes and inferred by this tool, and I see a mismatch in the number of archaic SNPs of my and your annotation in some regions, even examples where I have some archaic variants, and your annotation have 0 (but overall it's extra minor cases). It stands I guess on the fact that you don't base the analysis on any reintroduced alleles, but due to higher SNP density you anyway gather such regions and mark as Archaic. And next, you can't annotate these regions with archaic alleles when these archaic positions are ancestral reintroduced since they are not present in an observation sample file. I don't sure 100%, but for instance, I found a region with 10 SNPs with 0.1 MAF forming haplotypes, all of them not present in YRI but present in Archaics, and these archaic alleles are ancestral. And when I looked into a person's haplotype in this region that is marked as Archaic by your tool, I see that in "observed" SNPs there are not any of these 10 SNPs even though they fall into the region. While the rest (30 SNPs) are not present in Archaics, so in annotation I got 0.

(I apologize I have a headache today, hope the text above makes sense)

David-Peede commented 8 months ago

@SillySabertooth this pull request should resolve your point second issue!

LauritsSkov commented 2 months ago

Hello! Sorry for the late reply (better late than never!)

  1. No sorry this relies on the user remembering if the input is phased or not :P
  2. I have updated the start and end coordinates so end now is 1000 bp larger!
  3. Yeah hmmix does not consider reintroduced alleles nor does it consider what the MAF is in the outgroup. One would have to rewrite the program to allow for that!