First of all, thanks for the excellent tool and explicit and clear workflow explanation! It was a joy to run this pipeline!
I have a few minor comments:
When I input accidentally not phased data and put the -haploid option, it worked well without any exception, the tool found hap1 and hap2. Maybe you could put a simple stop execute exception for the same scattered people as me :moyai:
I see that you have 1Kb gap between Archaic and Human segments, while the length has end-start+1Kb, I guess meaning that the 1Kb space falls into the previous region since you count SNPs here. I think I can feel some logic behind it, but anyway it's confusing - having a length longer that the region itself. I read the workflow some time ago, and I can't recall now whether it was some information about this, so it might be my bad though :upside_down_face:
This place is here https://github.com/LauritsSkov/Introgression-detection/blob/f88425ef96e68b7c1af26ff513bca4c285fe8d77/src/hmm_functions.py#L224
chrom start end length state mean_prob snps
chr1 0 7233000 7234000 Human 0.9995 287
chr1 7234000 7246000 13000 Archaic 0.90427 9
chr1 7247000 21618000 14372000 Human 0.99946 610
chr1 21619000 21673000 55000 Archaic 0.9697 22
I was comparing my introgression map haplotypes and inferred by this tool, and I see a mismatch in the number of archaic SNPs of my and your annotation in some regions, even examples where I have some archaic variants, and your annotation have 0 (but overall it's extra minor cases). It stands I guess on the fact that you don't base the analysis on any reintroduced alleles, but due to higher SNP density you anyway gather such regions and mark as Archaic. And next, you can't annotate these regions with archaic alleles when these archaic positions are ancestral reintroduced since they are not present in an observation sample file.
I don't sure 100%, but for instance, I found a region with 10 SNPs with 0.1 MAF forming haplotypes, all of them not present in YRI but present in Archaics, and these archaic alleles are ancestral. And when I looked into a person's haplotype in this region that is marked as Archaic by your tool, I see that in "observed" SNPs there are not any of these 10 SNPs even though they fall into the region. While the rest (30 SNPs) are not present in Archaics, so in annotation I got 0.
(I apologize I have a headache today, hope the text above makes sense)
Hello!
Sorry for the late reply (better late than never!)
No sorry this relies on the user remembering if the input is phased or not :P
I have updated the start and end coordinates so end now is 1000 bp larger!
Yeah hmmix does not consider reintroduced alleles nor does it consider what the MAF is in the outgroup. One would have to rewrite the program to allow for that!
Hello!
First of all, thanks for the excellent tool and explicit and clear workflow explanation! It was a joy to run this pipeline! I have a few minor comments:
When I input accidentally not phased data and put the -haploid option, it worked well without any exception, the tool found hap1 and hap2. Maybe you could put a simple stop execute exception for the same scattered people as me :moyai:
I see that you have 1Kb gap between Archaic and Human segments, while the length has end-start+1Kb, I guess meaning that the 1Kb space falls into the previous region since you count SNPs here. I think I can feel some logic behind it, but anyway it's confusing - having a length longer that the region itself. I read the workflow some time ago, and I can't recall now whether it was some information about this, so it might be my bad though :upside_down_face: This place is here https://github.com/LauritsSkov/Introgression-detection/blob/f88425ef96e68b7c1af26ff513bca4c285fe8d77/src/hmm_functions.py#L224
I was comparing my introgression map haplotypes and inferred by this tool, and I see a mismatch in the number of archaic SNPs of my and your annotation in some regions, even examples where I have some archaic variants, and your annotation have 0 (but overall it's extra minor cases). It stands I guess on the fact that you don't base the analysis on any reintroduced alleles, but due to higher SNP density you anyway gather such regions and mark as Archaic. And next, you can't annotate these regions with archaic alleles when these archaic positions are ancestral reintroduced since they are not present in an observation sample file. I don't sure 100%, but for instance, I found a region with 10 SNPs with 0.1 MAF forming haplotypes, all of them not present in YRI but present in Archaics, and these archaic alleles are ancestral. And when I looked into a person's haplotype in this region that is marked as Archaic by your tool, I see that in "observed" SNPs there are not any of these 10 SNPs even though they fall into the region. While the rest (30 SNPs) are not present in Archaics, so in annotation I got 0.
(I apologize I have a headache today, hope the text above makes sense)