dpuiu / MitoHPC

MIT License
10 stars 12 forks source link

haplocheck and haplogrep issues #14

Open squigzzz opened 8 months ago

squigzzz commented 8 months ago

there was a pretty significant change made to the code in the past 2 months which got haplocheck working properly, not sure what that was but thank you for that.

I wanted to drop some notes here based on my research, according to the haplocheck documentation https://mitoverse.readthedocs.io/haplocheck/haplocheck/, it requires vcf genotypes to be encoded as 0/1 for heteroplasmic sites and 1/1 for homoplasmic sites however, the way mutect2 is currently being implemented, its not using mitochondria mode which means every variant site is being encoded as homozygous or heterozygous as it is assuming these sequences are from the nuclear genome and are diploid. When you ingest this mutect2 generated vcf into haplocheck nearly every sample will come back as contaminated since every site is essentially heteroplasmic. In addition, within the mitoHPC publication itself within your figure 2 flowchart you mention how haplocheck should be used only with the rCRS reference however in the code implementation if you try to run the pipeline using rCRS instead of chrM it crashes with a samtools header error. As such I believe the mutect2 output should never be used with haplocheck, atleast not in its current implementation. Would appreciate any feedback, comments. Thanks.

More practically, I didnt observe any major differences when using the chrM reference as opposed to rCRS for calling variants using mutserve.

squigzzz commented 8 months ago

also as a follow up to this, I am curious as to why mitochondria mode is not being used with the mutect2 implementation you have currently embedded. There are several advantages to this pipeline, but perhaps the entire logic of a sample specific reference needs to be rebuilt.

dpuiu commented 8 months ago

Hi squigzzz and thank you so much for your feedback!

Sure, we will try to update the pipeline, set "GT=1/1" for the homoplasmic calls and further check the haplocheck functionality.

Tests on simulated data performed better when using plain "Mutect2" than "Mutect2 --mitochondria-mode" . The latter, would miss some of the simulated SNVs, it is why we did not use it.

However, the "mitochondria-mode" can easily be enabled by setting export HP_GOPT="--mitochondria-mode" in the "init.sh" file

squigzzz commented 8 months ago

Thank you for getting back to me so quickly, I greatly appreciate it. I am also curious whether in the current implementation therer is any filtering thats being done to the raw VCFs prior to ingesting them into haplocheck. Since some of the variant calls could be annotated as NUMTS or other sequencing artifacts. Would it be wise to perform some level of initial QC on the VCFs prior to ingesting them into haplocheck ?