PacificBiosciences / HiPhase

Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads
Other
70 stars 4 forks source link

Inconsistent Haplotype #42

Closed hangsuUNC closed 3 months ago

hangsuUNC commented 4 months ago

Hi Matt,

I have a question about the phasing of multiallelic sites. Hiphase is capable to phase multiallelic sites. Does this mean it address the consistency of the final haplotypes, say using bcftools consensus there will be no overlapping variants in a single haplotype?

In the past a few months, we are trying to jointly phase small variants and SVs in a cohort. We are using integrated SV calls and a DeepVariant callset as inputs. We found that Hiphase generate amount of inconsistent haplotypes, i.e. overlapping variants in a single haplotype based on phasing. For example, before hiphase:

`chr1 100326507 chr1-100326508-INS-72 T TTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCTTTCTTTCCTTCTTTCTTTCCTTCC 1 PASS AD_ALL=4;AD_NON_ALT=6;CALIBRATION_SENSITIVITY=0.3791;CollapseId=6470.5;GQ=48;HOM_REF=0,39;HOM_TIG=0,47;ID=chr1-100326508-INS-72;NumCollapsed=8;NumConsolidated=13;QUERY_STRAND=-;SCORE=0.8911;SQ=89;SUPP_PAV=1;SUPP_PBSV=1;SUPP_SNIFFLES=1;SVLEN=72;SVTYPE=INS;TIG_REGION=h1tg011111l:37346-37417;calibration;extracted;training;DP=98;AC=1;AN=2 GT:SQ:GQ:PG:DP:AD:ZS:SS:SCORE:CALIBRATION_SENSITIVITY:SUPP_PBSV:SUPP_SNIFFLES:SUPP_PAV 1|0:58:58:1001:10:7,3:100,100:98,95:0.8225:0.6342:.:.:1

chr1 100326507 chr1-100326508-INS-76 T TTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCTTTCTTTCCTTCTTTCTTTCCTTCCTTCC 1 PASS AD_ALL=2;AD_NON_ALT=0;CALIBRATION_SENSITIVITY=0.3087;CollapseId=6470.3;GQ=5;HOM_REF=0,39;HOM_TIG=0,47;ID=chr1-100326508-INS-76;NumCollapsed=6;NumConsolidated=10;QUERY_STRAND=+;SCORE=0.9063;SQ=61;SUPP_PAV=1;SUPP_PBSV=1;SUPP_SNIFFLES=1;SVLEN=76;SVTYPE=INS;TIG_REGION=h1tg023887l:3682-3757;DP=128;AC=1;AN=2 GT:SQ:GQ:PG:DP:AD:ZS:SS:SCORE:CALIBRATION_SENSITIVITY:SUPP_PBSV:SUPP_SNIFFLES:SUPP_PAV 0|1:100:12:1001:10:3,7:100,100:98,95:0.8626:0.4657:1:1:1`

After Hiphase:

chr1 100326507 chr1-100326508-INS-72 T TTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCTTTCTTTCCTTCTTTCTTTCCTTCC 1 PASS AD_ALL=4;AD_NON_ALT=6;CALIBRATION_SENSITIVITY=0.3791;CollapseId=6470.5;GQ=48;HOM_REF=0,39;HOM_TIG=0,47;ID=chr1-100326508-INS-72;NumCollapsed=8;NumConsolidated=13;QUERY_STRAND=-;SCORE=0.8911;SQ=89;SUPP_PAV=1;SUPP_PBSV=1;SUPP_SNIFFLES=1;SVLEN=72;SVTYPE=INS;TIG_REGION=h1tg011111l:37346-37417;calibration;extracted;training;DP=105252;AN=2;AC=1 GT:SQ:GQ:PG:DP:AD:ZS:SS:SCORE:CALIBRATION_SENSITIVITY:SUPP_PBSV:SUPP_SNIFFLES:SUPP_PAV:PS 1|0:58:58:1001:10:7,3:100,100:98,95:0.8225:0.6342:.:.:1:100000723

chr1 100326507 chr1-100326508-INS-76 T TTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCCTTCTTTCTTTCCTTCTTTCTTTCCTTCCTTCC 1 PASS AD_ALL=2;AD_NON_ALT=0;CALIBRATION_SENSITIVITY=0.3087;CollapseId=6470.3;GQ=5;HOM_REF=0,39;HOM_TIG=0,47;ID=chr1-100326508-INS-76;NumCollapsed=6;NumConsolidated=10;QUERY_STRAND=+;SCORE=0.9063;SQ=61;SUPP_PAV=1;SUPP_PBSV=1;SUPP_SNIFFLES=1;SVLEN=76;SVTYPE=INS;TIG_REGION=h1tg023887l:3682-3757;DP=137472;AN=2;AC=1 GT:SQ:GQ:PG:DP:AD:ZS:SS:SCORE:CALIBRATION_SENSITIVITY:SUPP_PBSV:SUPP_SNIFFLES:SUPP_PAV:PS 1|0:100:12:1001:10:3,7:100,100:98,95:0.8626:0.4657:1:1:1:100000723

These two alternative alleles are phased into the same haplotype. Is this a behavior that expected by Hiphase? if not, can you provide some suggestions how to deal with these inconsistent phasing results?

Thanks,

Hang Su

holtjma commented 4 months ago

Hi Hang Su,

There may be a couple issues here, so I'd like to clarify some things:

  1. It seems like you are showing an input VCF with phasing information already provided. I strongly recommend removing that from the VCF prior to running HiPhase. HiPhase does not strip / correct that information automatically, so you can create VCF files with nonsensical information if you do not clean the input beforehand.
  2. To get the core of the issue: I think you are asking if HiPhase will force overlapping alleles onto different haplotypes, and the short answer there is "no". I should note that we do not want it to force them onto different haplotypes due to the exact situation you're showing above. If you look at the variants themselves, these are two overlapping and highly similar variant calls that presumably came from different callers. We see this all the time when we combine DeepVariant with pbsv calls. They are looking at the exact same event, but the internal mechanics generate slightly different forms for the ALT alleles (in this case, there is an extra copy of the TTCC in the second form). I would recommend looking at the reads in the BAM file after phasing, and I think you will find that one allele (haplotype1 from the above output) has inserted sequence at this location with some variability in the length, and the other allele (haplotype2) does not have either of these insertion events.

Let me know if you have any follow up questions, Matt

rlorigro commented 4 months ago

Hi,

For part 1: Do you recommend simply switching all input GTs from phased | to unphased /? Will any of the variants be unused by Hiphase? If so, can we assume that any remaining / GTs in the output VCF are not useful?

For part 2: Are all of the incompatible variants considered equally well supported by the reads? How would you recommend chaining together a single haplotype for each phase in the output of HiPhase?

holtjma commented 4 months ago
  1. Essentially yes, but you’ll want to make sure things like 1|0 get converted to 0/1. I’m unsure if bcftools has automatic functions for this.
  2. That’s something that you would need to investigate with some other tooling. HiPhase will implicitly trust the calls that are provided. If a variant is unphased after running HiPhase, it means that the best solution left that variant in a homozygous state (could be REF or ALT), and so we left it as unphased.

Matt

holtjma commented 3 months ago

Going to close this for now, feel free to re-open if there are additional related questions / clarifications!