PacificBiosciences / HiPhase

Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads

how to deal with unphased variants in the output #44

Closed WeiCSong closed 3 months ago

WeiCSong commented 3 months ago

Hi, I'm wondering whether we could fill in the missing phasing information in the HiPhase output. Since there aren't any reads supporting the phase of these variants, I think we could only phase them with statistical methods. Current mainstream phasing tools seem unable to accept a partially phased VCF as input, and I would like to learn from your experience with this task. Thanks for your help!

holtjma commented 3 months ago

Hi @WeiCSong,

I'm going to try to answer the questions here; I think there may be a few mixed together.

> I'm wondering whether we could fill in the missing phasing information in the HiPhase output.

There are two main reasons that HiPhase will leave a variant unphased: (1) there are no reads spanning the variant or (2) the phasing does not support a heterozygous variant at that location. HiPhase is a read-backed phaser, so (1) almost never happens because variant calling is from the reads. For (2), this usually happens in regions of the genome with poor mapping and/or variant calling, leading to lower quality variants going into HiPhase. When HiPhase identifies a variant that works better as homozygous (could be REF or ALT), it will leave it unphased in the output.
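To see how many variants fall into this category, one option is to count phased and unphased heterozygous genotypes in the output VCF. The sketch below is only an illustration: it assumes an uncompressed VCF read as text lines and relies on the VCF convention that `|` marks a phased genotype and `/` an unphased one. The function name `classify_genotypes` and the example records are made up for this example and are not part of HiPhase.

```python
from collections import Counter


def classify_genotypes(vcf_lines, sample_index=0):
    """Count phased vs. unphased heterozygous calls for one sample.

    Hypothetical helper, not part of HiPhase. Expects plain-text VCF lines.
    """
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        format_keys = fields[8].split(":")
        sample_values = fields[9 + sample_index].split(":")
        gt = dict(zip(format_keys, sample_values)).get("GT", ".")
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            counts["missing"] += 1
        elif len(set(alleles)) == 1:
            counts["homozygous"] += 1
        elif "|" in gt:
            counts["phased_het"] += 1
        else:
            counts["unphased_het"] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:PS\t0|1:100",
    "chr1\t250\t.\tC\tT\t50\tPASS\t.\tGT:PS\t1|0:100",
    "chr1\t300\t.\tG\tA\t20\tPASS\t.\tGT\t0/1",
    "chr1\t400\t.\tT\tC\t60\tPASS\t.\tGT\t1/1",
]
counts = classify_genotypes(example)
print(dict(counts))
```

The `unphased_het` count is the set of variants this thread is about: heterozygous calls that went into the phaser but came out with a `/` genotype.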

> Since there aren't any reads supporting the phase of these variants, I think we could only phase them with statistical methods. Current mainstream phasing tools seem unable to accept a partially phased VCF as input, and I would like to learn from your experience with this task.

As I mentioned earlier, HiPhase is intended to be a read-backed phaser, so we do not have any plans at the moment to extend the tool into statistical phasing. That said, my understanding is that SHAPEIT4 is capable of accepting a phased VCF file as input. That feature was made for WhatsHap inputs, not HiPhase, and I don't have any data on the accuracy of the combination. Still, it may be a place to start if you want to go down the statistical phasing route.

Matt

hangsuUNC commented 3 months ago

Yes, we used HiPhase to first physically phase small and structural variants, then merged them and gave the result to SHAPEIT4. Our preliminary data show that physical phasing does improve statistical phasing accuracy in terms of switch error rate.
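A rough sketch of what that hand-off to SHAPEIT4 might look like, for illustration only. It builds the command line without running it. The flag names (`--input`, `--map`, `--region`, `--use-PS`, `--thread`, `--output`) are taken from the SHAPEIT4 documentation as best recalled and should be checked against your installed version. The file names, region, thread count, and the `0.0001` error rate passed to `--use-PS` are assumptions, not settings reported in this thread, and `build_shapeit4_command` is a hypothetical helper.

```python
import shlex


def build_shapeit4_command(input_vcf, genetic_map, region, output_vcf,
                           ps_error_rate=0.0001, threads=4):
    """Assemble a SHAPEIT4 argument list that uses read-backed phase sets.

    Hypothetical helper; verify every flag against `shapeit4 --help`.
    """
    return [
        "shapeit4",
        "--input", input_vcf,                # merged, read-phased VCF (bgzipped, indexed)
        "--map", genetic_map,                # genetic map for the chromosome
        "--region", region,                  # region to phase
        "--use-PS", str(ps_error_rate),      # assumed error rate of the PS-based phase
        "--thread", str(threads),
        "--output", output_vcf,
    ]


# All file names below are placeholders.
cmd = build_shapeit4_command(
    "hiphase.merged.vcf.gz",
    "chr20.gmap.gz",
    "chr20",
    "chr20.shapeit4.vcf.gz",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; printing it first makes it easy to review the flags before committing to a long run.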

WeiCSong commented 3 months ago

Thanks for the helpful information, @holtjma and @hangsuUNC! I'll try the HiPhase + SHAPEIT4 pipeline.