Hi @Han-Cao,
In this case, I am wondering which input VCF is better for HiPhase.
The short answer is to pick the one with the best variant calls.
HiPhase works best on high-quality variant calls, both in terms of base-level sequence accuracy (mostly for indels) and genotype calls. We have not looked at all of the options you mentioned, so I can't comment on each specific combination. For our single-sample pipeline, we usually recommend DeepVariant (SNV/indel), pbsv or sawfish (SVs), and TRGT (STRs). We have done some testing with joint-called VCFs, and the phasing results tended to be slightly better simply because the joint calls were slightly more accurate than the single-sample calls. Ultimately, it didn't make a huge difference, but I could see someone caring about that difference.
In general, I recommend selecting the option that gives you the best variant call accuracy for an acceptable computational cost. I can't tell you which option that is for your specific use case, but that's the way I would think about this question.
Moreover, for all the VCFs, should I first filter the variants (e.g., by GQ, HWE)?
There is some filtering built into the tool (https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#input-filtering), and we have defaults in place that work well for the recommended human workflows.
As with my previous answer, HiPhase works best when you give it the best variants. You would need to evaluate this yourself for your application to determine if the extra filters are beneficial to you or not. I will note that over-filtering can also have a negative impact on phase block length.
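If you want to compare a pre-filtered VCF against an unfiltered one, a minimal sketch of a GQ pre-filter might look like the following. It is plain Python on uncompressed, single-sample VCF lines; the function name `filter_by_gq` and the threshold of 20 are placeholders for illustration, not HiPhase defaults or recommendations, and this is separate from the filtering built into the tool.

```python
def filter_by_gq(vcf_lines, min_gq):
    """Keep header lines and records whose first sample has GQ >= min_gq."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 9 (index 8) is FORMAT, column 10 (index 9) is the first sample.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        gq = sample.get("GQ", ".")
        # Assumption for this sketch: records with a missing GQ are dropped.
        if gq != "." and float(gq) >= min_gq:
            kept.append(line)
    return kept


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:45\n",
    "chr1\t200\t.\tC\tT\t12\tPASS\t.\tGT:GQ\t0/1:8\n",
    "chr1\t300\t.\tG\tGA\t30\tPASS\t.\tGT:GQ\t1/1:20\n",
]
kept = filter_by_gq(example, min_gq=20)
records = [l for l in kept if not l.startswith("#")]
print(f"{len(records)} of 3 records kept")
```

Counting how many records survive each threshold, and then comparing phase block lengths on the resulting outputs, is one way to see whether a filter is helping or over-filtering.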
Can HiPhase leverage the phased genotypes from the input VCF or just treat all genotypes as unphased?
As of v1.4.5, HiPhase ignores any existing phase information in the input VCF (all genotypes are treated as unphased) and strips it out.
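To illustrate what "treated as unphased" means for a single sample field, here is a conceptual sketch. This is not HiPhase's code; the helper name `unphase_genotype` is hypothetical, and dropping the `PS` tag is an assumption about what counts as "existing phase information".

```python
def unphase_genotype(sample_field, format_keys):
    """Return (new_format_keys, new_sample_field) with phasing removed."""
    out_keys, out_values = [], []
    for key, value in zip(format_keys, sample_field.split(":")):
        if key == "PS":
            continue  # assumption: the phase-set tag is discarded
        if key == "GT":
            value = value.replace("|", "/")  # phased separator -> unphased
        out_keys.append(key)
        out_values.append(value)
    return out_keys, ":".join(out_values)


keys, field = unphase_genotype("0|1:30:12345", ["GT", "GQ", "PS"])
print(keys, field)
```

In other words, a genotype such as `0|1` from an assembly-based caller carries no more weight than `0/1` when it enters phasing.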
Matt
Hi Matt,
Thank you so much for the suggestions. I will try the different VCFs and evaluate the results. But if the existing phase information is ignored, I think I will leave the SNPs/indels out of the SV VCF.
Best, Han
Just to clarify, you will need to include SNPs/indels in at least one of your input VCFs to get good phase blocks.
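A quick sanity check along these lines is to count the small variants in each input VCF before running. The sketch below is plain Python on uncompressed VCF lines; the name `count_small_variants` is hypothetical, and the 50 bp cutoff between indels and SVs is a common convention assumed here, not a HiPhase setting.

```python
def count_small_variants(vcf_lines, max_len=50):
    """Count records whose REF and ALT alleles are all plain sequence under max_len bp."""
    bases = set("ACGTN")
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        alleles = [fields[3]] + fields[4].split(",")
        # Symbolic alleles (<DEL>), breakends, "." and "*" fail the base check.
        if all(set(a.upper()) <= bases and len(a) < max_len for a in alleles):
            count += 1
    return count


example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n",
    "chr1\t500\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=900\tGT\t0/1\n",
    "chr1\t1000\t.\tA\tA" + "T" * 60 + "\t.\tPASS\t.\tGT\t0/1\n",
    "chr1\t2000\t.\tGAT\tG\t.\tPASS\t.\tGT\t1/1\n",
]
n_small = count_small_variants(example)
print(n_small)
```

If every input VCF reports zero here, the run has only SVs to work with.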
Matt
Hi,
I would like to phase a few samples with both PacBio HiFi (30X) and Illumina NGS data (30X). I used the HiFi data to generate partially phased assemblies and call variants, resulting in a partially phased VCF of SNPs, indels, and SVs. To polish this assembly-based VCF, I want to use HiPhase to further phase it with alignment-based data. In this case, I am wondering which input VCF is better for HiPhase:
SNP/INDEL vcf:
SV vcf:
Moreover, for all the VCFs, should I first filter the variants (e.g., by GQ, HWE)? Can HiPhase leverage the phased genotypes from the input VCF or just treat all genotypes as unphased?
Thank you very much!