PacificBiosciences / HiPhase

Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads

Questions on input VCFs #53

Closed Han-Cao closed 1 day ago

Han-Cao commented 2 days ago

Hi,

I would like to phase a few samples that have both PacBio HiFi (30X) and Illumina NGS (30X) data. I used the HiFi data to generate partially phased assemblies and call variants, resulting in a partially phased VCF of SNPs, INDELs, and SVs. To polish this assembly-based VCF, I want to use HiPhase to phase it further with alignment-based data. In this case, I am wondering which input VCF is better for HiPhase:

SNP/INDEL VCF:

  1. per-sample VCF called by DeepVariant using PacBio HiFi / Illumina data
  2. Population VCF joint-called by GLnexus, then split into per-sample VCFs

SV VCF:

  1. per-sample VCF with SNPs, INDELs, SVs called from assembly (SNPs and INDELs can be slightly different from the above VCF)
  2. per-sample VCF with only SVs called from assembly

Moreover, for all of the VCFs, should I first filter the variants (e.g., by GQ or HWE)? Can HiPhase leverage the phased genotypes from the input VCF, or does it treat all genotypes as unphased?

Thank you very much!

holtjma commented 2 days ago

Hi @Han-Cao,

> In this case, I am wondering which input VCF is better for HiPhase.

The short answer is to pick the one with the best variant calls.

HiPhase will work best on high-quality variant calls, both in terms of base-level sequence accuracy (mostly for indels) and genotype calls. We have not looked at all of the options you mentioned, so I can't comment on each specific combination. For our single-sample pipeline, we usually recommend DeepVariant (SNV/indel), pbsv or sawfish (SVs), and TRGT (STRs). We have done some testing with joint-called VCFs, and those tended to phase slightly better simply because the joint calls were slightly better than the individual calls. Ultimately, it didn't make a huge difference, but I could see someone caring about that difference.

In general, I recommend selecting the option that gives you the best variant call accuracy for an acceptable computational cost. I can't tell you which option that is for your specific use case, but that's the way I would think about this question.

> Moreover, for all of the VCFs, should I first filter the variants (e.g., by GQ or HWE)?

There is some filtering built into the tool (https://github.com/PacificBiosciences/HiPhase/blob/main/docs/user_guide.md#input-filtering), and we have defaults in place that work well for the recommended human workflows.

As with my previous answer, HiPhase works best when you give it the best variants. You would need to evaluate this yourself for your application to determine if the extra filters are beneficial to you or not. I will note that over-filtering can also have a negative impact on phase block length.
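One way to run the evaluation described above is to produce a filtered copy of the VCF and compare phasing results against the unfiltered one. The following is a minimal sketch of a GQ filter in plain Python, not part of HiPhase. The function names and the cutoff of 20 are hypothetical, and it assumes a single-sample, uncompressed VCF with GQ in the FORMAT column.

```python
def passes_gq(vcf_line: str, min_gq: float) -> bool:
    """Return True if the first sample's GQ is >= min_gq.

    Records with no sample column, no GQ key, or a missing GQ value are kept,
    so the filter never drops a record it cannot evaluate.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return True
    keys = fields[8].split(":")      # FORMAT column, e.g. GT:GQ
    values = fields[9].split(":")    # first sample column
    if "GQ" not in keys:
        return True
    idx = keys.index("GQ")
    if idx >= len(values) or values[idx] == ".":
        return True
    return float(values[idx]) >= min_gq


def filter_vcf_lines(lines, min_gq):
    """Yield header lines unchanged and only the records that pass the cutoff."""
    for line in lines:
        if line.startswith("#") or passes_gq(line, min_gq):
            yield line


# Hypothetical cutoff; the thread does not recommend a specific value.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:GQ\t0/1:45\n",
    "chr1\t200\t.\tC\tT\t12\tPASS\t.\tGT:GQ\t0/1:8\n",
]
kept = list(filter_vcf_lines(example, min_gq=20))
```

Counting how many records each cutoff removes, alongside the resulting phase block lengths, gives a rough picture of where filtering starts to hurt.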

> Can HiPhase leverage the phased genotypes from the input VCF, or does it treat all genotypes as unphased?

HiPhase will ignore (treat as unphased) and strip out any existing phase information as of v1.4.5.
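To make "treat as unphased" concrete at the VCF level: a phased genotype such as `0|1` becomes `0/1`, and any phase set (`PS`) tag no longer applies. The sketch below illustrates that transformation on a single record. It is not HiPhase's implementation, the function name is hypothetical, and whether HiPhase handles `PS` exactly this way is an assumption.

```python
def strip_phase(vcf_line: str) -> str:
    """Return the record with every sample's GT unphased and the PS field dropped.

    Illustration only; header lines and records without sample columns
    are returned unchanged.
    """
    if vcf_line.startswith("#"):
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return vcf_line
    keys = fields[8].split(":")
    keep = [i for i, key in enumerate(keys) if key != "PS"]
    fields[8] = ":".join(keys[i] for i in keep)
    for col in range(9, len(fields)):
        values = fields[col].split(":")
        out = []
        for i in keep:
            if i >= len(values):
                break  # trailing FORMAT values may be omitted in a VCF
            value = values[i]
            # '|' marks a phased genotype, '/' an unphased one
            out.append(value.replace("|", "/") if keys[i] == "GT" else value)
        fields[col] = ":".join(out)
    return "\t".join(fields) + newline
```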

Matt

Han-Cao commented 1 day ago

Hi Matt,

Thank you so much for the suggestions. I will try different VCFs and evaluate them. But if the existing phase information is ignored, I think I will not include SNPs/INDELs in the SV VCF.

Best, Han

holtjma commented 1 day ago

Just to clarify, you will need to include SNPs/indels in at least one of your input VCFs to get good phase blocks.
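A quick sanity check before phasing is to count the variant classes in each input VCF and confirm that SNVs/indels are present in at least one of them. The sketch below is a rough classifier in plain Python, not a HiPhase feature. The function names are hypothetical, and the 50 bp cutoff separating indels from SVs is an assumed convention rather than something stated in this thread.

```python
from collections import Counter

SV_MIN_LEN = 50  # assumed size cutoff between indels and SVs


def classify(ref: str, alt: str) -> str:
    """Roughly classify one REF/ALT pair as SNV, MNV, indel, SV, or other."""
    if alt in ("*", "."):
        return "other"
    # Symbolic alleles (<DEL>, <INS>, ...) and breakend notation are treated as SVs.
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    if abs(len(ref) - len(alt)) >= SV_MIN_LEN:
        return "SV"
    return "indel"


def count_classes(lines) -> Counter:
    """Count variant classes over all ALT alleles in an iterable of VCF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            counts[classify(ref, alt)] += 1
    return counts
```

If the counts across all input VCFs show no SNVs or indels, the inputs do not meet the requirement stated above.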

Matt