PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0

Support for block gVCF #79

Open BinglanLi opened 2 years ago

BinglanLi commented 2 years ago

Add support for converting REF blocks to homozygous reference loci for gVCF.

See https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format for details.
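As a rough illustration of the request above, the sketch below expands a single reference-block record into per-position homozygous-reference records. It is not PharmCAT code. The function name `expand_ref_block` and the `wanted` lookup are hypothetical, and it assumes the block carries `END=` in its INFO column and a `GT` of `0/0`, as in GATK-style gVCFs. Because a block stores only its first REF base, the reference base for each position of interest has to be supplied by the caller.

```python
def expand_ref_block(record, wanted):
    """Expand one gVCF reference-block line into per-position 0/0 records.

    record: a tab-separated gVCF data line with END=<pos> in the INFO column.
    wanted: dict mapping (chrom, pos) -> reference base for positions of
            interest (hypothetical input supplied by the caller).
    """
    fields = record.rstrip("\n").split("\t")
    chrom, start = fields[0], int(fields[1])

    # Parse key=value pairs from INFO; a record without END covers one base.
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    end = int(info.get("END", start))

    # Only expand blocks whose sample genotype is homozygous reference.
    keys = fields[8].split(":")
    values = fields[9].split(":")
    genotype = values[keys.index("GT")] if "GT" in keys else None
    if genotype not in ("0/0", "0|0"):
        return []

    out = []
    for pos in range(start, end + 1):
        ref = wanted.get((chrom, pos))
        if ref is None:
            continue  # not a position we care about
        out.append("\t".join(
            [chrom, str(pos), ".", ref, ".", ".", ".", ".", "GT", "0/0"]
        ))
    return out
```

This only covers pure reference blocks. Records that carry real ALT alleles alongside `<NON_REF>`, and INDELs in particular, need separate handling.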

pmbock1 commented 2 years ago

Hey @BinglanLi, how exactly does the preprocessor determine whether the input VCF file is a gVCF? I use GATK HaplotypeCaller and GenotypeGVCFs to convert a BAM file into a gVCF and finally a VCF. The v1.3 preprocessor refuses to run on my VCF files and outputs the gVCF error. Any help would be appreciated. Thanks

BinglanLi commented 2 years ago

An input file is treated as a gVCF if "ALT=" is present in the headers. If you're sure NON_REF does not appear in the ALT column of your VCF, you can try updating the header lines after converting the gVCF to VCF. We will add gVCF support soon to handle reference blocks and "NON_REF" ALTs.
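A minimal sketch of that kind of check, for anyone who wants to inspect a file before running the preprocessor. It is not the preprocessor's actual code. The function name `looks_like_gvcf` is made up, and matching on `##ALT=` lines that mention `NON_REF` is an assumption based on the header GATK writes. It also reports whether any record really lists `<NON_REF>` in the ALT column, which is the condition to rule out before editing headers.

```python
def looks_like_gvcf(lines):
    """Return (header_flag, body_flag) for an iterable of VCF text lines.

    header_flag: a "##ALT=" header line mentions NON_REF.
    body_flag:   at least one record lists <NON_REF> in its ALT column.
    """
    header_flag = False
    body_flag = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            if line.startswith("##ALT=") and "NON_REF" in line:
                header_flag = True
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # ALT is the fifth tab-separated column of a VCF record.
            if len(fields) > 4 and "<NON_REF>" in fields[4].split(","):
                body_flag = True
    return header_flag, body_flag
```

A result of `(True, False)` would correspond to the situation described in this thread: the header is left over from the gVCF step, but no record uses `<NON_REF>`.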

krukanna commented 2 years ago

Do you already know the release date? I have version 1.2.1 and it works fine, but I would like to upgrade to a newer version. Or is there something I can do to avoid that error?

BinglanLi commented 2 years ago

Does the latest version work after you remove the ALT NON_REF header line from your VCF?

krukanna commented 2 years ago

It does, thanks
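For reference, the header workaround from this exchange can be sketched in a few lines. The function name `drop_non_ref_alt_header` is hypothetical, and matching on `##ALT=` lines that mention `NON_REF` is an assumption about how the header looks in GATK output.

```python
def drop_non_ref_alt_header(lines):
    """Yield VCF lines, skipping "##ALT=" header lines that mention NON_REF."""
    for line in lines:
        if line.startswith("##ALT=") and "NON_REF" in line:
            continue
        yield line
```

This is only appropriate when no record in the file has `<NON_REF>` in its ALT column. Otherwise the file still contains gVCF-style records and removing the header just hides that.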

YussAb commented 1 year ago

Dear @BinglanLi, it's not completely clear to me why gVCF files are not currently supported. What are the specific caveats of gVCF files? Is it possible to preprocess gVCFs to make them work?

Or is the only way at the moment to run variant calling as part of the pipeline, starting from the BAM file, with:

gatk --java-options "-Xmx4g" HaplotypeCaller \
  -R grc38.reference.fasta -I input.bam -O output.vcf \
  -L pharmcat_positions.vcf -ip 20 --output-mode EMIT_ALL_ACTIVE_SITES

Thank you in advance and best regards, Youssef

BinglanLi commented 1 year ago

gVCF is a fundamentally different format. We are planning on adding support for gVCF in the future. It's just not a priority at the moment given available solutions.

Going from gVCF to VCF is a non-trivial, multi-step process. A naive gVCF-to-VCF conversion will cause you to lose INDEL information.

Calling PGx positions directly from BAM files with GATK, as in the command you wrote, is the best and simplest solution.

If you have a VCF from WGS data and are confident that positions not present in your VCF are homozygous reference, you can use the option -0 or --missing-to-ref in the PharmCAT VCF Preprocessor to add those positions back to your VCF file as homozygous reference.
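Conceptually, that option amounts to something like the sketch below. This is an illustration of the idea only, not the preprocessor's implementation. The function name `fill_missing_as_ref` and the dictionary-based inputs are hypothetical.

```python
def fill_missing_as_ref(called, positions):
    """Return a genotype for every position of interest.

    called:    dict mapping (chrom, pos) -> genotype string from the input VCF.
    positions: iterable of (chrom, pos) that should be genotyped.
    Positions absent from `called` are assumed to be homozygous reference.
    """
    return {key: called.get(key, "0/0") for key in positions}
```

The assumption only holds when a missing position really means "covered and matching the reference". For exome or panel data, a missing position may simply be uncovered.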