PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0

Recognizing as gVCF a standard VCF file #174

Closed: inti4digbi closed this issue 6 months ago

inti4digbi commented 6 months ago

Bug

$ docker run --rm -v /Users/intipedroso/tmp_data/:/pharmcat/data pgkb/pharmcat python3 pharmcat_vcf_preprocessor.py -vcf  data/sample.vcf
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
PharmCAT VCF Preprocessor version: 2.9.0
data/sample.vcf is a gVCF file, which is not currently supported.
The PharmCAT VCF Preprocessor will support block gVCF in the future.

When running the preprocessor or the pipeline, I get a warning that the input file is a gVCF. The input file is a standard VCF file containing array/chip genotypes plus imputed markers.

BinglanLi commented 6 months ago

Thank you for reporting the bug. Could you please share more about your sample file?

A file is identified as a gVCF if the INFO field contains a tag formatted as END=123456, which is a signature of a gVCF file. See examples here.
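
To make the check described above concrete, here is a minimal sketch of how an END tag in the INFO column could be detected. It is an illustration only, not the preprocessor's actual code: the function name `looks_like_gvcf` is hypothetical, and it assumes plain, tab-delimited VCF records.

```python
import re

# Hypothetical illustration, not PharmCAT's implementation.
# Matches an END=<digits> key in a semicolon-separated INFO string,
# without matching other keys that merely end in "END" (e.g. SVEND=...).
END_TAG = re.compile(r"(?:^|;)END=\d+(?:;|$)")


def looks_like_gvcf(lines):
    """Return True if any record has an END=<int> tag in its INFO column."""
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        # INFO is the 8th column of a VCF record.
        if len(fields) >= 8 and END_TAG.search(fields[7]):
            return True
    return False
```

A check of this kind flags any record that carries INFO/END, whether or not the file is a block gVCF, which is consistent with the behaviour reported in this issue.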

inti4digbi commented 6 months ago

Hi, yes, I can provide the first part of the file, but I am not sure it is needed in this case.

I have checked and, as you said, the variants with an END=123456 tag in the INFO field trigger the error. As far as I know, this field is part of the standard VCF format description, as per https://samtools.github.io/hts-specs/VCFv4.2.pdf section 1.4.1 number 8.

Is there a preferred practice to deal with this?

BinglanLi commented 6 months ago

VCF v4.3 specs provide a detailed description of the INFO/END field:

END: End reference position (1-based), indicating the variant spans positions POS–END on reference/contig CHROM. Normally... no END INFO field is needed. However when symbolic alleles are used, e.g. in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown.

Do you have a non-variant block or a structural variant in your VCF? Could you please share with us the line of the variant with the INFO/END field (without the genotypes)? It helps us understand the issue.

The check on the INFO/END field is a safety check for gVCF. If you are confident that the file is a VCF and has undergone sufficient QC procedures, you can strip the INFO field from your VCF.
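
One narrower way to act on this suggestion is to drop only the END key and keep the rest of INFO. The sketch below shows that idea under stated assumptions: the helper name `strip_info_end` is hypothetical, records are tab-delimited, and header lines (including any `##INFO=<ID=END,...>` definition) are passed through unchanged.

```python
def strip_info_end(line):
    """Return a VCF line with the END key removed from its INFO column.

    Hypothetical helper for illustration; header lines and records with
    fewer than 8 columns are returned unchanged.
    """
    if line.startswith("#"):
        return line
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # preserve the original line ending
    fields = body.split("\t")
    if len(fields) < 8:
        return line
    # Keep every INFO entry except END (with or without a value).
    kept = [kv for kv in fields[7].split(";")
            if kv != "END" and not kv.startswith("END=")]
    # An empty INFO column is written as "." in VCF.
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields) + ending
```

Applied line by line to an uncompressed VCF, this leaves genotypes and all other annotations untouched.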

inti4digbi commented 6 months ago

Hi, here is an example

1   944010  rs764300897 GGA G   .   .   END=944012

The END field is mostly used to track the end position on the original annotation file. We can ignore it for this analysis.
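
For the example record above, END coincides with the last base of the REF allele (944010 + 3 - 1 = 944012), so it adds no span information beyond POS and REF. A minimal sketch of that check, assuming a tab-delimited record; the function name `end_is_redundant` is hypothetical:

```python
def end_is_redundant(line):
    """Return True if INFO/END equals POS + len(REF) - 1 for this record.

    Hypothetical helper for illustration. Returns False when the record
    has no END=<int> entry or when END points somewhere else.
    """
    fields = line.rstrip("\r\n").split("\t")
    pos, ref, info = int(fields[1]), fields[3], fields[7]
    for kv in info.split(";"):
        if kv.startswith("END="):
            # Last reference base covered by the REF allele (1-based).
            return int(kv[len("END="):]) == pos + len(ref) - 1
    return False
```

Records for which this returns False (for example, non-variant blocks or structural variants with symbolic alleles) are the ones where END carries span information that cannot be derived from REF.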