PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0
120 stars 39 forks source link

PharmCAT Bug in parsing multiallelic variants #139

Closed anh151 closed 1 year ago

anh151 commented 1 year ago

Hello again, PharmCAT v2.4.0 Bcftools/bgzip v1.16 JDK v17.0.4.1

Consider a sample which carriers the TPMT*16 allele.

The multisample VCF shows the following:

CHROM | POS | REF | ALT | 123456

-- | -- | -- | -- | -- chr6 | 18138969 | C | G | 0/0:45:81:PASS:36,0 chr6 | 18138969 | C | T | 0/1:45:81:PASS:22,14

The single sample VCF output from the preprocessor shows the following:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | 123456

-- | -- | -- | -- | -- | -- | -- | -- | -- | -- chr6 | 18138969 | rs144041067 | C | G,T | -10 | PASS | PX=TPMT | GT:GQ:RGQ:FT:AD | 2/0:45:81:PASS:36,0

However, the output from PharmCAT is *1/*1 with the following warning: chr6:18138969 Discarding genotype at this position because GT field indicates heterozygous (2/0) but AD field indicates homozygous (36,0)

PharmCAT v2.2.1 did not have this issue. I am having a similar issue for the Mexico City and Vanua Lava alleles in G6PD, 27 in CYP2C9, 36 and *37 in UGT1A1 and likely a few others

-Andrew

BinglanLi commented 1 year ago

Hi Andrew, thanks for reaching out. Hope you're doing well. I am investigating the issue right now. For my convenience, could you please share the header lines about the FORMAT fields? I don't think it would change anything, but it's good to be inclusive in a test.

This seems to be an error by bcftools but it's unclear whether it's something that needs to be reported back to bcftools. I haven't been able to replicate the error and am trying a few more things.

BinglanLi commented 1 year ago

Hey Andrew, I tried with an artificial file with the exact content that you shared (including RGQ) to see if there is any error with v2.4.0. I also tried a test file with two samples to see if it's something to do with splitting into single-sample files. I couldn't replicate the error. The AD field both turned out to be in the expected format. The issue may lie with something else. Is it possible for you to check the AD field in the intermediate files?

anh151 commented 1 year ago

Hi Binglan,

Header from the multi-sample VCF (only first 11 lines):

##hailversion=0.2.107-2387bb00ceee
##FORMAT=<ID=GT,Number=1,Type=String,Description="">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="">
##FORMAT=<ID=FT,Number=1,Type=String,Description="">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="">
##INFO=<ID=AC,Number=.,Type=Integer,Description="">
##INFO=<ID=AF,Number=.,Type=Float,Description="">
##INFO=<ID=AN,Number=1,Type=Integer,Description="">
##INFO=<ID=homozygote_count,Number=.,Type=Integer,Description="">

Header from the single-sample VCF:

##FILTER=<ID=PASS,Description="All filters passed">
##hailversion=0.2.107-2387bb00ceee
##FORMAT=<ID=GT,Number=1,Type=String,Description="">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="">
##FORMAT=<ID=FT,Number=1,Type=String,Description="">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="">
##contig=<ID=chr1,length=248956422,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr2,length=242193529,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr3,length=198295559,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr4,length=190214555,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr5,length=181538259,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr6,length=170805979,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr7,length=159345973,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr8,length=145138636,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr9,length=138394717,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr10,length=133797422,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr11,length=135086622,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr12,length=133275309,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr13,length=114364328,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr14,length=107043718,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr15,length=101991189,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr16,length=90338345,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr17,length=83257441,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr18,length=80373285,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr19,length=58617616,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr20,length=64444167,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr21,length=46709983,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chr22,length=50818468,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrX,length=156040895,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrY,length=57227415,assembly=GRCh38.p13,species="Homo sapiens">
##contig=<ID=chrM,length=16569,assembly=GRCh38.p13,species="Homo sapiens">
##INFO=<ID=PX,Number=.,Type=String,Description="Gene">
##FILTER=<ID=PCATxREF,Description="Reference allele does not match PharmCAT reference alleles">
##FILTER=<ID=PCATxALT,Description="Alternate alleles do not match PharmCAT alternate alleles">
##FILTER=<ID=PCATxINDEL,Description="Unexpected format for INDELs">

What are the intermediate files? Would dropping the AD field help?

-Andrew

BinglanLi commented 1 year ago

Thank you. Dropping the AD field will help. We can also modify PharmCAT's behavior when it runs into a spurious AD field. I still want to see if I can replicate the error and whether it's something worth reporting to bcftools developers.

As for the intermediate files, I meant those from the VCF Preprocessor. These files are generated when you have the -k flags on for the VCF Preprocessor. They provide information that helps debug. Some of the files, e.g., an intermediate file containing the PGx regions of your choice, can be helpful if you run PharmCAT periodically to update the PGx calls but on the same freeze of genotype data. This way, you don't have to extract the same genotype data file(s) every time running PharmCAT. I am not sure if you've already been using this function or have tried it out before.

BinglanLi commented 1 year ago

Long story short, it's an issue with how the AD field is defined in the VCF file headers when the VCF is first extracted and prepared.

In your file, AD is defined as Number=., which tells bcftools that it's an unknown field. So, instead of processing the field, bcftools simply copies and pastes the field content from the previous file to the next.

The proper field definition for AD should be Number=R based on the VCF file format specs, which tells the bcftools that the field has one value for each possible allele (including the reference). This way, bcftools will update the field based on possible REF and ALT alleles when it splits or combines multiallelic loci.

# The original test input

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="">
chr6    18138969    rs144041067 C   G   .   PASS    .   GT:AD   0/0:36,0
chr6    18138969    rs144041067 C   T   .   PASS    .   GT:AD   0/1:22,14   

-------------------------------

# The inaccurate outcome caused by the `AD` field definition:

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="">
chr6    18138969    rs144041067 C   G,T .   PASS    PX=TPMT GT:AD   2/0:36,0

-------------------------------

# Input whose `AD` is fixed

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="">
chr6    18138969    rs144041067 C   G   .   PASS    .   GT:AD   0/0:36,0
chr6    18138969    rs144041067 C   T   .   PASS    .   GT:AD   0/1:22,14   

-------------------------------

# Output after the `AD` format is fixed

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="">
chr6    18138969    rs144041067 C   G,T .   PASS    PX=TPMT GT:AD   2/0:22,0,14

In the future, PharmCAT will add support for the AD header (PharmCAT already has support for the AD field in the INFO column) in accordance with the VCF file format specifications:

  1. If the AD field is present in the header, PharmCAT will obey the AD comment in the file header. a. If the AD field is defined as Number=R in the header, PharmCAT will interpret the values in the AD field as allele-specific. b. If the AD field is defined as Number=. in the header, PharmCAT will consider the field value(s) as unknown or unbound.
  2. If the AD field is missing in the header, PharmCAT will issue a warning but assume that the VCF file abides by the VCF file format and the AD field should be used as in case 1a. a. If the AD field turns out not matching the number of alleles, PharmCAT will issue a warning of the invalid AD field.