PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool
Mozilla Public License 2.0

Different genotypes for multiple genes #162

Closed nbiesot closed 10 months ago

nbiesot commented 10 months ago

Hi,

After running the PharmCAT VCF Preprocessor and the PharmCAT pipeline in the Docker container, I get a lot of different genotypes for multiple genes in the PharmCAT report/JSON. (Screenshot attached.)

The VCF was generated with DRAGEN v4.2.4 from the FASTQ files of ENA run accession ERR1955327.

What could be a cause for this output?

Thanks for the help!

whaleyr commented 10 months ago

Hi,

In general, there are a few ways of getting more information about the genotype data that should help clarify what's going on.

In the screenshot you posted, the second column contains a link to the CYP2C19 section. Clicking that link takes you to the section of the report with more information about CYP2C19 in this sample. That section has a table listing every position used to make diplotype calls and the haplotypes associated with each position. The "Call in VCF" column tells you what PharmCAT saw in your sample at that position. I'm guessing you're going to see a lot of heterozygous calls at each position. You can confirm the calls by looking up those positions in your VCF file yourself.

Another thing you can do is output the matcher HTML report. This is a file we typically don't output but you can force PharmCAT to write it out using the --matcher-save-html option (as documented in the "Advanced Usage" section). This report is very information-dense and not the most user-friendly but it shows all the data used by the NamedAlleleMatcher to make diplotype calls. That will have a section for CYP2C19.

Finally, you could just look at the VCF file that you fed into PharmCAT. Take a look at the region for CYP2C19. I'm guessing you're going to see either an abnormal number of het calls or some other problem in the VCF that's causing PharmCAT to read it inaccurately.
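As a rough illustration of that last check, here is a minimal Python sketch that tallies genotype classes for one sample inside a region of a VCF. It is not part of PharmCAT. The function names, the region bounds, and the example records are assumptions made for illustration; the bounds were simply chosen to cover the two CYP2C19 positions mentioned in this thread and are not an authoritative gene boundary.

```python
from collections import Counter


def classify_gt(gt: str) -> str:
    """Classify a VCF GT string such as '0/1', '1|1' or './.'."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return "missing"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"


def summarize_region(vcf_lines, chrom, start, end, sample_index=0):
    """Count genotype classes for one sample within chrom:start-end (1-based, inclusive)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom or not (start <= int(fields[1]) <= end):
            continue
        format_keys = fields[8].split(":")
        sample_values = fields[9 + sample_index].split(":")
        # Treat a record without a GT field as missing.
        gt = sample_values[format_keys.index("GT")] if "GT" in format_keys else "."
        counts[classify_gt(gt)] += 1
    return counts


# Made-up records for illustration only (not taken from the reporter's VCF).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr10\t94762706\trs28399504\tA\tG\t.\tPASS\t.\tGT\t./.",
    "chr10\t94842866\trs3758581\tA\tG\t.\tPASS\t.\tGT\t1/1",
    "chr1\t100\t.\tC\tT\t.\tPASS\t.\tGT\t0/1",
]
counts = summarize_region(example, "chr10", 94760000, 94860000)
print(dict(counts))
```

On a real file you would pass the lines of the uncompressed VCF (for example from `open(path)`). A tally dominated by "het" or "missing" would point to the kind of input problem described above.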

nbiesot commented 10 months ago

Thank you for the quick response @whaleyr. After clicking on the link in the report for CYP2C19, I observed that the variant rs3758581 (94842866A>G) was found. This variant is present in many alleles (*1, *2, *3, *4, *5, *6, *7, *8, *9, *10, *11, *12, *13, *14, *15, *17, *18, *19, *22, *23, *24, *25, *26, *28, *29, *31, *32, *33, *35, *39). Except for *1, it occurs in combination with other variants (for example, *4 is 94762706A>G + 94842866A>G, according to PharmVar). Following that, there is a list of variants that were not found, leading to the exclusion of the related alleles (for example: chr10:94762706, rs28399504, Missing, A, *4). This repeats for all known CYP2C19 alleles: a variant is not found, so the allele is excluded. (Screenshot attached.)

Despite these exclusions, the report still lists genotypes for all combinations of *1, *2, *3, *4, *5, *6, *7, *8, *9, *10, *11, *12, *13, *14, *15, *17, *18, *19, *22, *23, *24, *25, *26, *28, *29, *31, *32, *33, *35, *39.

whaleyr commented 10 months ago

Right, that makes sense. Since this one position is homozygous alt and the rest of the positions are missing, there's no way for the matcher to know which of these diplotypes match. Missing information means we can't rule out any of those possibilities. CYP2C19 is an atypical gene in that its "reference" allele, *1, includes one non-reference variant (for most genes, the reference haplotype carries the reference allele at every position).
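To make that reasoning concrete, here is a toy Python sketch of why missing positions leave every diplotype on the table. It is a simplification for illustration and not the NamedAlleleMatcher's actual algorithm: it ignores phasing, uses only the two positions and the two allele definitions quoted in this thread (*1 and *4), and all names in it are hypothetical.

```python
from itertools import combinations_with_replacement

# Reference allele at each position considered (only the two positions from this thread).
REF = {94762706: "A", 94842866: "A"}

# Toy haplotype definitions: only positions that differ from the reference are listed.
HAPLOTYPES = {
    "*1": {94842866: "G"},
    "*4": {94762706: "G", 94842866: "G"},
}


def allele_at(haplotype, pos):
    """Allele a haplotype carries at pos, falling back to the reference allele."""
    return HAPLOTYPES[haplotype].get(pos, REF[pos])


def consistent_diplotypes(sample):
    """Return diplotypes not contradicted by the sample's unphased calls.

    sample maps position -> (allele, allele); a position absent from the
    mapping is treated as missing and therefore cannot rule anything out.
    """
    matches = []
    for h1, h2 in combinations_with_replacement(sorted(HAPLOTYPES), 2):
        contradicted = False
        for pos in REF:
            call = sample.get(pos)
            if call is None:
                continue  # missing data: no evidence for or against
            expected = sorted([allele_at(h1, pos), allele_at(h2, pos)])
            if sorted(call) != expected:
                contradicted = True
                break
        if not contradicted:
            matches.append(f"{h1}/{h2}")
    return matches


# Only rs3758581 is called (homozygous alt); the other position is missing.
sparse = {94842866: ("G", "G")}
# Same sample, but the *4-defining position is explicitly called homozygous reference.
complete = {94762706: ("A", "A"), 94842866: ("G", "G")}

print(consistent_diplotypes(sparse))
print(consistent_diplotypes(complete))
```

With the sparse sample every diplotype in the toy table survives, while adding an explicit reference call at 94762706 narrows the result to *1/*1. That is the same effect seen in the report, just with two alleles instead of thirty.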

Looks like you've figured it out so I'm going to close this but feel free to comment or reopen if you have more questions.