Compare genes from VEP and genes from PGKB

apriltuesday commented 11 months ago

Currently the pipeline only includes genes from VEP, but outputs both sets of genes where they differ on a single annotation. We should investigate any discrepancies and plan any necessary next steps.

apriltuesday commented 5 months ago

Assuming we don't flag any problematic discrepancies and hence can confirm the PGKB gene has the same semantics as the VEP gene, OT has requested that we continue to report the VEP genes but fall back on the PGKB genes when VEP fails to provide one.

apriltuesday commented 4 months ago

@tcezard Quick counts from the 2023.12 submission - this is a count of annotations, including only records with RS IDs and only where VEP genes and PGKB genes don't match (so the 770 from here):

Counter({'vep_superset': 140,
         'vep_subset': 273,
         'mismatch': 106,
         'vep_missing': 182,
         'pgkb_missing': 61,
         'divergent_match': 8})

These are "CMAT paper"-style categories, so I think the case we're really worried about is mismatch (though maybe not? what do you think?)

Here is a spreadsheet with just the mismatches, in case you can pick out any patterns.

I think it would also be good to get how many records have both missing, as those are cases where adding PGKB genes won't help... I think for that I would have to rerun the pipeline though.

apriltuesday commented 4 months ago

Updated counts, now including exact_match and both_missing - this is also using newer data, though it doesn't make a big difference.

Counter({'exact_match': 3535,
         'vep_superset': 143,
         'vep_subset': 271,
         'both_missing': 196,
         'mismatch': 106,
         'vep_missing': 185,
         'pgkb_missing': 61,
         'divergent_match': 8})

Also updated the spreadsheet with variant locations for the 106 mismatches, and added a tab with the only 12 records where PGKB has a gene but Biomart could not get Ensembl IDs.

Notebook is viewable here if you're interested...

tcezard commented 4 months ago

Thank you for the counts and the spreadsheet.

Assumptions

I'm going to start by assuming that VEP is providing the correct answer to the question we're asking which is "what is the gene impacted by the variant?" The point of this exercise is to confirm that VEP and PGKB are providing an answer to the same question and that PGKB is answer is similarly accurate.

Added Contribution of PharmGKB genes

There are 4505 variants being queried here 4063 of which have gene associated that can be compared between VEP and PGKB. Overall the two methods are giving similar results 3535 have perfect match, (78.4% total, 87.0% testable) and 3957 (87.8% total, 97.4% testable) are in relative agreement.

Assuming we would use VEP primarily and only add PGKB gene when we cannot get results from VEP. VEP provide a gene in 91.3% of the cases and adding PGKB results would add 4.1%

Accuracy of PharmGKB genes

I looked in more detail at some cases where VEP and PGKB disagree in the spreadsheet and it looks like PGKB genes are off by a couple of KBs. In some case they provide a related gene to the one VEP provides and in others they provide a completely different but it is usually within the vicinity (but not overlapping) of the variant.

This makes me think that PGKB might be using an older (or at least different) set of annotation compare to VEP. My concern is that adding the 4.1% of genes found in PGKB and not in VEP is very likely enriched in gene that would be divergent between the two annotation sets.

Conclusion

I'm not convinced that pulling the gene from PGKB and adding it to the same field in the schema is a sensible thing to do. I think we would loose the ability to know the mechanism by which this gene was assign to the variant and therefore not be able to provide provenance. This said OT might disagree.

apriltuesday commented 4 months ago

I've also looked into why the variant specifically highlighted in this OT issue didn't get an annotation from VEP. The reason in this case is because we failed to resolve the reference allele and thus couldn't query VEP with the coordinates. The location provided by PGKB is NC_000001.11:97740411_97740418 (same as the one provided in dbSNP) which corresponds to the sequence GATGAATGA in the reference (with context base added). But the annotated alleles in the clinical annotation are ATGA,DEL. Since we can't match either of these to the reference, we fail to generate the complete coordinates.

So if falling back on PGKB genes isn't a reliable option, as Tim suggests above, two more options would be:

Implement a more clever reference allele determination algorithm. Some of the complications here were discussed back in this issue, and I've added a tab to the spreadsheet with the "no genotype ID" errors I get from the pipeline (repeats are a known issue which we'll get to at some point)
Query VEP with just position information to get just the overlapping gene. This resolves the most pressing issue and I'm guessing it would be accurate since the positions come from dbSNP, but we wouldn't get a functional consequence and wouldn't have a genotype ID.

@DSuveges @ireneisdoomed @tskir, please take a look and let us know your thoughts.

apriltuesday commented 3 months ago

Closing this as the PGKB vs. VEP comparison is complete, but will continue the work to improve gene coverage under the new Github issue.

EBIvariation / opentargets-pharmgkb