Extract direction of effect from variant annotations

apriltuesday commented 1 year ago

PGKB says they now provide the deconstructed sentences for the clinical annotations in their downloads, so it should be possible to pull out directionality related to the genotype-phenotype association.

apriltuesday commented 6 months ago

I have finally located this: it is in the variant annotations, which look very useful. Here is one example, with a few columns excluded:

Variant/Haplotypes	Gene	Drug(s)	Phenotype Category	Sentence	Alleles	isPlural	Is/Is Not associated	Direction of effect	Functional terms	Gene/gene product	When treated with/exposed to/when assayed with	Multiple drugs And/or	Cell type	Comparison Allele(s) or Genotype(s)
rs1349931378	CYP2C19	mephenytoin, omeprazole	Metabolism/PK	Allele A is associated with decreased activity...	A	Is	Associated with	decreased	activity of	CYP2C19	when assayed with	or	in 293FT cells	G
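For anyone who wants to poke at these files, here is a minimal sketch of reading such a row with the Python standard library. It uses only the header and row shown above (with the sentence truncated as in the example); the function name `read_annotations` is made up for illustration, and the real download may need different quoting or encoding settings.

```python
import csv
import io

# Header and row copied from the example above (some columns were excluded there).
SAMPLE_TSV = (
    "Variant/Haplotypes\tGene\tDrug(s)\tPhenotype Category\tSentence\tAlleles\t"
    "isPlural\tIs/Is Not associated\tDirection of effect\tFunctional terms\t"
    "Gene/gene product\tWhen treated with/exposed to/when assayed with\t"
    "Multiple drugs And/or\tCell type\tComparison Allele(s) or Genotype(s)\n"
    "rs1349931378\tCYP2C19\tmephenytoin, omeprazole\tMetabolism/PK\t"
    "Allele A is associated with decreased activity...\tA\tIs\tAssociated with\t"
    "decreased\tactivity of\tCYP2C19\twhen assayed with\tor\tin 293FT cells\tG\n"
)


def read_annotations(handle):
    """Return a list of dicts, one per variant annotation row, keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_annotations(io.StringIO(SAMPLE_TSV))
row = rows[0]

# Pull out the pieces that make up the "direction of effect" statement.
summary = (
    row["Alleles"],
    row["Direction of effect"],
    row["Functional terms"],
    row["Gene/gene product"],
)
print(summary)
```

The same function should work on an open file handle for one of the downloaded tables, assuming the file is plain tab-separated text.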

apriltuesday commented 4 months ago

The following is a summary of the notebook here, looking into variant annotation tables generally and direction of effect more specifically. @tcezard @dsuveges @ireneisdoomed @tskir Please have a look if you like and leave your comments/questions, otherwise we'll go through this notebook at a subsequent meeting.

Overview

PharmGKB provides 3 variant annotation tables, described in the readme as follows:

var_pheno_ann.tsv: Contains associations in which the variant affects a phenotype, with or without drug information. [13,517 rows in the 2024-05 data dump]

var_drug_ann.tsv: Contains associations in which the variant affects a drug dose, response, metabolism, etc. [11,901 rows]

var_fa_ann.tsv: Contains in vitro and functional analysis-type associations. [2,009 rows]

These variant annotations (plus drug labels/guidelines, not covered here) provide evidence for the clinical annotations, connected via an evidence ID. The three types of variant annotations have different but overlapping columns; in general, each row describes an assertion made by a publication (PMID) about the effect of one or more allele/genotypes. See examples here and some breakdown of how the information is provided here.

Taken together, the variant annotations provide evidence for all clinical annotations and some kind of direction of effect for nearly all (>96%). If we select either of the larger tables, we get evidence for about half of all clinical annotations.

Direction of effect representation

If we focus on only one of these tables, we need "only" go through an exercise of selecting which columns we want to extract, and which (if any) we want to map. If we want to use all three tables, since the column schema differs among them, we could do the same thing and have lots of optional attributes to cover all three types of annotation, or just use the free-text sentence that is provided for each annotation.

If we want to use all three tables but present them in a unified and structured way, we need to come up with a generic representation of what "direction of effect" means. Based on the sentence breakdown I came up with one suggestion, but this obviously would require much more discussion:

1. Direction of effect (decreased, increased, or none)
2. PD/PK term | Functional term | Side effect/efficacy/other (e.g. response to, activity of, severity of)
3. Drug | Gene/gene product | Phenotype (e.g. ivacaftor, CFTR, side effect: bone density)

This tells us the direction (1) and what the effect is (2&3). Of course we need some other fields to connect things, but this could be the core "direction of effect" concept.

Of the above, (1) is always either "decreased" or "increased", (2) takes values in a relatively small but not fixed vocabulary (could perhaps be mapped to an EFO term), and (3) is open and would probably ideally be mapped. The values that appear can be found here.
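To make the three-part suggestion concrete, here is a rough sketch of how it could look as a data structure. This is only an illustration: the class and field names (`DirectionOfEffect`, `effect_term`, `target`) are hypothetical, and the mapping function covers only the functional-analysis columns that appear in the example row earlier in this thread. The other two tables would need their own mapping.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these two values count as a direction; anything else is "none".
VALID_DIRECTIONS = {"decreased", "increased"}


@dataclass(frozen=True)
class DirectionOfEffect:
    direction: Optional[str]  # (1) "decreased", "increased", or None when absent
    effect_term: str          # (2) e.g. "activity of", "response to", "severity of"
    target: str               # (3) drug, gene/gene product, or phenotype


def from_functional_row(row):
    """Build a DirectionOfEffect from a dict keyed by functional-analysis column names."""
    raw = (row.get("Direction of effect") or "").strip().lower()
    direction = raw if raw in VALID_DIRECTIONS else None
    effect_term = (row.get("Functional terms") or "").strip()
    target = (row.get("Gene/gene product") or "").strip()
    return DirectionOfEffect(direction, effect_term, target)


example = from_functional_row(
    {
        "Direction of effect": "decreased",
        "Functional terms": "activity of",
        "Gene/gene product": "CYP2C19",
    }
)
print(example)
```

Keeping (2) and (3) as plain strings here leaves the question of mapping them to ontology terms open.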

Allele / genotype representation

This is really about how to connect the direction of effect to our current evidence strings. Associating variant annotations to clinical annotations via PMID or evidence ID is the most straightforward method. Logically, however, direction of effect annotations should be allele / genotype specific, so I looked briefly into how these are represented in the new tables - basically, can we get a direction of effect per genotype or haplotype ID?

I think this is mostly doable, as the representation is consistent with what we've seen in the clinical annotations tables. The exception is that sometimes they've provided a metabolizer type instead of an allele/genotype for comparison in the variant annotation. See here for an example. Honestly I have no idea how we could handle these right now.

Some questions to consider

Which tables do we want to include?
- Basically, are we interested in all the types of effects described?
Which columns do we want to use? (for v1 at least)
- Direction of effect can be quite a complex concept, what parts do we want to use as-is (strings) vs. map to ids vs. omit entirely?
How do we want to make the association with clinical annotations?
- via allele/genotype, via PMID, something else?
- aggregated (as PMID is currently) or exploded (as allele/genotype)?

apriltuesday commented 3 months ago

As discussed, this spreadsheet contains examples of clinical annotations with their allele/genotype annotations and variant annotations. There are 4 clinical annotations in total, including one with "metabolizer type" comparisons which we didn't have time to cover in the meeting. The data is split into tabs by variant annotation type and includes all columns, though I've hidden some to make things a bit easier to read.

Let me know any questions or thoughts you have!

apriltuesday commented 2 days ago

@tcezard @DSuveges @ireneisdoomed @tskir - I've been looking into how to associate variant annotations with clinical annotations at the level of genotype/allele. We can go over this together in our next meeting, but meanwhile feel free to leave your thoughts and questions.

I've done a rough proof-of-concept of what the associations might look like. The algorithm is not particularly clever: it basically amounts to decomposing genotype/allele strings until we get to alleles, and doing exact string matching to line things up. It also makes a core assumption, which is that annotations on alleles can be associated with any genotype containing that allele. So if we have variant annotations on alleles C and T, both will be associated with genotype CT. Conversely, if we have a variant annotation on genotype *1/*18, it will be associated with alleles *1 and *18.

You can see the algorithm itself in the notebook here and the results on a handful of examples in the spreadsheet here (with columns removed for readability).
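For readers without the notebook to hand, the idea can be sketched in a few lines. This is not the notebook code, just a simplified illustration of the decompose-and-match approach under some assumptions of mine: genotypes are either two single-nucleotide alleles written together (e.g. CT) or alleles separated by "/" (e.g. *1/*18), and anything else is treated as a single allele. The function names `decompose` and `associate` are hypothetical.

```python
import re


def decompose(genotype):
    """Break a genotype/allele string into a set of allele strings."""
    genotype = genotype.strip()
    if "/" in genotype:
        # e.g. "*1/*18" -> {"*1", "*18"}
        return {part.strip() for part in genotype.split("/") if part.strip()}
    if re.fullmatch(r"[ACGT]{2}", genotype):
        # e.g. "CT" -> {"C", "T"}; a homozygous "CC" collapses to {"C"}
        return set(genotype)
    # Anything else (single nucleotide, star allele, ...) is kept whole.
    return {genotype} if genotype else set()


def associate(clinical_genotypes, variant_annotation_alleles):
    """Map each clinical genotype to the variant annotation strings sharing an allele."""
    result = {}
    for genotype in clinical_genotypes:
        alleles = decompose(genotype)
        result[genotype] = [
            va for va in variant_annotation_alleles if alleles & decompose(va)
        ]
    return result


# Annotations on alleles C and T are both attached to genotype CT.
snp = associate(["CC", "CT", "TT"], ["C", "T"])
# An annotation on genotype *1/*18 is attached to alleles *1 and *18, but not *2.
star = associate(["*1", "*18", "*2"], ["*1/*18"])
print(snp)
print(star)
```

A two-base insertion allele such as "GA" would be wrongly split by this sketch, which is one example of the kind of tricky case a real implementation has to handle.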

I also ran this on the entire dataset and found it could successfully associate 93.7% of variant annotations, and it found at least one variant annotation for 65.1% of clinical annotation genotypes. There are definitely some tricky cases I know it fails on, and probably some I don't know about! Also, the approximately 2/3 clinical annotation genotype coverage is not unexpected given that we often don't have variant annotations for the ref/ref genotype.

I've got at least a couple questions for us to discuss:

Does the assumption highlighted above (about genotypes/allele associations) seem reasonable?
What should we do about ref/ref genotypes?

tcezard commented 2 hours ago

I think we should start by reiterating what we are aiming to get out of the association between genotype and variant annotation:

A set of decomposed statements for each genotype that are hopefully concordant and can be summarised programmatically (assuming all the statements have the same direction of effect)
A set of statements associated with each genotype that can be fed to ChatGPT to help summarise the direction of effect
A set of publications associated with each genotype to give confidence in each statement as they could provide independent sources.
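The first and third aims could look roughly like the sketch below: group the statements by genotype, report a direction only when all statements for that genotype agree, and keep the set of supporting publications. The function name, the tuple layout, and the placeholder identifiers (`PMID_A` etc.) are all made up for illustration and are not real data.

```python
from collections import defaultdict


def summarise_by_genotype(statements):
    """Summarise (genotype, direction, pmid) tuples per genotype.

    The direction is reported only if every statement with a direction agrees;
    otherwise it is None. Statements without a direction are ignored for the
    consensus but their publication is still recorded.
    """
    directions = defaultdict(set)
    pmids = defaultdict(set)
    for genotype, direction, pmid in statements:
        if direction:
            directions[genotype].add(direction)
        pmids[genotype].add(pmid)

    summary = {}
    for genotype in pmids:
        observed = directions[genotype]
        consensus = next(iter(observed)) if len(observed) == 1 else None
        summary[genotype] = {
            "direction": consensus,
            "pmids": sorted(pmids[genotype]),
        }
    return summary


# Placeholder statements: CT is concordant, TT has conflicting directions.
statements = [
    ("CT", "decreased", "PMID_A"),
    ("CT", "decreased", "PMID_B"),
    ("TT", "decreased", "PMID_A"),
    ("TT", "increased", "PMID_C"),
]
result = summarise_by_genotype(statements)
print(result)
```

Discordant genotypes come back with no direction, which would be the natural set to hand over for manual or model-assisted summarisation.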

This being said, the algorithm is good enough for a first pass and without starting to explode the * alleles in their individual rs components we might not get much better.

For the ref/ref genotypes, they should be associated with the evidence that mentions this allele and not the other. OT could, on its side, highlight this genotype as ref/ref and therefore not associated with any evidence.

I thought we said previously that all clinical annotations had at least one piece of evidence associated from one of the 3 sources. Is the 65% due to the fact that we are counting the exploded clinical annotations per genotype ID?

apriltuesday commented 1 hour ago

I thought we said previously that all clinical annotations had at least one piece of evidence associated from one of the 3 sources. Is the 65% due to the fact that we are counting the exploded clinical annotations per genotype ID?

Yes that's exactly it. I just checked coverage of clinical annotations not exploded by genotype, and it's 99.3%. (Of course if we just wanted to associate variant annotations with clinical annotations, not exploding by allele/genotype, our coverage would be 100% as we can do this directly by IDs...)
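To illustrate the difference between the two ways of counting, here is a toy sketch. The identifiers and matches are placeholders, not real data, and the resulting numbers are not the dataset figures quoted above; the function name `coverage` is hypothetical.

```python
def coverage(matches_by_annotation):
    """Return (per-genotype coverage, per-annotation coverage).

    matches_by_annotation maps a clinical annotation ID to a dict of
    {genotype: [matched variant annotation IDs]}.
    """
    genotype_flags = [
        bool(matched)
        for genotypes in matches_by_annotation.values()
        for matched in genotypes.values()
    ]
    # An annotation counts as covered if any of its genotypes has a match.
    annotation_flags = [
        any(genotypes.values()) for genotypes in matches_by_annotation.values()
    ]
    if not genotype_flags or not annotation_flags:
        return 0.0, 0.0
    per_genotype = sum(genotype_flags) / len(genotype_flags)
    per_annotation = sum(annotation_flags) / len(annotation_flags)
    return per_genotype, per_annotation


# Toy input: the ref/ref-like genotypes (CC, *1/*1) have no variant annotation.
toy = {
    "CA_1": {"CC": [], "CT": ["VA_1"], "TT": ["VA_1"]},
    "CA_2": {"*1/*1": [], "*1/*18": ["VA_2"]},
}
per_genotype, per_annotation = coverage(toy)
print(per_genotype, per_annotation)
```

In the toy input every clinical annotation is covered, but only three of the five genotypes are, which is the same effect as in the real numbers: unmatched ref/ref genotypes pull down the exploded figure.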

EBIvariation / opentargets-pharmgkb