EBIvariation / opentargets-pharmgkb

Pipeline to provide evidence strings for Open Targets from PharmGKB
Apache License 2.0

Investigate clinical annotations without RS IDs #28

Open apriltuesday opened 9 months ago

apriltuesday commented 9 months ago

For example CYP2C9*1. Look into the data and discuss how we should represent these.

apriltuesday commented 7 months ago

Some earlier investigation here, particularly on the use of PharmVar.

See also representation in PharmGKB itself, including an allele definition spreadsheet we could leverage.

apriltuesday commented 6 months ago

Notebook here, looking at how named/star alleles are annotated and a bit at how we might resolve them to specific variants. Summary and some questions are at the bottom.

@M-casado @tcezard any thoughts? What else should we look into before discussing with OT?

tcezard commented 6 months ago

The Notebook looks great and you clearly highlight the main issues. My main concern is around the different way the Phenotype is associated with:

We could pass that heterogeneity to OT, but I think it is worth highlighting.

M-casado commented 6 months ago

As always, very good job on the thorough analysis with the metrics in a notebook, @apriltuesday. I also have to say that I like all the comments in it, it makes it almost like a novel of a coder's mind when exploring data.

> is it safe to just comma-split these strings

I'm pretty sure it's not, given that some names seem to contain commas themselves (e.g. Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham). I would instead use the "gene name" as the breaking token, since they seem to add it at the beginning of each variant name.
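A minimal sketch of the gene-name-as-breaking-token idea, assuming every allele name in the string starts with the gene symbol followed by `*` or a space. The function name `split_alleles`, the placeholder gene `GENEX` and the sample strings are illustrative and not taken from the notebook.

```python
import re
from typing import List


def split_alleles(allele_string: str, gene: str) -> List[str]:
    """Split a multi-allele string on commas that precede the gene symbol.

    ASSUMPTION: every allele name begins with the gene symbol followed by
    either '*' (star alleles) or whitespace (named alleles).
    """
    # The lookahead keeps the gene symbol attached to the allele that follows.
    delimiter = r",\s*(?=" + re.escape(gene) + r"(?:\*|\s))"
    parts = re.split(delimiter, allele_string)
    return [part.strip() for part in parts if part.strip()]


# Illustrative input: the second allele name itself contains commas.
example = (
    "GENEX Alpha, "
    "GENEX Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham"
)
alleles = split_alleles(example, "GENEX")
for allele in alleles:
    print(allele)
```

A plain `example.split(",")` would break the second name into six fragments, whereas splitting only where the gene symbol follows the comma keeps it whole.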

Now that I have seen the PGKB API (api.pharmgkb.org/v1), I reckon it may be feasible to parse their spreadsheets and extract some metrics from them. An approach we took for a similar issue at the EGA was to parse the first column of the spreadsheets, assign row numbers to the rows of interest (e.g. rsID) and interpret them that way. That said, there are not so many genes with allele tables for now (pharmgkb_genes).
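The first-column approach could look roughly like the sketch below. It assumes the spreadsheet has already been loaded into a list of rows (for example with a spreadsheet reader of your choice) and that the labels of interest appear at the start of the first cell. The function name `locate_rows`, the table layout and the placeholder rsIDs are assumptions for illustration.

```python
from typing import Dict, List, Optional, Sequence

Row = Sequence[Optional[str]]


def locate_rows(rows: Sequence[Row], labels: Sequence[str]) -> Dict[str, int]:
    """Map each label to the index of the first row whose first cell
    starts with that label (case-insensitive)."""
    found: Dict[str, int] = {}
    for index, row in enumerate(rows):
        first_cell = (row[0] or "").strip().lower() if row else ""
        for label in labels:
            if label not in found and first_cell.startswith(label.lower()):
                found[label] = index
    return found


# Toy table with placeholder values; the real layout may differ.
toy_rows: List[Row] = [
    ["GENE: CYP2C9", None, None],
    ["rsID", "rs0000001", "rs0000002"],
    ["CYP2C9*1", "C", "A"],
    ["CYP2C9*2", "T", None],
]

positions = locate_rows(toy_rows, ["rsID", "GENE"])
# Once the rsID row is located, its remaining cells give the variant columns.
rsids = [cell for cell in toy_rows[positions["rsID"]][1:] if cell]
print(positions, rsids)
```

Locating rows by label rather than by fixed position means a table with an extra header line would still be read correctly, as long as the labels themselves stay stable.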

Speaking of, is there a directory with all download/file/attachment/ files? Just in case allele_definition_url.format(gene=gene) does not always work because of an unexpected filename (e.g. someone put an underscore in the filename or something) and we are counting fewer genes than we should. Although I assume you did due diligence, since you also mention 90% of the non-rsIDs have the tables, so the gap shouldn't be big if there is one at all. It's probably just my lack of trust in files without proper naming conventions.
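One way to make that gap visible is to record, per gene, whether the templated URL resolved at all. This is only a sketch: the template below is a placeholder rather than the real PGKB path, and `exists` is a caller-supplied check (stubbed here) so no network access is assumed.

```python
from typing import Callable, Dict, Iterable, List


def audit_allele_tables(
    genes: Iterable[str],
    url_template: str,
    exists: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Sort genes by whether their templated table URL could be found."""
    report: Dict[str, List[str]] = {"found": [], "missing": []}
    for gene in genes:
        url = url_template.format(gene=gene)
        report["found" if exists(url) else "missing"].append(gene)
    return report


# Placeholder template and a stubbed existence check, for illustration only.
template = "https://example.org/files/{gene}_table.xlsx"
available = {template.format(gene="CYP2C9")}
report = audit_allele_tables(
    ["CYP2C9", "HLA-B"], template, lambda url: url in available
)
print(report)
```

The `missing` list is then a short, explicit set of genes to check by hand or to raise with PGKB, instead of a silent undercount.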

> Or is e.g. *1/first row the reference?

I'm positive it's that way, like we discussed. I checked a few of the rsIDs of CYP2D6, and the ref allele at NCBI was the one at *1.

> If so what does missing value mean?

Not sure, since I wasn't able to find an example in the spreadsheet that had the rsID and not the reference, so I couldn't compare.
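To make the two open questions concrete, here is a sketch of how a named allele might be resolved to variants under two explicit assumptions that this thread has not settled: that the `*1` row is the reference, and that a missing value means "same as the reference". The data structure, the function name `resolve_allele` and the placeholder rsIDs are hypothetical.

```python
from typing import Dict, Optional, Tuple

AlleleTable = Dict[str, Dict[str, Optional[str]]]


def resolve_allele(
    table: AlleleTable, allele: str, reference: str = "*1"
) -> Dict[str, Tuple[str, str]]:
    """Return {rsID: (reference_base, allele_base)} where the allele differs.

    ASSUMPTION 1: the row named by `reference` holds the reference bases.
    ASSUMPTION 2: a missing value (None) means "same as the reference".
    """
    reference_row = table[reference]
    allele_row = table[allele]
    differences: Dict[str, Tuple[str, str]] = {}
    for rsid, reference_base in reference_row.items():
        base = allele_row.get(rsid)
        if base is None or reference_base is None:
            continue  # treated as unchanged under assumption 2
        if base != reference_base:
            differences[rsid] = (reference_base, base)
    return differences


# Toy table with placeholder rsIDs and bases.
toy_table: AlleleTable = {
    "*1": {"rs0000001": "C", "rs0000002": "A"},
    "*2": {"rs0000001": "T", "rs0000002": None},
}
print(resolve_allele(toy_table, "*2"))
```

If a missing value turns out to mean something else (for example "not assessed"), only the `continue` branch would need to change.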

> we can rely on the "Gene" column in PGKB data

Similarly, I would advise getting as many raw tables from PGKB as we can, rather than parsing the text they produce. I'm talking especially about the annotation text field. The less text generated from structured fields we need to parse, the better.

> Note that our PGx schema uses genotype IDs not variant IDs

Sounds wacky, but if we have the variant IDs and we know the reference, could we not craft a similar genotype ID? Except for the black sheep ones with weird naming conventions (?)

> Do we want to resolve named alleles to variants, and if so how to convey this information?

Related to my question during today's meeting: we might as well ask them directly or search for why there is this legacy naming convention, when some have proper variant IDs that are not being used. There may be a good reason, or just a "hey, we didn't make the rules", to which we can adapt.

apriltuesday commented 6 months ago

Thanks Marcos, I'm on the same page as you for pretty much everything you mention. A couple specific points:

> I also have to say that I like all the comments in it, it makes it almost like a novel of a coder's mind when exploring data.

Haha I'm glad you appreciate it, I usually clean these up a bit before posting them (because you don't really want to look too closely into my mind...), but it can also be nice to keep them as a sort of "real" research notebook, in case one of you picks up on something I didn't see.

> Speaking of, is there a directory with all download/file/attachment/ files?

Yes, I'm also mistrustful of the "guess the filename" method of fetching these. I didn't find such a central location, but I did at least check that I couldn't find spreadsheets either for the genes that the code couldn't find. I think if we wanted to use this in the pipeline we should ask PGKB about a central location or API.

> Sounds wacky, but if we have the variant IDs and we know the reference, could we not craft a similar genotype ID?

I think we could craft a genotype ID for these, the problem would be associating them with the correct annotations when PGKB only has the allele annotated vs. the full genotype - like the examples Tim highlighted above. We used the genotype ID for SNPs because that was consistently the level at which they were annotated, but that's not the case here unfortunately.

I'm planning to spend a bit of time today seeing how many of the allele definition tables are informative (i.e. consist of variants rather than just "not callable"), stay tuned...

apriltuesday commented 6 months ago

Updated the notebook with informativeness, plus some basic counts of how many alleles and how many variants are contained in the tables. About 64% of the tables that we get (corresponding to 64% of the non-rs records) should list actual variants. It looks like all the "not callable" ones are HLA, which I guess is expected?
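For reference, an informativeness count along these lines can be sketched as follows. This is not the notebook's code: it assumes each table has been reduced to a list of its variant labels and that uninformative tables carry the literal text "not callable". The gene names and labels in the toy input are placeholders.

```python
from typing import Dict, List

NOT_CALLABLE = "not callable"  # ASSUMPTION: marker text used in the tables


def is_informative(variant_labels: List[str]) -> bool:
    """True if the table lists at least one label other than the marker."""
    cleaned = [
        label.strip().lower()
        for label in variant_labels
        if label and label.strip()
    ]
    return any(label != NOT_CALLABLE for label in cleaned)


def informative_fraction(tables: Dict[str, List[str]]) -> float:
    """Fraction of tables that list actual variants."""
    if not tables:
        return 0.0
    return sum(is_informative(labels) for labels in tables.values()) / len(tables)


# Toy input with placeholder labels.
toy_tables = {
    "CYP2C9": ["rs0000001", "rs0000002"],
    "HLA-B": ["not callable"],
    "GENEX": [],
}
fraction = informative_fraction(toy_tables)
print(f"{fraction:.0%} of tables are informative")
```

Treating an empty table as uninformative is a design choice here. Counting it separately would distinguish "no variants listed" from "explicitly not callable".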