EBIvariation / opentargets-pharmgkb

Pipeline to provide evidence strings for Open Targets from PharmGKB
Apache License 2.0

Issue 18: Implement genotype IDs to support variants with multiple alleles #24

Closed by apriltuesday 9 months ago

apriltuesday commented 9 months ago

Closes #18. Better expected output diff here.

Note that in this implementation, ref/ref genotypes have no consequence or gene annotated, although both are annotated on the other genotypes associated with the same variant. For example:

RSID       Genotype ID        Gene             Consequence  Annotation text
rs3766246  21_36070377_G_A,A  ENSG00000159228  SO_0001583   "AA genotype has increased risk..."
rs3766246  21_36070377_G_A,A  ENSG00000185917  SO_0001627   "AA genotype has increased risk..."
rs3766246  21_36070377_G_A,G  ENSG00000159228  SO_0001583   "AG genotype has increased risk..."
rs3766246  21_36070377_G_A,G  ENSG00000185917  SO_0001627   "AG genotype has increased risk..."
rs3766246  21_36070377_G_G,G  .                .            "GG genotype has decreased risk..."

We might need a follow-up issue to modify this behaviour.
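For readers unfamiliar with the ID format, here is a minimal sketch of how a genotype ID like the ones in the table could be assembled. The function name `genotype_id` is hypothetical, and sorting the alleles alphabetically is an assumption inferred from the rows above (AG appears as `A,G`); it is not taken from the pipeline code.

```python
def genotype_id(chrom, pos, ref, alleles):
    """Build an ID of the form chrom_pos_ref_allele1,allele2 (hypothetical helper).

    Assumption: alleles are sorted so that A/G and G/A map to the same ID.
    """
    return f"{chrom}_{pos}_{ref}_{','.join(sorted(alleles))}"


# The three genotypes of rs3766246 shown in the table above
for alleles in (["A", "A"], ["G", "A"], ["G", "G"]):
    print(genotype_id("21", 36070377, "G", alleles))
```

Sorting keeps the ID independent of the order in which the alleles were written in the annotation, so each genotype gets one ID.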

I've also added counts for multi-allelic variants, as requested by OT. I'll post the numbers once I've run the entire dataset, but here's what the report looks like for the test set:

Total clinical annotations: 10
    With RS: 9 (90.00%)
        1. Exploded by allele: 30 (3.3x)
        2. Exploded by drug: 66 (2.2x)
        3. Exploded by phenotype: 78 (1.2x)
Total evidence strings: 80
    With CHEBI: 62 (77.50%)
    With EFO phenotype: 30 (37.50%)
    With functional consequence: 39 (48.75%)
    With VEP gene: 39 (48.75%)
Gene comparisons per annotation
    With PGKB genes: 8 (80.00%)
    With VEP genes: 6 (60.00%)
    PGKB genes != VEP genes: 8 (80.00%)
Total RS: 9
    With parsed alleles: 7 (77.78%)
        With >2 alleles: 1 (14.29%)
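
For anyone checking the figures, the percentages and fold changes in the report can be reproduced with a couple of small helpers. This is only a sketch: the names `pct` and `fold` are hypothetical and are not the functions used in the pipeline.

```python
def pct(count, total):
    """Format a count with its share of the total, e.g. '9 (90.00%)'."""
    return f"{count} ({count / total:.2%})"


def fold(count, previous):
    """Format a count with its fold change over the previous step, e.g. '30 (3.3x)'."""
    return f"{count} ({count / previous:.1f}x)"


print("With RS:", pct(9, 10))
print("Exploded by allele:", fold(30, 9))
print("Exploded by drug:", fold(66, 30))
print("Exploded by phenotype:", fold(78, 66))
print("With >2 alleles:", pct(1, 7))
```

Each "exploded" count is compared with the count from the step before it, and "With >2 alleles" is a share of the RS with parsed alleles rather than of all RS.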

M-casado commented 9 months ago

> Note that in this implementation, ref/ref genotypes have no consequence or gene annotated

Couldn't we use the reference_genome term from SO? I assume getting the context gene would be fairly easy as well.

apriltuesday commented 9 months ago

> Couldn't we use the reference_genome term from SO? I assume getting the context gene would be fairly easy as well.

I was going to ask OT what they prefer, but yes, we could get the gene and return an SO term (another possibility is no_sequence_alteration).
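
One possible shape for that follow-up, as a rough sketch only: ref/ref rows borrow the genes already annotated on other genotypes of the same RSID and get a placeholder SO term. The row layout, the function name `fill_ref_genotypes`, and the choice of `no_sequence_alteration` as the placeholder are all assumptions for illustration; which term OT prefers is still open.

```python
from collections import defaultdict

# Placeholder term name; the actual choice (e.g. reference_genome) is undecided.
REF_TERM = "no_sequence_alteration"


def fill_ref_genotypes(rows):
    """Give gene-less (ref/ref) rows the genes seen on other genotypes of the same RSID.

    rows: list of dicts with keys 'rsid', 'genotype_id', 'gene', 'consequence'
    (a hypothetical layout, with None where the table shows '.').
    """
    genes_by_rsid = defaultdict(set)
    for row in rows:
        if row["gene"] is not None:
            genes_by_rsid[row["rsid"]].add(row["gene"])

    filled = []
    for row in rows:
        genes = sorted(genes_by_rsid[row["rsid"]])
        if row["gene"] is not None or not genes:
            # Already annotated, or nothing to borrow from: keep the row unchanged
            filled.append(row)
            continue
        # One output row per borrowed gene, mirroring the non-ref genotypes
        for gene in genes:
            filled.append({**row, "gene": gene, "consequence": REF_TERM})
    return filled


rows = [
    {"rsid": "rs3766246", "genotype_id": "21_36070377_G_A,A",
     "gene": "ENSG00000159228", "consequence": "SO_0001583"},
    {"rsid": "rs3766246", "genotype_id": "21_36070377_G_A,A",
     "gene": "ENSG00000185917", "consequence": "SO_0001627"},
    {"rsid": "rs3766246", "genotype_id": "21_36070377_G_G,G",
     "gene": None, "consequence": None},
]
for row in fill_ref_genotypes(rows):
    print(row["genotype_id"], row["gene"], row["consequence"])
```

This borrows genes from sibling genotypes rather than querying for the context gene directly, which keeps the sketch self-contained; a lookup of the context gene would cover variants where every annotated genotype is ref/ref.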