Improve phenotype EFO mappings

apriltuesday commented 7 months ago

Refer to opentargets/issues#3149 for context. Tasks on our side:

Look at OnToma’s recipes for automated mappings (see documentation)
Try exact-match search using OLS
Possibly set up a meeting with the SPOT team to discuss how best to use ZOOMA
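
For the exact-match idea, here is a minimal sketch of what a lookup might look like. It assumes the OLS4 search endpoint at `https://www.ebi.ac.uk/ols4/api/search` accepts `q`, `ontology` and `exact` parameters and returns a Solr-style payload with `response.docs`. The function names are hypothetical and this is not necessarily how the pipeline does it.

```python
import json
import urllib.parse
import urllib.request

# Assumed OLS4 search endpoint; check the OLS documentation before relying on it.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_exact_search_url(term, ontology="efo"):
    """Build a search URL asking OLS for exact matches within one ontology."""
    params = {"q": term, "ontology": ontology, "exact": "true"}
    return OLS_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def exact_label_matches(term, payload):
    """Return IRIs of docs whose label equals the term, ignoring case."""
    docs = payload.get("response", {}).get("docs", [])
    wanted = term.strip().lower()
    return [
        doc["iri"]
        for doc in docs
        if "iri" in doc and doc.get("label", "").strip().lower() == wanted
    ]


def search_ols_exact(term, ontology="efo"):
    """Query OLS over the network and return the exactly matching IRIs."""
    url = build_exact_search_url(term, ontology)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return exact_label_matches(term, json.load(resp))
```

Keeping the payload parsing separate from the network call makes it easy to check the matching logic against a saved response.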

apriltuesday commented 6 months ago

Note that Zooma's exact ontology matches might not come back as high confidence with our current query. Modifying the query might get us more automated mappings for PGKB and perhaps also ClinVar. See the Zooma documentation here.
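
To illustrate the confidence issue, here is a rough sketch of filtering Zooma results by confidence on the client side. It assumes the annotate service returns a JSON list whose items carry a `confidence` string (`HIGH`, `GOOD`, `MEDIUM` or `LOW`) and a `semanticTags` list of IRIs. The `min_confidence` knob is hypothetical, and relaxing it is only one possible reading of "modifying the query".

```python
# Confidence labels assumed to be returned by Zooma, strongest first.
CONFIDENCE_ORDER = ["HIGH", "GOOD", "MEDIUM", "LOW"]


def select_mappings(annotations, min_confidence="HIGH"):
    """Collect ontology IRIs from annotations at or above a confidence level.

    `annotations` is the parsed JSON list from an annotate call (assumed shape).
    """
    cutoff = CONFIDENCE_ORDER.index(min_confidence)
    selected = []
    for annotation in annotations:
        confidence = annotation.get("confidence", "LOW")
        if confidence not in CONFIDENCE_ORDER:
            continue  # skip labels we do not recognise
        if CONFIDENCE_ORDER.index(confidence) <= cutoff:
            for iri in annotation.get("semanticTags", []):
                if iri not in selected:
                    selected.append(iri)
    return selected
```

With the default, only `HIGH` annotations are kept. Passing `min_confidence="GOOD"` would also accept the next tier, at the cost of more mappings that need checking.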

apriltuesday commented 5 months ago

Ran on the same dataset as the 23.12 submission, using the mentioned PR and the other recent changes.

Total clinical annotations: 5073
        With RS: 4477 (88.25%)
                1. Exploded by allele: 13497 (3.0x)
                2. Exploded by PGx category: 13798 (1.0x)
                3. Exploded by drug: 19238 (1.4x)
                4. Exploded by phenotype: 23576 (1.2x)
Total evidence strings: 25963
        With CHEBI: 21668 (83.46%)
        With EFO phenotype: 10938 (42.13%)
        With functional consequence: 23842 (91.83%)
        With VEP gene: 23842 (91.83%)
Gene comparisons per annotation
        With PGKB genes: 4220 (83.19%)
        With VEP genes: 4097 (80.76%)
        PGKB genes != VEP genes: 772 (15.22%)
Total RS: 2794
        With parsed alleles: 2771 (99.18%)
                With >2 alleles: 31 (1.12%)

EFO coverage is better (33% -> 42%) but still not amazing, though the cystic fibrosis term highlighted in the OT issue is fixed.
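
For reference, the percentages and explode factors in the report are plain ratios against the parent count. A small sketch reproducing two of them from the numbers above (the helper names are hypothetical, not taken from the pipeline code):

```python
def percent(part, total):
    """Format part/total as a percentage with two decimals, as in the report."""
    return f"{100 * part / total:.2f}%"


def explode_factor(after, before):
    """Format the growth factor between two consecutive explode steps."""
    return f"{after / before:.1f}x"


total_evidence = 25963
with_efo = 10938
print(percent(with_efo, total_evidence))  # 42.13%
print(explode_factor(13497, 4477))        # 3.0x
```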

I've dumped the unmapped phenotype terms in a spreadsheet here. Perhaps we can look at synonyms or terms provided by PGKB, but I'm also wondering whether some of these very generic terms (e.g. "adverse events") only occur in evidence that OT filters out anyway.

apriltuesday commented 5 months ago

cc @M-casado @tcezard

apriltuesday commented 5 months ago

With the explicit OLS check added, we bump up to 48.57%:

...
Total evidence strings: 25963
        With CHEBI: 21668 (83.46%)
        With EFO phenotype: 12610 (48.57%)
        With functional consequence: 23842 (91.83%)
        With VEP gene: 23842 (91.83%)
...

I've updated the list of unmapped terms as well. As Tim pointed out in the meeting, many of the more generic terms (e.g. "adverse events") seem to occur in combination with other phenotypes, a context in which they might make more sense.
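
One way to check this systematically would be to split the unmapped terms by whether they ever appear as the only phenotype on an annotation. A minimal sketch, assuming each annotation can be reduced to a list of phenotype strings (the function name and input shape are hypothetical):

```python
def split_by_cooccurrence(unmapped_terms, annotation_phenotypes):
    """Split unmapped terms into those seen alone and those only seen with others.

    `annotation_phenotypes` is a list of phenotype lists, one per annotation.
    Returns (seen_alone, only_with_others), both sorted and lower-cased.
    """
    unmapped = {term.lower() for term in unmapped_terms}
    seen_alone = set()
    seen_with_others = set()
    for phenotypes in annotation_phenotypes:
        present = {p.lower() for p in phenotypes}
        for term in present & unmapped:
            if len(present) > 1:
                seen_with_others.add(term)
            else:
                seen_alone.add(term)
    # Terms that never appear on their own are the "context-dependent" ones.
    only_with_others = seen_with_others - seen_alone
    return sorted(seen_alone), sorted(only_with_others)
```

Terms in the second list are candidates for being handled through their companion phenotypes rather than mapped on their own.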

apriltuesday commented 5 months ago

Also cc @tskir, in case you are interested in the unmapped terms in particular.

EBIvariation / opentargets-pharmgkb

Improve phenotype EFO mappings #30