Open apriltuesday opened 7 months ago
Ran on the same dataset as the 23.12 submission, using the mentioned PR and the other recent changes.
Total clinical annotations: 5073
With RS: 4477 (88.25%)
1. Exploded by allele: 13497 (3.0x)
2. Exploded by PGx category: 13798 (1.0x)
3. Exploded by drug: 19238 (1.4x)
4. Exploded by phenotype: 23576 (1.2x)
Total evidence strings: 25963
With CHEBI: 21668 (83.46%)
With EFO phenotype: 10938 (42.13%)
With functional consequence: 23842 (91.83%)
With VEP gene: 23842 (91.83%)
Gene comparisons per annotation
With PGKB genes: 4220 (83.19%)
With VEP genes: 4097 (80.76%)
PGKB genes != VEP genes: 772 (15.22%)
Total RS: 2794
With parsed alleles: 2771 (99.18%)
With >2 alleles: 31 (1.12%)
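As a sanity check on the summary above, here is a minimal sketch that recomputes the explosion multipliers and coverage percentages from the raw counts. The helper names (`ratio`, `pct`) are made up for illustration and are not taken from the pipeline code; the counts are copied from the output above.

```python
# Hypothetical helpers for re-deriving the figures in the summary;
# these names are illustrative, not the pipeline's own.

def ratio(new_count: int, old_count: int) -> str:
    """Growth factor between two consecutive explosion steps, e.g. '3.0x'."""
    return f"{new_count / old_count:.1f}x"


def pct(part: int, total: int) -> str:
    """Share of `part` in `total` as a percentage with two decimals."""
    return f"{100 * part / total:.2f}%"


# Counts copied from the summary output.
steps = [
    ("With RS", 4477),
    ("Exploded by allele", 13497),
    ("Exploded by PGx category", 13798),
    ("Exploded by drug", 19238),
    ("Exploded by phenotype", 23576),
]

# Each multiplier is relative to the previous step, not to the starting count.
multipliers = {
    name: ratio(count, steps[i][1])
    for i, (name, count) in enumerate(steps[1:])
}

if __name__ == "__main__":
    for name, factor in multipliers.items():
        print(f"{name}: {factor}")
    print("With RS:", pct(4477, 5073))
    print("With EFO phenotype:", pct(10938, 25963))
```

Each multiplier is computed against the step directly before it, which is why "Exploded by PGx category" shows 1.0x even though the cumulative growth from the RS-bearing annotations is larger.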
EFO coverage is better (33% -> 42%) but still not amazing, though the cystic fibrosis term highlighted in the OT issue is fixed.
I've dumped the unmapped phenotype terms in a spreadsheet here. Perhaps we can look at synonyms or terms provided by PGKB, but I'm also wondering whether some of these super generic terms (e.g. "adverse events") are in evidence that OT filters out anyway.
cc @M-casado @tcezard
With the explicit OLS check added, we bump up to 48.57%:
...
Total evidence strings: 25963
With CHEBI: 21668 (83.46%)
With EFO phenotype: 12610 (48.57%)
With functional consequence: 23842 (91.83%)
With VEP gene: 23842 (91.83%)
...
I've updated the list of unmapped terms as well. As Tim pointed out in the meeting, many of the more generic terms (e.g. "adverse events") seem to occur in combination with other phenotypes, and they might make more sense in that context.
Also cc @tskir, in case you are interested in the unmapped terms in particular.
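For anyone wanting to poke at the unmapped terms by hand: the thread does not spell out what the explicit OLS check does, so the following is only a rough sketch of one plausible reading, an exact-label lookup against EFO through the OLS search endpoint. The endpoint URL, the query parameters (`q`, `ontology`, `exact`), the response shape (`response.docs[].label` / `iri`), and all function names here are assumptions for illustration, not the pipeline's actual implementation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed OLS search endpoint; adjust if the deployment differs.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_query(term: str, ontology: str = "efo") -> str:
    """Build a search URL asking for exact matches of `term` in one ontology."""
    params = {"q": term, "ontology": ontology, "exact": "true"}
    return f"{OLS_SEARCH_URL}?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Default fetcher: plain HTTP GET, parsed as JSON."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def exact_efo_match(term: str, fetch=fetch_json):
    """Return the IRI of the first hit whose label equals `term`
    (case-insensitive), or None if there is no such hit.

    `fetch` is injectable so the lookup can be exercised without network access.
    """
    payload = fetch(build_query(term))
    docs = payload.get("response", {}).get("docs", [])
    wanted = term.strip().lower()
    for doc in docs:
        if doc.get("label", "").strip().lower() == wanted:
            return doc.get("iri")
    return None


if __name__ == "__main__":
    # Offline demo with a canned response; the IRI is a placeholder, not a real EFO ID.
    canned = {"response": {"docs": [
        {"label": "Cystic Fibrosis", "iri": "http://example.org/PLACEHOLDER_0001"},
    ]}}
    print(exact_efo_match("cystic fibrosis", fetch=lambda url: canned))
```

Comparing labels again on the client side is deliberately conservative: it avoids counting a term as mapped when the search only returned a loosely related hit.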
Refer to opentargets/issues#3149 for context. Tasks on our side: