biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

convert smartAPI annotations to explicitly use biolink classes #373

Closed andrewsu closed 1 year ago

andrewsu commented 2 years ago

In https://github.com/biothings/BioThings_Explorer_TRAPI/issues/370, we analyzed the list of 46 APIs currently listed in the config.js file. Among those APIs, there are at least 924 operations in which the subject and object classes do not use the biolink: prefix:

$ cat predicates.csv smartapi.csv | sort -u | wc
   4969   19204  665620
$ cat predicates.csv smartapi.csv | sort -u | grep -v 'biolink:' | wc
    924    2778   94300

Those 924 operations come from 24 unique APIs (actually 23 when removing the header line):

$ cat predicates.csv smartapi.csv | sort -u | grep -v 'biolink:' | gawks '{print $2}' | sort | uniq | wc
     24      71     509

... and the ranked list of "offenders" is here:

$ cat predicates.csv smartapi.csv | sort -u | grep -v 'biolink:' | gawks '{print $2}' | sort | uniq -c | sort -k1nr
    765 BioThings SEMMEDDB API
     24 Clinical Risk KP API
     18 Multiomics Wellness KP API
     17 MyDisease.info API
     14 BioLink API
     12 MyChem.info API
     11 MyGene.info API
     10 Gene Ontology Biological Process API
     10 MyVariant.info API
      9 UBERON Ontology API
      7 BioThings iDISK API
      4 Gene Ontology Cellular Component API
      4 Gene Ontology Molecular Activity API
      4 MGIgene2phenotype API
      2 BioThings DGIdb API
      2 DISEASES API
      2 EBI Proteins API
      2 EBIgene2phenotype API
      2 Human Phenotype Ontology API
      1 LINCS Data Portal API
      1 LitVar API
      1 Ontology Lookup Service API
      1 QuickGO API
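
The `gawks` command above looks like a local alias and is not a standard tool. A portable sketch of the same per-API tally follows, assuming the alias behaves like `awk -F,` on the six-column CSV layout. Unlike `grep -v 'biolink:'`, which drops any row containing the prefix anywhere, this version checks only the subject (column 3) and object (column 5). The function name `count_unprefixed` and the "Example Prefixed API" row are made up for illustration.

```shell
# Sample rows in the six-column layout: index, API name, subject class,
# predicate, object class, server URL. The last row is invented for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0,BioThings SEMMEDDB API,Cell,affected_by,Gene,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,affects,PhysiologicalProcess,https://biothings.ncats.io/semmeddb
0,Clinical Risk KP API,Disease,negatively_correlated_with,Disease,https://biothings.ncats.io/clinical_risk_kp
0,Example Prefixed API,biolink:Gene,biolink:related_to,biolink:Disease,https://example.org/api
EOF

# Count, per API, the unique rows whose subject or object class lacks
# the biolink: prefix, most frequent API first.
count_unprefixed() {
  sort -u "$@" |
    awk -F, '$3 !~ /^biolink:/ || $5 !~ /^biolink:/ { print $2 }' |
    sort | uniq -c | sort -k1,1nr
}

result=$(count_unprefixed "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Against the real data this would be called as `count_unprefixed predicates.csv smartapi.csv`. Splitting on commas assumes no field contains an embedded comma, and a header line, if present, would still be counted as a row.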

For example, both the "BioThings SEMMEDDB API" and the "Clinical Risk KP API" seem to include many operations whose classes and predicates do not use the biolink: prefix (columns in order are: an index, the API name, the subject class, the predicate, the object class, and the server URL):

$ cat predicates.csv smartapi.csv | sort -u | gawks '$2=="BioThings SEMMEDDB API"' | grep -v 'biolink:' | head
0,BioThings SEMMEDDB API,Cell,affected_by,ChemicalEntity,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,affected_by,Gene,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,affected_by,PhysiologicalProcess,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,affected_by,Polypeptide,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,affected_by,SmallMolecule,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,affects,PhysiologicalProcess,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,contains_process,MolecularActivity,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,contains_process,PhysiologicalProcess,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,disrupted_by,ChemicalEntity,https://biothings.ncats.io/semmeddb
0,BioThings SEMMEDDB API,Cell,disrupted_by,Disease,https://biothings.ncats.io/semmeddb
$ cat predicates.csv smartapi.csv | sort -u | gawks '$2=="Clinical Risk KP API"' | grep -v 'biolink:' | head
0,Clinical Risk KP API,Disease,has_real_world_evidence_of_association_with,Disease,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,has_real_world_evidence_of_association_with,PhenotypicFeature,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,has_real_world_evidence_of_association_with,Procedure,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,has_real_world_evidence_of_association_with,SmallMolecule,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,negatively_correlated_with,Disease,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,negatively_correlated_with,PhenotypicFeature,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,negatively_correlated_with,Procedure,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,Disease,negatively_correlated_with,SmallMolecule,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,PhenotypicFeature,has_real_world_evidence_of_association_with,Disease,https://biothings.ncats.io/clinical_risk_kp
0,Clinical Risk KP API,PhenotypicFeature,has_real_world_evidence_of_association_with,PhenotypicFeature,https://biothings.ncats.io/clinical_risk_kp

This issue is to track discussion on the rationale of this setup and, if appropriate, the fixes.

colleenXu commented 2 years ago

Note that there currently isn't an error/bug, so this isn't an urgent issue.

We want a dev to track down what is happening with the biolink prefix on node categories/predicates - when it is added and removed - and figure out a more consistent approach. Note that this involves multiple modules of BTE.

What Andrew is highlighting above is the inconsistency in how we are ingesting metaKG/TRAPI-meta-knowledge-graph info. This happens because...

Related code is hard to read and understand, with the prefix being added and removed in different places.


ideas:

andrewsu commented 1 year ago

From @tokebe: It appears that BTE removes the biolink prefix from all input/output types, for both TRAPI and non-TRAPI.

I still think it would be cleaner to standardize our practices here, but since the current solution doesn't appear to have introduced a significant amount of cruft in our code, this issue is low enough priority to close...
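
One way to read "standardize our practices" is to route every prefix change through a single pair of helpers, so the prefix is never added or removed ad hoc. A minimal shell sketch of that idea, written in the same language as the commands earlier in this thread; `strip_biolink` and `add_biolink` are hypothetical names for illustration and are not taken from BTE's code:

```shell
# Hypothetical helpers, not BTE code. Both are idempotent, so callers do not
# need to know whether the incoming value already carries the prefix.

# Remove a leading "biolink:" if present; otherwise return the value unchanged.
strip_biolink() {
  printf '%s\n' "${1#biolink:}"
}

# Ensure exactly one leading "biolink:" on the value.
add_biolink() {
  printf 'biolink:%s\n' "${1#biolink:}"
}

strip_biolink "biolink:Gene"
add_biolink "Gene"
add_biolink "biolink:Gene"
```

Because each helper first strips any existing prefix, applying either one twice gives the same result as applying it once.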