CTD processing 3: handling output IDs when multiple ID prefixes are possible

colleenXu commented 1 year ago

Intro: see intro section of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/583#issue-1622873383. Originally noted in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/558#issuecomment-1459097534

3. handling output IDs when multiple ID prefixes are possible

Some operations are commented out because BTE doesn't properly handle output IDs when multiple ID prefixes are possible. This happens when the output is a disease ID (which can be MESH or OMIM) or a pathway ID (which can be REACT or KEGG).

For example, BTE will fail to recognize that the API response returned both MESH and OMIM Disease IDs and will instead give every Disease ID the single ID prefix declared by the operation.

Edit SmartAPI yaml + run BTE locally

In a local copy of the [SmartAPI yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/CTD/smartapi.yaml), uncomment the `chemical2disease_1` and `chemical2disease_2` operations (lines 125, 127, 211-231, 253-273, 548-551, 556-559). Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):

```
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:D004317"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            }
        }
    }
}
```
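For anyone scripting the reproduction, the request above could be prepared roughly like this. It is only a sketch: the base URL `http://localhost:3000` is an assumption about the local BTE setup, the helper names are made up for illustration, and the SmartAPI ID has to be supplied by the caller.

```python
import json
import urllib.request

# Assumption: a local BTE instance listening here; adjust to your setup.
BTE_BASE_URL = "http://localhost:3000"


def build_query(chemical_id: str) -> dict:
    """Build the one-hop TRAPI query from this issue for a given chemical CURIE."""
    return {
        "message": {
            "query_graph": {
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:related_to"],
                    }
                },
                "nodes": {
                    "n0": {
                        "ids": [chemical_id],
                        "categories": ["biolink:SmallMolecule"],
                    },
                    "n1": {"categories": ["biolink:Disease"]},
                },
            }
        }
    }


def build_request(smartapi_id: str, chemical_id: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST to the API-specific endpoint."""
    url = f"{BTE_BASE_URL}/v1/smartapi/{smartapi_id}/query"
    body = json.dumps(build_query(chemical_id)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of `build_request(...)` to `urllib.request.urlopen` would send it, with the first argument being the SmartAPI ID of the CTD API and the second `"MESH:D004317"`.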

CTD's raw response

During execution, BTE should generate [this query](http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=D004317&inputTermSearchType=directAssociations&report=diseases_curated&format=json) to CTD. In CTD's raw response, some Disease IDs are MESH like `MESH:D015746` / Abdominal Pain and others are OMIM like `OMIM:610141` / QT INTERVAL, VARIATION IN.

```
  {
    "CasRN": "23214-92-8",
    "ChemicalID": "D004317",
    "ChemicalName": "Doxorubicin",
    "DirectEvidence": "marker/mechanism",
    "DiseaseCategories": "Signs and symptoms",
    "DiseaseID": "MESH:D015746",
    "DiseaseName": "Abdominal Pain",
    "Input": "d004317",
    "PubMedIDs": "3712578|6542584"
  },
  {
    "CasRN": "23214-92-8",
    "ChemicalID": "D004317",
    "ChemicalName": "Doxorubicin",
    "DirectEvidence": "marker/mechanism",
    "DiseaseCategories": "Cardiovascular disease|Pathology (process)",
    "DiseaseID": "OMIM:610141",
    "DiseaseName": "QT INTERVAL, VARIATION IN",
    "Input": "d004317",
    "PubMedIDs": "12597018|7919046"
  },
```
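A quick way to confirm the mix of prefixes in a CTD response is to tally them. The sketch below uses only the two records quoted above (trimmed to the relevant fields); `count_prefixes` is a hypothetical helper, not part of BTE.

```python
from collections import Counter

# Two records trimmed from CTD's raw response quoted above.
ctd_records = [
    {"DiseaseID": "MESH:D015746", "DiseaseName": "Abdominal Pain"},
    {"DiseaseID": "OMIM:610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
]


def count_prefixes(records, field="DiseaseID"):
    """Tally the CURIE prefix (text before the first colon) of each record's ID."""
    return Counter(rec[field].split(":", 1)[0] for rec in records)


prefix_counts = count_prefixes(ctd_records)
print(dict(prefix_counts))
```

Any result with more than one key means a single operation with one declared output prefix cannot describe the response correctly.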

BTE's current flawed response

BTE will run the operation with MESH Disease outputs and find the OMIM ID in CTD's response. It'll then strip the OMIM prefix off and assign the ID as a MESH ID. This results in a flawed record: an edge to `MESH:610141` (when the original ID was `OMIM:610141` / QT INTERVAL, VARIATION IN).

```
"034d46e56b095c750619cc51ee2cb1bf": {
    "predicate": "biolink:related_to",
    "subject": "PUBCHEM.COMPOUND:31703",
    "object": "MESH:610141",
```

It'll then do the same with the operation with OMIM Disease outputs and the MESH IDs in CTD's response. So there'll be flawed records with edges like this one to `OMIM:D015746` (when the original ID was `MESH:D015746` / Abdominal Pain).

```
"0ca3845c254a304918933c581de85ae4": {
    "predicate": "biolink:related_to",
    "subject": "PUBCHEM.COMPOUND:31703",
    "object": "OMIM:D015746",
```
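To make the failure mode concrete, here is a small model of the behavior described above next to a prefix-aware alternative. This is not BTE's actual code; both function names are hypothetical, and the "skip on mismatch" rule is just one possible fix.

```python
from typing import Optional


def naive_output_id(raw_id: str, operation_prefix: str) -> str:
    """Model of the flawed behavior: drop the returned prefix, force the operation's."""
    local_id = raw_id.split(":", 1)[-1]
    return f"{operation_prefix}:{local_id}"


def prefix_aware_output_id(raw_id: str, operation_prefix: str) -> Optional[str]:
    """Keep the ID only if its own prefix matches the operation's; otherwise skip it."""
    prefix, _, _local_id = raw_id.partition(":")
    return raw_id if prefix == operation_prefix else None


# The MESH-output operation meeting an OMIM ID from CTD:
print(naive_output_id("OMIM:610141", "MESH"))
print(prefix_aware_output_id("OMIM:610141", "MESH"))
```

With the prefix-aware version, each operation would only produce edges for the IDs that really belong to its declared prefix, so the MESH and OMIM operations together would cover the response without producing mislabeled IDs.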

I think Biolink API / Monarch's post-query processing + SmartAPI yaml response-mapping (which is to fields that exist only after the post-processing) is able to handle this situation, so maybe a solution like that will work here. However, it's perhaps not ideal that multiple operations are written + the same query is done repeatedly for different post-query processing.
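A rough sketch of what such post-query processing might look like for CTD: copy each ID into a field named after its prefix, so that each operation's response-mapping can point at its own field. The derived field names (`DiseaseID_MESH`, `DiseaseID_OMIM`) are hypothetical, and I haven't checked that the Biolink API / Monarch processing works this way.

```python
def split_ids_by_prefix(record: dict, field: str = "DiseaseID") -> dict:
    """Return a copy of the record with the local ID added under a per-prefix key.

    The derived key names (e.g. "DiseaseID_MESH") are hypothetical.
    """
    out = dict(record)
    prefix, sep, local_id = record[field].partition(":")
    if sep:  # only add a derived field when a prefix is actually present
        out[f"{field}_{prefix}"] = local_id
    return out


processed = [
    split_ids_by_prefix(rec)
    for rec in (
        {"DiseaseID": "MESH:D015746", "DiseaseName": "Abdominal Pain"},
        {"DiseaseID": "OMIM:610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
    )
]
```

The MESH operation would then map its output to `DiseaseID_MESH` and the OMIM operation to `DiseaseID_OMIM`; records lacking the mapped field would simply yield no output for that operation. The same query is still run once per operation, which is the downside noted above.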

This problem is related to past discussions on supporting multiple ID prefixes/namespaces as input / output. I'm not sure how much refactoring of code / x-bte annotation would be needed for a general solution...