biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

CTD processing 3: handling output IDs when multiple ID prefixes are possible #585

Open colleenXu opened 1 year ago

colleenXu commented 1 year ago

Intro: see intro section of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/583#issue-1622873383. Originally noted in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/558#issuecomment-1459097534

3. handling output IDs when multiple ID prefixes are possible

Some operations are commented-out because BTE isn't properly handling the output IDs when multiple ID prefixes are possible. This happens when the output is a disease ID (which can be MESH or OMIM) or a pathway ID (which can be REACT or KEGG).

For example, BTE will fail to recognize that the API response returned both MESH and OMIM Disease IDs and will instead assign all the Disease IDs to the one ID-prefix assigned by the operation.

Edit SmartAPI yaml + run BTE locally

In a local copy of the [SmartAPI yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/CTD/smartapi.yaml), uncomment the `chemical2disease_1` and `chemical2disease_2` operations (lines 125, 127, 211-231, 253-273, 548-551, 556-559). Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):

```
{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"]
        }
      },
      "nodes": {
        "n0": {
          "ids": ["MESH:D004317"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      }
    }
  }
}
```

CTD's raw response

During execution, BTE should generate [this query](http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=D004317&inputTermSearchType=directAssociations&report=diseases_curated&format=json) to CTD. In CTD's raw response, some Disease IDs are MESH like `MESH:D015746` / Abdominal Pain and others are OMIM like `OMIM:610141` / QT INTERVAL, VARIATION IN.

```
{
  "CasRN": "23214-92-8",
  "ChemicalID": "D004317",
  "ChemicalName": "Doxorubicin",
  "DirectEvidence": "marker/mechanism",
  "DiseaseCategories": "Signs and symptoms",
  "DiseaseID": "MESH:D015746",
  "DiseaseName": "Abdominal Pain",
  "Input": "d004317",
  "PubMedIDs": "3712578|6542584"
},
{
  "CasRN": "23214-92-8",
  "ChemicalID": "D004317",
  "ChemicalName": "Doxorubicin",
  "DirectEvidence": "marker/mechanism",
  "DiseaseCategories": "Cardiovascular disease|Pathology (process)",
  "DiseaseID": "OMIM:610141",
  "DiseaseName": "QT INTERVAL, VARIATION IN",
  "Input": "d004317",
  "PubMedIDs": "12597018|7919046"
},
```

BTE's current flawed response

BTE will do the operation with MESH-Disease-outputs and find the OMIM ID in CTD's response. It'll then strip the OMIM ID prefix off, and then assign it as a MESH ID. This will result in a flawed record -> Edge to `MESH:610141` (when the original ID was `OMIM:610141` / QT INTERVAL, VARIATION IN).

```
"034d46e56b095c750619cc51ee2cb1bf": {
  "predicate": "biolink:related_to",
  "subject": "PUBCHEM.COMPOUND:31703",
  "object": "MESH:610141",
```

Then it'll do similar behavior with the operation with OMIM-Disease outputs and the MESH IDs in CTD's response. So there'll be flawed records -> Edges like this to `OMIM:D015746` (when the original ID was `MESH:D015746` / Abdominal Pain).

```
"0ca3845c254a304918933c581de85ae4": {
  "predicate": "biolink:related_to",
  "subject": "PUBCHEM.COMPOUND:31703",
  "object": "OMIM:D015746",
```
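
The mis-assignment described above can be reproduced in a few lines. This is only a minimal Python sketch of the described behavior, not BTE's actual code: `relabel_with_operation_prefix` is a hypothetical helper that drops whatever prefix the API returned and attaches the single output prefix declared by the operation.

```python
def relabel_with_operation_prefix(raw_id: str, operation_prefix: str) -> str:
    """Mimic the flawed behavior: drop the returned prefix, attach the operation's.

    Hypothetical helper for illustration only.
    """
    # Keep only the part after the first colon (the local identifier).
    local_id = raw_id.split(":", 1)[-1]
    return f"{operation_prefix}:{local_id}"


# Disease IDs taken from CTD's raw response shown above.
ctd_disease_ids = ["MESH:D015746", "OMIM:610141"]

# The operation with MESH Disease outputs mislabels the OMIM ID.
mesh_operation = [relabel_with_operation_prefix(i, "MESH") for i in ctd_disease_ids]
# The operation with OMIM Disease outputs mislabels the MESH ID.
omim_operation = [relabel_with_operation_prefix(i, "OMIM") for i in ctd_disease_ids]

print(mesh_operation)  # ['MESH:D015746', 'MESH:610141']
print(omim_operation)  # ['OMIM:D015746', 'OMIM:610141']
```

Each operation yields one correct ID and one ID with the wrong namespace, which matches the two flawed edges shown above.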

I think Biolink API / Monarch's post-query processing + SmartAPI yaml response-mapping (which is to fields that exist only after the post-processing) is able to handle this situation, so maybe a solution like that will work here. However, it's perhaps not ideal that multiple operations are written + the same query is done repeatedly for different post-query processing.

This problem is related to past discussions on supporting multiple ID prefixes/namespaces as input / output. I'm not sure how much refactoring of code / x-bte annotation would be needed for a general solution...

rjawesome commented 1 year ago

We could add a "hasPrefix" option under the output in the yaml. If it is set to true, BTE would expect the ID prefix at the beginning of each output ID (i.e. it would parse "MESH:0000" and see that it is a "MESH" ID) instead of using the output ID type from the SmartAPI yaml. In that case, instead of labeling the output ID by its ID type in the response mapping (like "MESH"), it could be given a generic name (like "OUTPUT"). This should at least fix the output problem. I could work on this feature.
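
A rough Python sketch of how such a flag might change ID resolution. This is an illustration under stated assumptions, not the code on any BTE branch: `resolve_output_id` and its parameters are hypothetical names, and the fallback branch simply mirrors the current behavior described in this issue.

```python
from typing import Tuple


def resolve_output_id(raw_id: str, declared_prefix: str, has_prefix: bool) -> Tuple[str, str]:
    """Return (prefix, full_id) for one output ID (hypothetical helper)."""
    if has_prefix and ":" in raw_id:
        # hasPrefix: trust the prefix embedded in the ID itself.
        prefix = raw_id.split(":", 1)[0]
        return prefix, raw_id
    # Otherwise mirror the behavior described in this issue:
    # strip any prefix and attach the one declared by the operation.
    local_id = raw_id.split(":", 1)[-1]
    return declared_prefix, f"{declared_prefix}:{local_id}"


print(resolve_output_id("OMIM:610141", "MESH", has_prefix=True))   # ('OMIM', 'OMIM:610141')
print(resolve_output_id("OMIM:610141", "MESH", has_prefix=False))  # ('MESH', 'MESH:610141')
```

With the flag on, the OMIM ID keeps its namespace even though the operation declares MESH; with it off, the flawed relabeling reappears.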

rjawesome commented 1 year ago

Work started on multiple-prefixes branch of smartapi-kg and api-response-transform

colleenXu commented 1 year ago

@rjawesome sorry for the late reply. I'm having trouble understanding your proposal...could you provide an example of x-bte annotation edits you're proposing?

And as I rethink this issue, I wonder if some discussion would help:

rjawesome commented 1 year ago

My proposal would just be in the yaml outputs section, like so (only for outputs, not inputs):

```
outputs:
- semantic: Disease
  hasPrefix: true
```

Then in the response mapping you would put OUTPUT instead of a prefix (like MESH), i.e.

```
chemical2disease_1:
  OUTPUT: data.DiseaseID          ## HAS prefix, the ID type will be determined by the prefix
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
```

rjawesome commented 1 year ago

Another feature could be added here, since outputs is already an array: if each output ID type is found in a different field of the JSON, you could add one output per ID type. For example, the operation outputs could look like:

```
outputs:
- semantic: Disease
  id: MESH
- semantic: Disease
  id: OMIM
```

Then the response mapping would look like:

```
chemical2disease_1:
  OMIM: data.DiseaseIDomim    ## omim disease id is located here in the json from api
  MESH: data.DiseaseIDmesh    ## mesh disease id is located here in the json from api
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
```

In this case, BTE could split this into two operations, or, when the response is received from the API, it could just check which ID type exists in the output.
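
The second option (checking which ID type exists in each record) could be sketched like this in Python. The field names `DiseaseIDomim` and `DiseaseIDmesh` come from the example mapping above and are illustrative only (CTD's actual response uses a single `DiseaseID` field); `extract_output_ids` is a hypothetical helper, not part of BTE.

```python
# Prefix -> JSON field, following the example response mapping above.
RESPONSE_MAPPING = {"OMIM": "DiseaseIDomim", "MESH": "DiseaseIDmesh"}


def extract_output_ids(record: dict, mapping: dict = RESPONSE_MAPPING) -> list:
    """Build prefixed IDs from whichever mapped fields are present in a record."""
    ids = []
    for prefix, field in mapping.items():
        value = record.get(field)
        if value:  # skip fields that are missing or empty
            ids.append(f"{prefix}:{value}")
    return ids


print(extract_output_ids({"DiseaseIDmesh": "D015746"}))  # ['MESH:D015746']
print(extract_output_ids({"DiseaseIDomim": "610141"}))   # ['OMIM:610141']
```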

rjawesome commented 1 year ago

The first feature (hasPrefix) is currently working in multiple-prefixes branch

rjawesome commented 1 year ago

Another way to solve this problem could be to use JQ post-processing plus the hasPrefix/OUTPUT feature. JQ post-processing could move all the IDs into the same field in the JSON before response mapping. So the operation could have:

```
transformers:
  wrap_jq: '{data: [.[] | if .DiseaseIDomim then .DiseaseID = "OMIM:" + .DiseaseIDomim else . end | if .DiseaseIDmesh then .DiseaseID = "MESH:" + .DiseaseIDmesh else . end]}'
```

Then the usage of hasPrefix and OUTPUT in the response mapping would be exactly the same as the first proposal.
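
For readers less familiar with jq, here is a rough Python equivalent of what that expression is meant to do. It is a sketch for illustration, not part of BTE: `merge_disease_ids` is a hypothetical name, and the `is not None` checks only approximate jq's truthiness rules.

```python
def merge_disease_ids(records: list) -> dict:
    """Copy per-namespace ID fields into one prefixed DiseaseID field."""
    merged = []
    for record in records:
        record = dict(record)  # shallow copy so the input is not mutated
        if record.get("DiseaseIDomim") is not None:
            record["DiseaseID"] = "OMIM:" + record["DiseaseIDomim"]
        if record.get("DiseaseIDmesh") is not None:
            # As in the jq expression, MESH overwrites OMIM if both are present.
            record["DiseaseID"] = "MESH:" + record["DiseaseIDmesh"]
        merged.append(record)
    return {"data": merged}


result = merge_disease_ids([{"DiseaseIDomim": "610141"}, {"DiseaseIDmesh": "D015746"}])
print([r["DiseaseID"] for r in result["data"]])  # ['OMIM:610141', 'MESH:D015746']
```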

colleenXu commented 1 year ago

@rjawesome could you pause your work on this particular issue? and keep the work specific to this issue on a separate branch from use-jmes-path (maybe you've already done this with multiple-prefixes)?

After talking with @tokebe, we agreed that there are some larger-scale issues that still have to be worked out, like:

so I plan to write an issue and start discussions on that. I think after those discussions, it'll be clearer what the actual requirements / behavior we want for this issue is...

[EDIT: oh, one thing for sure is that in this use case and similar situations (one field, multiple ID prefixes), processing of the raw API response WILL BE REQUIRED to organize the IDs by namespace]
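
As a minimal illustration of that kind of processing, the Python sketch below groups the IDs from one field by their namespace. It assumes every ID in the field is a prefixed CURIE such as `MESH:D015746`; `group_ids_by_namespace` is a hypothetical helper, not an existing BTE function.

```python
from collections import defaultdict


def group_ids_by_namespace(records: list, field: str = "DiseaseID") -> dict:
    """Group prefixed IDs found in `field` by their namespace (prefix)."""
    grouped = defaultdict(list)
    for record in records:
        raw_id = record.get(field)
        if not raw_id or ":" not in raw_id:
            continue  # skip records without a prefixed ID
        namespace = raw_id.split(":", 1)[0]
        grouped[namespace].append(raw_id)
    return dict(grouped)


# Two records trimmed down from CTD's raw response shown earlier in this issue.
raw_response = [
    {"DiseaseID": "MESH:D015746", "DiseaseName": "Abdominal Pain"},
    {"DiseaseID": "OMIM:610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
]
print(group_ids_by_namespace(raw_response))
# {'MESH': ['MESH:D015746'], 'OMIM': ['OMIM:610141']}
```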