Open colleenXu opened 1 year ago
We could have a "hasPrefix" option in the yaml under the output. If this is set to true, BTE will expect the ID prefix at the beginning of each output ID (i.e., it would parse "MESH:0000" and see that it is a "MESH" ID) instead of using the output ID type from the SmartAPI yaml. In this case, instead of the output ID being labeled by its ID type in the response mapping (like "MESH"), it could be named something generic (like "OUTPUT"). This should at least fix the output problem. I could work on this feature.
Work started on multiple-prefixes branch of smartapi-kg and api-response-transform
@rjawesome sorry for the late reply. I'm having trouble understanding your proposal...could you provide an example of x-bte annotation edits you're proposing?
And as I rethink this issue, I wonder if some discussion would help:
outputs section of the operation?

My proposal would just be in the yaml outputs section, like so (only for outputs, not inputs):
outputs:
  - semantic: Disease
    hasPrefix: true
Then in the response mapping you would put OUTPUT instead of a specific prefix (like MESH), i.e.
chemical2disease_1:
  OUTPUT: data.DiseaseID ## HAS prefix, the ID type will be determined by the prefix
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
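A minimal Python sketch of the parsing rule this proposal implies, not the code on the multiple-prefixes branch. The function name `resolve_output_id` and its parameters are hypothetical; only the `hasPrefix` behavior (read the ID type from the prefix, otherwise fall back to the type declared in the yaml) comes from the proposal above.

```python
def resolve_output_id(raw_id: str, declared_prefix: str, has_prefix: bool) -> tuple[str, str]:
    """Return (id_type, local_id) for one output ID from the API response."""
    if has_prefix:
        # hasPrefix: true -> the ID type is whatever precedes the first colon.
        prefix, sep, local_id = raw_id.partition(":")
        if not sep or not prefix or not local_id:
            raise ValueError(f"expected a prefixed ID, got {raw_id!r}")
        return prefix, local_id
    # hasPrefix unset -> trust the ID type declared in the operation's outputs.
    return declared_prefix, raw_id


print(resolve_output_id("MESH:D015746", "OUTPUT", True))   # ('MESH', 'D015746')
print(resolve_output_id("OMIM:610141", "OUTPUT", True))    # ('OMIM', '610141')
print(resolve_output_id("D015746", "MESH", False))         # ('MESH', 'D015746')
```

Raising on an ID without a prefix is one possible choice; skipping such records would be another.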
Another feature that could be added here, since outputs is already an array: if each output ID type lives in a different field of the JSON, you could add one output per ID type. For example, the operation outputs could look like
outputs:
  - semantic: Disease
    id: MESH
  - semantic: Disease
    id: OMIM
Then the response mapping would look like
chemical2disease_1:
  OMIM: data.DiseaseIDomim ## omim disease id is located here in the json from api
  MESH: data.DiseaseIDmesh ## mesh disease id is located here in the json from api
  ctd_chemical_disease_interaction_types: data.DirectEvidence
  pubmed: data.PubMedIDs
In this case, BTE could split this into two operations, or, when the response is received from the API, it could just check which ID type exists in the output.
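The second option (checking which ID type exists in each record) could be sketched in Python as follows. This is only an illustration: `ID_FIELDS` and `pick_output_id` are hypothetical names, and `DiseaseIDomim` / `DiseaseIDmesh` are the example field names from the mapping above, not fields CTD actually returns.

```python
# Hypothetical mapping taken from the example: ID type -> field in each API record.
ID_FIELDS = {"OMIM": "DiseaseIDomim", "MESH": "DiseaseIDmesh"}


def pick_output_id(record: dict, id_fields: dict = ID_FIELDS):
    """Return (id_type, value) for the first mapped field present in the record."""
    for id_type, field in id_fields.items():
        value = record.get(field)
        if value:
            return id_type, value
    return None  # no known ID field in this record


records = [
    {"DiseaseIDmesh": "D015746", "DirectEvidence": "marker/mechanism"},
    {"DiseaseIDomim": "610141", "DirectEvidence": "marker/mechanism"},
]
for rec in records:
    print(pick_output_id(rec))
```

If a record carried both fields, this sketch would return only the first match; emitting one output per field would be the alternative.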
The first feature (hasPrefix) is currently working in multiple-prefixes branch
Another way to solve this problem could be using jq post-processing + the hasPrefix/OUTPUT feature. jq post-processing could move all the IDs into the same field in the JSON before response mapping. So the operation could have
transformers:
  wrap_jq: '{data: [.[] | if .DiseaseIDomim then .DiseaseID = "OMIM:" + .DiseaseIDomim else . end | if .DiseaseIDmesh then .DiseaseID = "MESH:" + .DiseaseIDmesh else . end]}'
Then the usage of hasPrefix and OUTPUT in the response mapping would be exactly the same as the first proposal.
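For readers less familiar with jq, here is a rough Python equivalent of what that filter does to the response. It is a sketch for illustration only (the function names are hypothetical); it assumes the response is a list of records and mirrors jq's rule that only `null` and `false` count as falsy.

```python
def _jq_truthy(value) -> bool:
    # In jq, only null and false are falsy (empty strings and 0 are truthy).
    return value is not None and value is not False


def merge_disease_ids(records: list) -> dict:
    """Copy each record, moving whichever ID field exists into a prefixed DiseaseID."""
    merged = []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's record
        if _jq_truthy(rec.get("DiseaseIDomim")):
            rec["DiseaseID"] = "OMIM:" + rec["DiseaseIDomim"]
        if _jq_truthy(rec.get("DiseaseIDmesh")):
            rec["DiseaseID"] = "MESH:" + rec["DiseaseIDmesh"]
        merged.append(rec)
    return {"data": merged}


result = merge_disease_ids([{"DiseaseIDomim": "610141"}, {"DiseaseIDmesh": "D015746"}])
print([rec["DiseaseID"] for rec in result["data"]])  # ['OMIM:610141', 'MESH:D015746']
```

As in the jq filter, a record with both fields ends up with the MESH ID, because the second branch overwrites the first.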
@rjawesome could you pause your work on this particular issue, and keep the work specific to this issue on a separate branch from use-jmes-path (maybe you've already done this with multiple-prefixes)?
After talking with @tokebe, we agreed that there are some larger-scale issues that still have to be worked out, like:
so I plan to write an issue and start discussions on that. I think after those discussions, it'll be clearer what the actual requirements / behavior we want for this issue are...
[EDIT: oh, one thing for sure is that in this use case and similar situations (one field, multiple ID prefixes), processing of the raw API response WILL BE REQUIRED to organize the IDs by namespace]
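As a rough picture of what "organizing the IDs by namespace" might look like, here is a small Python sketch. It is not BTE code; `group_by_namespace` is a hypothetical helper, and the two records are trimmed from the CTD response quoted in the original issue.

```python
from collections import defaultdict


def group_by_namespace(records: list, id_field: str = "DiseaseID") -> dict:
    """Bucket raw API records by the prefix of their ID field."""
    groups = defaultdict(list)
    for rec in records:
        prefix, sep, _ = rec[id_field].partition(":")
        # Records whose ID has no "PREFIX:" part go into a catch-all bucket.
        groups[prefix if sep else "UNKNOWN"].append(rec)
    return dict(groups)


raw = [
    {"DiseaseID": "MESH:D015746", "DiseaseName": "Abdominal Pain"},
    {"DiseaseID": "OMIM:610141", "DiseaseName": "QT INTERVAL, VARIATION IN"},
]
groups = group_by_namespace(raw)
print({ns: [r["DiseaseID"] for r in recs] for ns, recs in groups.items()})
# {'MESH': ['MESH:D015746'], 'OMIM': ['OMIM:610141']}
```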
Intro: see intro section of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/583#issue-1622873383. Originally noted in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/558#issuecomment-1459097534
3. handling output IDs when multiple ID prefixes are possible
Some operations are commented-out because BTE isn't properly handling the output IDs when multiple ID prefixes are possible. This happens when the output is a disease ID (which can be MESH or OMIM) or a pathway ID (which can be REACT or KEGG).
For example, BTE will fail to recognize that the API response returned both MESH and OMIM Disease IDs and will instead assign all the Disease IDs to the one ID-prefix assigned by the operation.
Edit SmartAPI yaml + run BTE locally
In a local copy of the [SmartAPI yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/CTD/smartapi.yaml), uncomment the `chemical2disease_1` and `chemical2disease_2` operations (lines 125, 127, 211-231, 253-273, 548-551, 556-559). Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):

```
{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"]
        }
      },
      "nodes": {
        "n0": {
          "ids": ["MESH:D004317"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      }
    }
  }
}
```

CTD's raw response
During execution, BTE should generate [this query](http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=D004317&inputTermSearchType=directAssociations&report=diseases_curated&format=json) to CTD. In CTD's raw response, some Disease IDs are MESH like `MESH:D015746` / Abdominal Pain and others are OMIM like `OMIM:610141` / QT INTERVAL, VARIATION IN.

```
{
  "CasRN": "23214-92-8",
  "ChemicalID": "D004317",
  "ChemicalName": "Doxorubicin",
  "DirectEvidence": "marker/mechanism",
  "DiseaseCategories": "Signs and symptoms",
  "DiseaseID": "MESH:D015746",
  "DiseaseName": "Abdominal Pain",
  "Input": "d004317",
  "PubMedIDs": "3712578|6542584"
},
{
  "CasRN": "23214-92-8",
  "ChemicalID": "D004317",
  "ChemicalName": "Doxorubicin",
  "DirectEvidence": "marker/mechanism",
  "DiseaseCategories": "Cardiovascular disease|Pathology (process)",
  "DiseaseID": "OMIM:610141",
  "DiseaseName": "QT INTERVAL, VARIATION IN",
  "Input": "d004317",
  "PubMedIDs": "12597018|7919046"
},
```

BTE's current flawed response
BTE will do the operation with MESH-Disease-outputs and find the OMIM ID in CTD's response. It'll then strip the OMIM ID prefix off, and then assign it as a MESH ID. This will result in a flawed record -> Edge to `MESH:610141` (when the original ID was `OMIM:610141` / QT INTERVAL, VARIATION IN).

```
"034d46e56b095c750619cc51ee2cb1bf": {
  "predicate": "biolink:related_to",
  "subject": "PUBCHEM.COMPOUND:31703",
  "object": "MESH:610141",
```

Then it'll do similar behavior with the operation with OMIM-Disease outputs and the MESH IDs in CTD's response. So there'll be flawed records -> Edges like this to `OMIM:D015746` (when the original ID was `MESH:D015746` / Abdominal Pain).

```
"0ca3845c254a304918933c581de85ae4": {
  "predicate": "biolink:related_to",
  "subject": "PUBCHEM.COMPOUND:31703",
  "object": "OMIM:D015746",
```

I think Biolink API / Monarch's post-query processing + SmartAPI yaml response-mapping (which is to fields that exist only after the post-processing) is able to handle this situation, so maybe a solution like that will work here. However, it's perhaps not ideal that multiple operations are written + the same query is done repeatedly for different post-query processing.
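To make the mislabeling described above concrete, here is a tiny Python sketch that reproduces the same effect. It is not BTE's actual code; `relabel_with_operation_prefix` is a hypothetical stand-in for "strip whatever prefix is there, then attach the operation's declared prefix".

```python
def relabel_with_operation_prefix(raw_id: str, operation_prefix: str) -> str:
    """Drop any existing prefix and attach the operation's declared one (the flawed step)."""
    local_id = raw_id.split(":", 1)[-1]
    return f"{operation_prefix}:{local_id}"


# Operation declared with MESH outputs, but CTD returned an OMIM ID:
print(relabel_with_operation_prefix("OMIM:610141", "MESH"))   # MESH:610141
# Operation declared with OMIM outputs, but CTD returned a MESH ID:
print(relabel_with_operation_prefix("MESH:D015746", "OMIM"))  # OMIM:D015746
```

Both results match the flawed edge objects shown above, which is why the original prefix needs to be read rather than discarded.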
This problem is related to past discussions on supporting multiple ID prefixes/namespaces as input / output. I'm not sure how much refactoring of code / x-bte annotation would be needed for a general solution...