biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

not edge-merging, the drug-response kp api case #407

colleenXu closed 2 years ago

colleenXu commented 2 years ago

note: BTE doesn't ingest this api directly, but it can be queried using http://localhost:3000/v1/smartapi/adf20dd6ff23dfe18e8e012bde686e31/query

EDIT: updated link. There are 3 records here that differ in their edge-attributes (particularly the "biolink:has_disease_context" one). We would want them in the TRAPI response as 3 separate KG edges. However, when running a query through BTE, only 1 record (the last one in the biothings output) shows up as an edge.

TRAPI query from gene -> chem:

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["ENSEMBL:ENSG00000181991"],
          "categories": ["biolink:Gene"]
        },
        "n1": {
          "ids": ["PUBCHEM.COMPOUND:71271629"],
          "categories": ["biolink:SmallMolecule"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:associated_with_sensitivity_to"]
        }
      }
    }
  }
}
```
the only KG edge found matches the last entry in the biothings query:

```
"95d03331aa6e6a6cf6a28bedf137b113": {
  "predicate": "biolink:associated_with_sensitivity_to",
  "subject": "NCBIGene:64963",
  "object": "PUBCHEM.COMPOUND:71271629",
  "attributes": [
    {
      "attribute_type_id": "biolink:aggregator_knowledge_source",
      "value": ["infores:biothings-explorer"],
      "value_type_id": "biolink:InformationResource"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "biolink:GeneToDrugAssociation",
      "description": "Sensitivity to the drug is associated with expression of the gene",
      "value": "biolink:GeneHasExpressionThatContributesToDrugSensitivityAssociation",
      "value_type_id": "biolink:id"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "BAO:0002162",
      "description": "Method used to quantify the strength of the association is AUC",
      "value": "BAO:0002120",
      "value_type_id": "biolink:id"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "EDAM:data_0951",
      "attributes": {
        "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
        "attribute_type_id": "NCIT:C53236",
        "description": "Spearman Correlation Test was used to compute the p-value for the association",
        "value": "NCIT:C53249",
        "value_type_id": "biolink:id"
      },
      "description": "Confidence metric for the association",
      "value": 0.007547007781067878,
      "value_type_id": "EDAM:data_1669"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "GECKO:0000106",
      "description": "Sample size used to compute the correlation",
      "value": 10
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "biolink:has_disease_context",
      "description": "Disease context for the gene-drug sensitivity association",
      "value": "MONDO:0020311",
      "value_type_id": "biolink:id"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "biolink:Dataset",
      "attributes": {
        "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
        "attribute_type_id": "biolink:Publication",
        "description": "Publication describing the dataset used to compute the association",
        "value": "PMID:27397505",
        "value_type_id": "biolink:id"
      },
      "description": "Dataset used to compute the association",
      "value": "GDSC",
      "value_type_id": null
    }
  ]
}
```

note: there is a related issue regarding "missing records". However, it doesn't apply to the example above (<1000 records returned in the biothings response).

Sometimes records will be "missing" because of the 1000-record limit on what is returned from the biothings api queries.
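To tell the two causes apart, one could check whether a response was cut off by the record limit. This is only a rough sketch: it assumes a GET-style BioThings query response that carries `total` and `hits` fields, and `is_truncated` is a made-up helper name, not something in BTE.

```python
MAX_RECORDS = 1000  # the record limit mentioned above


def is_truncated(response: dict) -> bool:
    """Return True when the API reports more matches than it returned."""
    return response.get("total", 0) > len(response.get("hits", []))


# Made-up responses, for illustration only.
over_limit = {"total": 2500, "hits": [{}] * MAX_RECORDS}
small = {"total": 3, "hits": [{}, {}, {}]}

print(is_truncated(over_limit), is_truncated(small))  # True False
```

For the example in this issue the second case applies, so the missing edges come from merging, not from the limit.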

colleenXu commented 2 years ago

In other words, the logic of having unique edges by subject-predicate-object-api-source needs to be expanded.

perhaps we can use specific edge-attributes: when they exist and differ between records, turn these records into separate edges....I suggest the biolink:has_disease_context one above for this particular API...
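A minimal sketch of what that expansion might look like, not BTE's actual hashing code. The record shape, the `api` value, the `edge_hash` helper and the two placeholder disease IDs are all assumptions for illustration; only `MONDO:0020311` and the other field values come from the edge shown above.

```python
import hashlib
import json

# Hypothetical per-API list of attribute types that should keep edges apart.
CONTEXT_ATTRIBUTE_TYPES = ("biolink:has_disease_context",)


def edge_hash(record, context_types=()):
    """Hash a record on subject, predicate, object, API and source,
    plus the values of any listed context attributes it carries."""
    context = sorted(
        (attr["attribute_type_id"], json.dumps(attr["value"], sort_keys=True))
        for attr in record.get("attributes", [])
        if attr["attribute_type_id"] in context_types
    )
    key = [
        record["subject"],
        record["predicate"],
        record["object"],
        record["api"],
        record["source"],
        context,
    ]
    return hashlib.md5(json.dumps(key).encode("utf-8")).hexdigest()


def make_record(disease):
    """Build a simplified record that differs only in disease context."""
    return {
        "subject": "NCBIGene:64963",
        "predicate": "biolink:associated_with_sensitivity_to",
        "object": "PUBCHEM.COMPOUND:71271629",
        "api": "drug-response-kp",  # placeholder API name
        "source": "infores:biothings-multiomics-biggim-drugresponse",
        "attributes": [
            {"attribute_type_id": "biolink:has_disease_context", "value": disease}
        ],
    }


# The last two disease IDs are placeholders, not real records.
records = [
    make_record(d)
    for d in ("MONDO:0020311", "MONDO:PLACEHOLDER_1", "MONDO:PLACEHOLDER_2")
]

# Without context attributes, later records overwrite earlier ones.
merged = {edge_hash(r): r for r in records}
# With context attributes in the hash, each record keeps its own edge.
separate = {edge_hash(r, CONTEXT_ATTRIBUTE_TYPES): r for r in records}

print(len(merged), len(separate))  # 1 3
```

In the `merged` case the surviving record is the last one, which matches the behavior reported above.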

tokebe commented 2 years ago

Just to solidify my understanding, are you saying that we simply want to add specific edge attributes to the generation of unique edge IDs/comparison of edges, similarly to this issue?

If this is the case, we could do something along the lines of one of the following:

Does one of these cover your expected behavior?

andrewsu commented 2 years ago

I agree that there is an issue here -- only having one of the three records represented is not quite right. But I think the desired behavior is not clear enough to say it's ready for implementation. I'm not convinced that these three records should stay as three separate edges. I think we need to raise this with the Architecture call to see if there is a best practice for how to handle this...

colleenXu commented 2 years ago

It sounds like this issue may need further discussion with Multiomics team (Guangrong) and maybe the rest of Translator, especially after I reviewed my understanding of what this API's data is...

At the moment, I think "edge merging and overwriting the edge-attributes" is happening for...

[EDIT] Adding that for each analysis (regardless of it being the same disease context or not), there's a different t-test value and effect-size value...and how to preserve / structure those values is something to consider as well...


My thoughts are that records for point 1 should be two completely separate edges, and records for point 2 should be just 1 edge, but maybe with multiple values to explain the "replicates" thing....

(and a technical note: I think the parser for this API or BTE's coding can handle whatever choices we make, and the data doesn't necessarily need to change)

colleenXu commented 2 years ago

This issue may need to be closed, to open more specific issues around edge-merging:

  1. @andrewsu said we want to collect the list of edge-attributes that are related to "context" (at what level of naming: the name of the field in x-bte-response-mapping, or whatever it's called inside a record after api-response-transform?). These can then be used for hashing when they are available.

    1. Context is a "statement qualifier" described by Translator's Data Modeling team. However, it's not clear to me what a statement qualifier is. I often think of it as info that makes the association/statement specific: specific to a disease, cell line, animal species / human, etc.
  2. "Biological replicates" may need to be handled outside of BTE / the API parsers, by the data providers themselves. It's not clear how BTE would know how to merge the edge-attributes when everything is the same except some specific edge-attributes (t-test and p-value, in this dataset's case).
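To make the replicate question concrete, here is one possible merge rule, sketched under assumptions: records that share subject, predicate, object and disease context become one edge, and the values of the listed statistic attributes are collected into lists. The flat `disease_context` field, the `merge_replicates` helper, the second p-value and the placeholder disease ID are invented for illustration; whether this rule is right is exactly what is unclear.

```python
def merge_replicates(records, stat_types):
    """Collapse records with the same key into one edge and
    gather the values of the listed statistic attributes into lists."""
    edges = {}
    for record in records:
        key = (
            record["subject"],
            record["predicate"],
            record["object"],
            record["disease_context"],
        )
        edge = edges.setdefault(key, {"key": key, "stats": {}})
        for attr in record["attributes"]:
            if attr["attribute_type_id"] in stat_types:
                edge["stats"].setdefault(attr["attribute_type_id"], []).append(
                    attr["value"]
                )
    return list(edges.values())


def make_record(disease, p_value):
    """Build a simplified record with one p-value attribute."""
    return {
        "subject": "NCBIGene:64963",
        "predicate": "biolink:associated_with_sensitivity_to",
        "object": "PUBCHEM.COMPOUND:71271629",
        "disease_context": disease,
        "attributes": [{"attribute_type_id": "EDAM:data_0951", "value": p_value}],
    }


records = [
    make_record("MONDO:0020311", 0.007547007781067878),
    make_record("MONDO:0020311", 0.02),  # invented replicate value
    make_record("MONDO:PLACEHOLDER_1", 0.5),  # invented context and value
]

edges = merge_replicates(records, stat_types=("EDAM:data_0951",))
print(len(edges))  # 2
```

The first edge ends up with two p-values, which shows the open problem: a consumer still can't tell which value belongs to which analysis unless the provider says so.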

colleenXu commented 2 years ago

Note: I'd have to review the raw data for drug response kp api, to see if the "biological replicates" discussion is still valid...I'm not seeing effect-size edge-attribute or the "duplicates" that I previously discussed in the recently updated biothings api...

colleenXu commented 2 years ago

Decisions from lab meeting:

Example of a record from the multiomics drug response KP API (a biothings api) with the disease context: [screenshot]

tokebe commented 2 years ago

Note for possible implementation:

colleenXu commented 2 years ago

Update after talking with Guangrong:

We should therefore address this like we have for disease context: by making sure the edges are kept unique!!!

colleenXu commented 2 years ago

Also after talking to Guangrong: they want to see their KP being called. Making a PR to add this API to BTE...