not edge-merging, the drug-response kp api case

colleenXu commented 2 years ago

note: BTE doesn't ingest this api directly, but it can be queried using http://localhost:3000/v1/smartapi/adf20dd6ff23dfe18e8e012bde686e31/query

EDIT: updated link. There are 3 records here that differ in their edge-attributes (particularly the "biolink:has_disease_context" one). We would want them in the TRAPI response as 3 separate KG edges. However, when running a query through BTE, only 1 record (the last one in the biothings output) shows up as an edge.

TRAPI query from gene -> chem

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["ENSEMBL:ENSG00000181991"],
          "categories": ["biolink:Gene"]
        },
        "n1": {
          "ids": ["PUBCHEM.COMPOUND:71271629"],
          "categories": ["biolink:SmallMolecule"]
        }
      },
      "edges": {
        "e1": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:associated_with_sensitivity_to"]
        }
      }
    }
  }
}
```

the only KG edge found matches the last entry in the biothings query

```
"95d03331aa6e6a6cf6a28bedf137b113": {
  "predicate": "biolink:associated_with_sensitivity_to",
  "subject": "NCBIGene:64963",
  "object": "PUBCHEM.COMPOUND:71271629",
  "attributes": [
    {
      "attribute_type_id": "biolink:aggregator_knowledge_source",
      "value": [
        "infores:biothings-explorer"
      ],
      "value_type_id": "biolink:InformationResource"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "biolink:GeneToDrugAssociation",
      "description": "Sensitivity to the drug is associated with expression of the gene",
      "value": "biolink:GeneHasExpressionThatContributesToDrugSensitivityAssociation",
      "value_type_id": "biolink:id"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "BAO:0002162",
      "description": "Method used to quantify the strength of the association is AUC",
      "value": "BAO:0002120",
      "value_type_id": "biolink:id"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "EDAM:data_0951",
      "attributes": {
        "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
        "attribute_type_id": "NCIT:C53236",
        "description": "Spearman Correlation Test was used to compute the p-value for the association",
        "value": "NCIT:C53249",
        "value_type_id": "biolink:id"
      },
      "description": "Confidence metric for the association",
      "value": 0.007547007781067878,
      "value_type_id": "EDAM:data_1669"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "GECKO:0000106",
      "description": "Sample size used to compute the correlation",
      "value": 10
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "biolink:has_disease_context",
      "description": "Disease context for the gene-drug sensitivity association",
      "value": "MONDO:0020311",
      "value_type_id": "biolink:id"
    },
    {
      "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
      "attribute_type_id": "biolink:Dataset",
      "attributes": {
        "attribute_source": "infores:biothings-multiomics-biggim-drugresponse",
        "attribute_type_id": "biolink:Publication",
        "description": "Publication describing the dataset used to compute the association",
        "value": "PMID:27397505",
        "value_type_id": "biolink:id"
      },
      "description": "Dataset used to compute the association",
      "value": "GDSC",
      "value_type_id": null
    }
  ]
}
```

note: there is a related issue regarding "missing records". However, it doesn't apply for the example above (<1000 records returned in the biothings response).

Sometimes the records will be "missing" because of the 1000 record limit of what is returned from the biothings api queries
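
A rough way to tell the two problems apart is to check whether the biothings response itself was cut off. This is only a sketch: it assumes the query response reports a `total` count next to the returned `hits`, and the `BioThingsResponse` type and `isTruncated` helper are made up for illustration.

```typescript
// Assumed shape of a biothings query response; only the fields used here.
interface BioThingsResponse {
  total: number;   // number of records matching the query
  hits: unknown[]; // records actually returned (capped by the record limit)
}

// True when the API matched more records than it returned,
// i.e. records are "missing" because of the limit, not because of edge-merging.
function isTruncated(res: BioThingsResponse): boolean {
  return res.total > res.hits.length;
}
```

If this returns false (as in the example above, with fewer than 1000 records), any records missing from the TRAPI response were lost somewhere after the biothings query.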

colleenXu commented 2 years ago

In other words, the logic of having unique edges by subject-predicate-object-api-source needs to be expanded.

perhaps we can use specific edge-attributes: when they exist and differ between records, turn these records into separate edges....I suggest the biolink:has_disease_context one above for this particular API...
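
To show why only the last record survives, here is a minimal sketch. It assumes edges are stored in a map keyed on subject-predicate-object-api-source, which may not match how BTE actually does it. The `RecordLike` type, the helper names, the `api` value and the two `EXAMPLE:` context IDs are all placeholders; only `MONDO:0020311` comes from the record above.

```typescript
// Hypothetical record shape; field names are illustrative, not BTE's real types.
interface RecordLike {
  subject: string;
  predicate: string;
  object: string;
  api: string;
  source: string;
  diseaseContext?: string; // value of biolink:has_disease_context, if present
}

// Key built only from subject-predicate-object-api-source.
function mergeKey(r: RecordLike): string {
  return [r.subject, r.predicate, r.object, r.api, r.source].join("|");
}

// Same key, plus the disease context when the record has one.
function contextKey(r: RecordLike): string {
  return r.diseaseContext === undefined
    ? mergeKey(r)
    : `${mergeKey(r)}|${r.diseaseContext}`;
}

// Store records by key; a later record with the same key overwrites the earlier one.
function collect(
  records: RecordLike[],
  keyOf: (r: RecordLike) => string
): Map<string, RecordLike> {
  const edges = new Map<string, RecordLike>();
  for (const r of records) {
    edges.set(keyOf(r), r);
  }
  return edges;
}

const base = {
  subject: "NCBIGene:64963",
  predicate: "biolink:associated_with_sensitivity_to",
  object: "PUBCHEM.COMPOUND:71271629",
  api: "drug-response-kp", // placeholder API name
  source: "infores:biothings-multiomics-biggim-drugresponse",
};

// Three records that differ only in disease context (first two IDs are placeholders).
const records: RecordLike[] = [
  { ...base, diseaseContext: "EXAMPLE:context-1" },
  { ...base, diseaseContext: "EXAMPLE:context-2" },
  { ...base, diseaseContext: "MONDO:0020311" },
];

console.log(collect(records, mergeKey).size);   // 1
console.log(collect(records, contextKey).size); // 3
```

With the narrower key, the one remaining edge carries the last record's disease context, which matches what the TRAPI response above shows.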

tokebe commented 2 years ago

Just to solidify my understanding, are you saying that we simply want to add specific edge attributes to the generation of unique edge IDs/comparison of edges, similarly to this issue?

If this is the case, we could do something along the lines of one of the following:

- Use a specific set of edge-attributes, if present, as part of the hashing input
- Use all edge-attributes as part of the hashing input
- Outside of the hashing function, compare two edges on the basis of both their hash, and any present edge-attributes (thus avoiding differing hashes due to missing attributes of otherwise equivalent edges, that should still be considered equivalent for whatever reason)

Does one of these cover your expected behavior?
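
For the third option, a comparison outside the hashing function could look roughly like this. It's a sketch only: `EdgeLike`, `COMPARED_ATTRIBUTES` and `sameEdge` are hypothetical names, and edge-attributes are simplified to a flat map from attribute type to a string value.

```typescript
// Hypothetical, simplified edge: an existing hash plus a flat attribute map.
interface EdgeLike {
  hash: string;
  attributes: Record<string, string>; // attribute_type_id -> value
}

// Attributes to compare in addition to the hash (assumed list).
const COMPARED_ATTRIBUTES = ["biolink:has_disease_context"];

// Two edges are equivalent when their hashes match and no compared attribute
// that is present on BOTH edges has a different value. An attribute missing
// on either side does not make the edges differ.
function sameEdge(
  a: EdgeLike,
  b: EdgeLike,
  compared: string[] = COMPARED_ATTRIBUTES
): boolean {
  if (a.hash !== b.hash) return false;
  return compared.every((id) => {
    const av = a.attributes[id];
    const bv = b.attributes[id];
    return av === undefined || bv === undefined || av === bv;
  });
}
```

One tradeoff to keep in mind: because a missing attribute matches anything, this comparison isn't transitive (an edge with no context is "equal" to two edges whose contexts differ), so it can't be used directly as a map key the way a hash can.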

andrewsu commented 2 years ago

I agree that there is an issue here -- only having one of the three records represented is not quite right. But I think the desired behavior is not clear enough to say it's ready for implementation. I'm not convinced that these three records should stay as three separate edges. I think we need to raise this with the Architecture call to see if there is a best practice for how to handle this...

colleenXu commented 2 years ago

It sounds like this issue may need further discussion with Multiomics team (Guangrong) and maybe the rest of Translator, especially after I reviewed my understanding of what this API's data is...

At the moment, I think "edge merging and overwriting the edge-attributes" is happening for...

1. records like my opening post's example, where there were 3 separate disease contexts
2. records that look identical except for a different t-test/effect-size value (see point 2 here). According to the explanation here, these are "biological replicates": when the same drug-gene-disease relationship is found after analyzing different datasets.

[EDIT] Adding that for each analysis (regardless of it being the same disease context or not), there's a different t-test value and effect-size value...and how to preserve / structure those values is something to consider as well...

My thoughts are that records for point 1 should be completely separate edges, and records for point 2 should be just 1 edge, but maybe with multiple values to explain the "replicates" thing....

(and a technical note: I think the parser for this API or BTE's coding can handle whatever choices we make, and the data doesn't necessarily need to change)

colleenXu commented 2 years ago

This issue may need to be closed, to open more specific issues around edge-merging:

- @andrewsu said we want to collect the list of edge-attributes that are related to "context" (at what level of naming: the name of the field in x-bte-response-mapping, or whatever it's called inside a record after api-response-transform?). These can then be used for hashing when they are available.
  - Context is a "statement qualifier" described by Translator's Data Modeling team. However, it's not clear to me what a statement qualifier is. I often think of it as info that makes the association/statement specific: specific to a disease, cell-line, animal species / human specific, etc...
- "biological replicates" may need to be handled outside of BTE / the API parsers, by the data providers themselves. It's not clear how BTE would know how to merge the "edge-attributes" when everything is the same except some specific edge-attributes (t-test and pvalue, in this dataset's case)

colleenXu commented 2 years ago

Note: I'd have to review the raw data for drug response kp api, to see if the "biological replicates" discussion is still valid...I'm not seeing effect-size edge-attribute or the "duplicates" that I previously discussed in the recently updated biothings api...

colleenXu commented 2 years ago

Decisions from lab meeting:

- we want to add the edge-attribute "biolink:has_disease_context" to the hash, when it is available. This should allow BTE to give different edges for the multiomics drug response kp api (see the opening post) when its values are different.
- we want it to be easy to add edge-attributes when situations like this come up.

Example of a record from the multiomics drug response kp api (biothings) with the disease-context: [screenshot: Screen Shot 2022-03-16 at 9 55 45 AM]

tokebe commented 2 years ago

Note for possible implementation:

- function to read config file, with config file defining attributes to check for and add to hash
  - function then checks if given attributes exist on record, and gets them if so, combining into a string to add to the hash input.
- we may need a new helper function to get edge-attributes
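
The notes above could be sketched roughly as follows. Everything here is an assumption for illustration: the config shape (`hashedAttributes`), the helper names, and the way the extra string is joined onto the existing hash input. A JSON string stands in for the config file so the example is self-contained.

```typescript
// Minimal view of a TRAPI-style edge-attribute; only the fields used here.
interface AttributeLike {
  attribute_type_id: string;
  value: unknown;
}

// Hypothetical config shape: attribute type IDs to fold into the hash input.
interface HashConfig {
  hashedAttributes: string[];
}

// Stand-in for reading and parsing the config file.
const config: HashConfig = JSON.parse(
  '{ "hashedAttributes": ["biolink:has_disease_context"] }'
);

// Helper to get an edge-attribute's value by type ID (undefined if absent).
function getEdgeAttribute(attributes: AttributeLike[], typeId: string): unknown {
  return attributes.find((a) => a.attribute_type_id === typeId)?.value;
}

// Combine the configured attributes that exist on the record into one string.
// Attributes that are absent are skipped, so they contribute nothing.
function attributeHashInput(attributes: AttributeLike[], cfg: HashConfig): string {
  return cfg.hashedAttributes
    .map((id) => [id, getEdgeAttribute(attributes, id)] as const)
    .filter(([, value]) => value !== undefined)
    .map(([id, value]) => `${id}=${JSON.stringify(value)}`)
    .join(";");
}

// Append the attribute string to whatever is already being hashed.
function edgeHashInput(
  baseInput: string,
  attributes: AttributeLike[],
  cfg: HashConfig
): string {
  const extra = attributeHashInput(attributes, cfg);
  return extra === "" ? baseInput : `${baseInput}|${extra}`;
}
```

Records without any of the configured attributes keep their current hash input unchanged, so edges from other APIs wouldn't get new IDs. Supporting another attribute later would only mean adding its type ID to the config.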

colleenXu commented 2 years ago

Update after talking with Guangrong:

the "biological replicates" are NOT actually what we were thinking when we called them that. They are actually from separate measurements (looking at variants vs expression). They should therefore be represented as separate edges in the Translator space, not merged into 1 edge.

We should therefore address this like we have for disease context: by making sure the edges are kept unique!!!

colleenXu commented 2 years ago

Also after talking to Guangrong: they want to see their KP being called. Making a PR to add this API to BTE...

biothings / biothings_explorer

not edge-merging, the drug-response kp api case #407