Enhance Biolink models for Evidence, Provenance and Confidence/Context metadata capture

nlharris commented 4 years ago

From the discussion in #datamodeling:

Eugene: Beyond the obvious PubMed IDs, what kinds of attributes would be considered core standards for provenance? I brought up ECO codes, which a few others agreed with.

Michael P: The provenance I have found most useful: 1) PubMed ID; 2) the assay/lab method used for the experiment WITH A QUANTITATIVE value (ligand-binding assay affinity, enzyme efficiency with drug dose, etc.); 3) publication date/journal; 4) ECO code, to determine whether the edge claim is an "author hypothesis" (not as good) or is linked to an assay/lab method, as in 2), that is new/reliable (pretty good).

Eugene: I agree that those are all useful information for provenance. Here are a couple of things to consider as well:

Do we expect provenance to be attributed to both nodes and edges of all types? For instance, do we need to require evidence for protein/gene existence (node provenance)?
If we're going beyond simple PubMed IDs for provenance, we may need to have a set of standard provenance types, because ARAs will need to know how to differentiate between them to rank responses. Some of what Michael mentioned are IDs, some are strings with semantic meaning that may or may not be standardized in an ontology like ECO, some are integers, and some are dates, versions, publication impact factors, etc. We may need to formalize complex provenance types in our data model.
I suggest if we ever want to formalize provenance in this way, we take stock of the kinds of things we expect to require provenance for, since I imagine that some concepts will require vastly different kinds of proof (e.g. drug efficacy EC50 values vs protein-protein interaction evidence), while there will be some that are common across all or most (ECO annotation or PubMed ID).

From what I gathered, the SRI group wants a set of attributes to require as a standard for nodes/edges. Since provenance/evidence attributes on edges like (gene) --[encodes]-> (protein) will differ from those on edges like (drug) --[inhibits]-> (protein), we can recommend that certain edge types require a certain set of provenance attributes, but I guess we would need to know all of the edge types first. I believe that explicitly defining these provenance types will make it much easier for ARAs to rank responses, but it's a moot point if this task is prohibitively difficult or time-consuming to accomplish.
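
A minimal sketch of what "certain edge types require a certain set of provenance attributes" could look like, assuming a simple in-memory edge record. The attribute names (`publications`, `eco_code`, `assay_method`, `assay_value`), the per-predicate table, and the identifiers are hypothetical illustrations, not Biolink Model slots:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set

# Hypothetical: attributes expected on every edge, regardless of type.
COMMON_REQUIRED: Set[str] = {"publications", "eco_code"}

# Hypothetical: extra attributes required only for particular predicates.
REQUIRED_BY_PREDICATE: Dict[str, Set[str]] = {
    "encodes": set(),
    "inhibits": {"assay_method", "assay_value"},
}


@dataclass
class Edge:
    subject: str
    predicate: str
    object: str
    provenance: Dict[str, Any] = field(default_factory=dict)


def missing_provenance(edge: Edge) -> Set[str]:
    """Return the required provenance attributes that the edge lacks."""
    extra = REQUIRED_BY_PREDICATE.get(edge.predicate, set())
    required = COMMON_REQUIRED | extra
    return {name for name in required if name not in edge.provenance}


# Placeholder identifiers, for illustration only.
edge = Edge(
    subject="DRUG:example",
    predicate="inhibits",
    object="PROTEIN:example",
    provenance={"publications": ["PMID:0000000"], "eco_code": "ECO:0000000"},
)
print(sorted(missing_provenance(edge)))
```

With a table like this, an ARA could check which provenance attributes an edge is missing before deciding how to weight it. Filling in the table still depends on knowing the edge types first, as noted above.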

cmungall commented 4 years ago

Current list here:

https://biolink.github.io/biolink-model/docs/association_slot

mikebada commented 4 years ago

This is a position statement arguing for the possibility of defining KP-specific confidence/provenance/epistemic parameters/metadata, as opposed to only those standardized across Translator and common to multiple KPs. Speaking for the Text Mining Provider KP, we think that at least some of the KPs will likely require such parameters/metadata unique to those KPs.

For example, for each assertion we mine from text and output to a knowledge graph, we think it'd be good to assign some kind of confidence score (e.g., from 0 to 1) that estimates how sure we are that the outputted assertion is correct in the sense that it's what's represented in the corresponding text; this seems pretty unique to our KP.

Then, it'd also be good to output some sort of confidence score, somewhat more in line with other such scores, characterizing/estimating the truth of the assertion. However, what's usually available for this in the text is a qualitative epistemic lexical cue (these are very common). Converting these cues to numerical values seems problematic (other than, e.g., specifying that "likely" > 0.5), so it might be better to output one of a fixed set of such qualitative values (e.g., very unlikely, unlikely, possibly, likely, very likely), which may or may not also be used by other KPs. (This is all up for discussion, of course. Also, we’re not ready to output these yet, but we can envision doing so.)

Of course, if KP-specific confidences/provenances/epistemics are allowed for, it'd be important to clearly specify them such that users/ARAs/other KPs can decide for themselves which assertions they would want to use based on their parameters/metadata.
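
A rough sketch of the two kinds of confidence described above, assuming a hypothetical `MinedAssertion` record: a numeric extraction confidence in [0, 1] plus an optional ordered qualitative value for the truth of the assertion. The class, field, and function names are illustrative assumptions, not existing Biolink Model or Text Mining Provider definitions:

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Likelihood(IntEnum):
    """Ordered qualitative values; only the ordering is meaningful."""
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLY = 3
    LIKELY = 4
    VERY_LIKELY = 5


@dataclass
class MinedAssertion:
    subject: str
    predicate: str
    object: str
    # How sure the miner is that the text really states this assertion.
    extraction_confidence: float
    # Epistemic cue found in the text, if any (e.g. "likely").
    likelihood: Optional[Likelihood] = None

    def __post_init__(self) -> None:
        if not 0.0 <= self.extraction_confidence <= 1.0:
            raise ValueError("extraction_confidence must be in [0, 1]")


def select(
    assertions: List[MinedAssertion],
    min_extraction: float = 0.8,
    min_likelihood: Likelihood = Likelihood.LIKELY,
) -> List[MinedAssertion]:
    """Keep assertions that meet both consumer-chosen thresholds."""
    return [
        a
        for a in assertions
        if a.extraction_confidence >= min_extraction
        and a.likelihood is not None
        and a.likelihood >= min_likelihood
    ]
```

Keeping the qualitative values as an ordered enumeration avoids assigning them arbitrary numbers while still letting a consumer apply a threshold such as "likely or better".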

micheldumontier commented 4 years ago

I drafted a note to address three forms of provenance: data retrieval, assertional reporting, and data processing.

https://github.com/biolink/biolink-model/issues/357

hsolbrig commented 4 years ago

A goodly amount of work has been done on this in another domain:

https://ddialliance.org/

Would be worth investigating

nlharris commented 4 years ago

Is this done or still in progress?

deepakunni3 commented 4 years ago

In progress as part of the EPC WG.

RichardBruskiewich commented 3 years ago

I am today generalising this issue ticket to cover the general objective of enhancing Biolink Model support for Evidence, Provenance and Confidence/Context, as generally being discussed in the EPC Working Group and the EPC repository fork of the Biolink Model.

sierra-moxon commented 3 years ago

[ ] add assay node (perhaps with measurements)
[x] add publication details (like date, etc.)
[x] discuss provenance of nodes vs. edges; done: just edge provenance for now.
[x] add evidence code list to association object (in #61)
[ ] add data version

vdancik commented 3 years ago

There is a TRAPI issue, NCATSTranslator/ReasonerAPI#97, regarding data version that should be addressed via EPC modeling rather than TRAPI. @sierra-moxon, will you add data version to your list?

sierra-moxon commented 1 year ago

In the last year, we've added several KP-specific edge properties for EPC. In the next year, we plan to add further edge properties, primarily to support "Evidence" display in the Translator UI. I am tempted to close this ticket and open smaller tickets with actionable properties to add. Please do comment if you believe there are actionable items in this ticket now that I can prioritize for addressing.

biolink / biolink-model

Enhance Biolink models for Evidence, Provenance and Confidence/Context metadata capture #355