biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

3 forms of provenance for machine-readable statements #357

Open micheldumontier opened 4 years ago

micheldumontier commented 4 years ago

I can see at least 3 different forms of provenance relating to a (machine-readable) statement, e.g. one represented using the biolink model.

The first form of provenance relates to the procurement of the data file containing the statement. In this case, there are potentially 4 different attributes of interest: where the data were obtained from, when they were obtained, who obtained them, and how they were obtained.

[Figure: provenance of (web-based) data retrieval]

The second form of provenance relates to the source of the statement. We'll assume here that the primary source of interest is some machine-readable document (e.g. scientific article, PubMed abstract, data file). There are at least 4 attributes of interest here: in what document the statement can be found, when and by whom the assertion was made, and what tool was used to record it.

[Figure: provenance of assertional reporting]

The third form of provenance relates to how the machine-readable association was generated from the original document. This covers any processing of that original document: for instance, an algorithm to identify protein-protein interactions from mass spectrometry experiments, human-led curation, or automated text processing of the conclusion section of a scientific article into a structured, machine-readable statement.

[Figure: provenance related to data processing]
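
To make the three forms concrete, here is a minimal Python sketch of how they might be attached to a single statement. All class names, field names, and example values are hypothetical placeholders for illustration; none of them are taken from the biolink model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalProvenance:
    """Form 1: how the data file containing the statement was procured."""
    retrieved_from: str    # where the data were obtained from
    retrieved_at: str      # when they were obtained (ISO 8601 string)
    retrieved_by: str      # who obtained them
    retrieval_method: str  # how they were obtained


@dataclass
class AssertionProvenance:
    """Form 2: the source document in which the statement was asserted."""
    source_document: str                  # in what document the statement is found
    asserted_at: Optional[str] = None     # when the assertion was made
    asserted_by: Optional[str] = None     # who made it
    recording_tool: Optional[str] = None  # tool used to record it


@dataclass
class ProcessingProvenance:
    """Form 3: how the machine-readable statement was generated."""
    method: str      # e.g. curation, text mining, an analysis algorithm
    agent: str       # person or software that did the processing
    agent_kind: str  # "human" or "software"


@dataclass
class Statement:
    """A subject-predicate-object statement with optional provenance."""
    subject: str
    predicate: str
    object: str
    retrieval: Optional[RetrievalProvenance] = None
    assertion: Optional[AssertionProvenance] = None
    processing: Optional[ProcessingProvenance] = None


# Placeholder values only; these do not describe a real dataset.
stmt = Statement(
    subject="EX:protein_a",
    predicate="interacts_with",
    object="EX:protein_b",
    retrieval=RetrievalProvenance(
        retrieved_from="https://example.org/interactions.tsv",
        retrieved_at="2020-01-01T00:00:00Z",
        retrieved_by="EX:data_engineer",
        retrieval_method="HTTP download",
    ),
    assertion=AssertionProvenance(source_document="EX:article_1"),
    processing=ProcessingProvenance(
        method="human-led curation",
        agent="EX:curator_1",
        agent_kind="human",
    ),
)
```

Keeping each form in its own optional record means a statement can carry whichever forms are known without implying anything about the others.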

ehinderer commented 4 years ago

One question for clarification: would identifying the experimental method used in the original source (e.g. "predicted interaction by homology" versus "x-ray crystallographic characterization") be accomplished by the "generated with" statement under the third form?

If this is the case, would we expect to use ECO codes here?

micheldumontier commented 4 years ago

Correct. Computational processing of data, whether the data were directly produced by a machine or created previously, falls, in the broadest sense, under the third form.

The manner in which we express what processing was done, and how to represent it, is certainly the subject of further discussion, particularly because there are many valid views on how complicated it should be. The use of one or more controlled terminologies such as ECO could be one approach to indicate elements of the provenance in the generation of the statement. While ECO has been very useful for curating the biological scientific literature, I have specific concerns about the structure and organisation of ECO, along with its scope and potential to extend beyond its current boundaries, but that's a different discussion.

An alternative to a simple list of pre-defined terms is to use a terminology to describe a sequence, tree, or graph of events, along with the participants and their roles. This is the approach of PROV and SEPIO. The advantage here is that content creators have vastly more flexibility in crafting accurate, schema-compliant descriptions; the disadvantage is that provenance processing tools must be able to process a provenance graph according to those specifications. That said, for the Translator community, it may make sense to adopt the flexible activity-based provenance description approaches that also admit the use of controlled vocabularies, like ECO, to help craft those machine-readable descriptions.
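
As a rough illustration of the activity-based idea, here is a small Python sketch of a provenance graph in which entities are generated by activities, activities use other entities, and agents participate with roles. The relation names loosely follow PROV (`used`, `wasGeneratedBy`, `wasAssociatedWith`), but the classes, the `trace` helper, and all identifiers are hypothetical. The `method_term` value is a placeholder, not a real ECO code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """A document, data file, or statement."""
    id: str
    generated_by: Optional[str] = None  # id of the generating activity, if known


@dataclass
class Activity:
    """A processing step that consumes entities and involves agents."""
    id: str
    used: List[str] = field(default_factory=list)  # ids of input entities
    associated_with: Dict[str, str] = field(default_factory=dict)  # agent id -> role
    method_term: Optional[str] = None  # optional controlled-vocabulary term


def trace(entity_id: str,
          entities: Dict[str, Entity],
          activities: Dict[str, Activity]) -> List[str]:
    """Walk back from an entity and list the activities that led to it."""
    entity = entities[entity_id]
    if entity.generated_by is None:
        return []  # an original source with no recorded processing
    activity = activities[entity.generated_by]
    chain = [activity.id]
    for input_id in activity.used:
        chain.extend(trace(input_id, entities, activities))
    return chain


# Placeholder graph: article -> text mining -> extracted sentence -> curation -> statement
entities = {
    "ex:article": Entity("ex:article"),
    "ex:extracted": Entity("ex:extracted", generated_by="ex:text_mining"),
    "ex:statement": Entity("ex:statement", generated_by="ex:curation"),
}
activities = {
    "ex:text_mining": Activity(
        "ex:text_mining",
        used=["ex:article"],
        associated_with={"ex:nlp_tool": "software"},
    ),
    "ex:curation": Activity(
        "ex:curation",
        used=["ex:extracted"],
        associated_with={"ex:curator_1": "curator"},
        method_term="ECO:XXXXXXX",  # placeholder, not a real ECO identifier
    ),
}

lineage = trace("ex:statement", entities, activities)
```

This shows the trade-off mentioned above: the graph can describe an arbitrary chain of steps and participants, but a consumer needs a traversal such as `trace` rather than a simple lookup of a single term.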

ehinderer commented 4 years ago

I've also read that ECO plans to collaborate with the Confidence Information Ontology to include information about confidence, in addition to methodology, in its annotations, but that's an ongoing effort. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323956/