Association slots for 'source provenance' tracking

This ticket summarizes the latest proposals for how to track the upstream "Information Resources" from which an Edge in a KP or TRAPI message was retrieved (directly or indirectly). "Information Resources" here includes Translator KPs, and external community knowledgebases. The type of 'provenance' we are concerned with here is the mechanical retrieval/ETL of data - not how a KP or ARA might generate new knowledge from data, or curate knowledge from a publication or dataset.

Proposal 1: An initial proposal aimed to be explicit about different types of sources that we might want to distinguish using different named edge properties. These included:

1. knowledge_provider_source - the KP that provided the edge to an ARA
2. original_source - the most traceable upstream source of the knowledge that a KP can identify
3. aggregator_source - an intermediate aggregating KB through which the knowledge passed on its way to its representation in a KP's graph.
4. direct_source - the resource from which the KP most directly retrieved the knowledge expressed in an edge.
5. supporting_data_source - a resource that provided data on which a KP operated to compute/generate new knowledge.

In evaluating this proposal, several scenarios were identified that would not fit neatly into this representation (e.g. a KP retrieving an edge from another KP), or that could cause confusion when viewed in the context of a KP graph vs a message to / from an ARA (e.g. the 'direct_source' of an Edge would be different once it is passed from a KP to an ARA). There was also a large potential for duplication, and for KPs to represent the same provenance in multiple inconsistent ways.

Proposal 2: An alternative was proposed where there is a single edge property for capturing 1-4 above, that holds an ordered list of information resources. The 'role' of a given information resource is not explicitly named, but can be inferred from its position in this list - i.e. the 'direct source' is first in the list, and the most upstream 'original source' is last. This list can grow dynamically as the edge is passed between resources/systems, to reflect its growing provenance trail.

knowledge_sources: an ordered list of resources through which the knowledge passed on its way to being expressed in an Edge.
supporting_data_sources: we still want to distinguish the notion of sources that provide data used in inference/computation of new knowledge, from sources that provide the knowledge itself.
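
To make the positional convention in Proposal 2 concrete, here is a minimal Python sketch of how a consumer might infer roles from list position. The function names `infer_roles` and `pass_downstream` and the role labels are hypothetical, and the `infores:` identifiers are borrowed from the running example used later in this ticket; none of this is part of Biolink or TRAPI.

```python
from typing import Dict, List


def infer_roles(knowledge_sources: List[str]) -> Dict[str, str]:
    """Infer a role for each source from its position in the ordered list.

    Assumed convention (from Proposal 2): index 0 is the direct source,
    the last index is the most upstream ('original') source, and anything
    in between is an intermediate aggregator.
    """
    roles: Dict[str, str] = {}
    last = len(knowledge_sources) - 1
    for index, source in enumerate(knowledge_sources):
        if index == last:
            # With a single-element list the direct source is also the
            # most upstream one, so 'original' takes precedence here.
            roles[source] = "original"
        elif index == 0:
            roles[source] = "direct"
        else:
            roles[source] = "aggregator"
    return roles


def pass_downstream(knowledge_sources: List[str], provider: str) -> List[str]:
    """Return the list as seen by a consumer that retrieved the edge from
    `provider`: the provider becomes the new direct source."""
    return [provider] + knowledge_sources


# Edge as held in a KP's graph: retrieved from ChEMBL, which got it
# from ClinicalTrials.gov.
in_kp = ["infores:chembl", "infores:clinicaltrials"]
# The same edge after an ARA retrieves it from the KP.
in_ara = pass_downstream(in_kp, "infores:molepro_kp")

print(infer_roles(in_kp))
print(infer_roles(in_ara))
```

Note that `infores:chembl` is labelled 'direct' in the first call and 'aggregator' in the second, so the role inferred for a source depends on where in the chain the list is read.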

In evaluating this approach, concerns were raised about the ability to accurately interpret such a list and parse out a more explicit representation of the role each source plays - e.g. for end users who may want to more clearly see what the original source was for a given edge. Many felt that the diversity of possible provenance trail scenarios would make consistent and unambiguous translation/interpretation of source roles difficult.

Concerns were also raised about the inability to capture additional retrieval/source metadata about each information resource - as they are all bundled into a single attribute (either using the built in attribute fields, or extensions implemented through nesting of attributes).

Proposal 3a: A hybrid approach was also proposed to explicitly split out the 'original_source' as a named field, as it was felt this was the most important role for end users to discern. All other 'sources' would be captured as an ordered list in a single 'aggregator_source' field (which may include external aggregators, or internal KPs - but it would not include the original source).

The resulting set of source provenance edge properties would include:

original_source: the most upstream source that can be identified. Can be used only once per edge.
aggregator_source: an ordered list of other resources through which the knowledge passed on its way to being expressed in an Edge (list would not include the 'original' source)
supporting_data_source: same as above.

Evaluation of this proposal raised concerns about being able to definitively say what the 'original source' was - in many cases the value of this property would be the most upstream source a KP is able/willing to discern, but not necessarily the 'original' source. It also precludes attaching additional metadata to specific sources in the list of 'aggregators'.

Proposal 3b: A slight variation on the hybrid approach above, where each intermediate/aggregator_source is captured in a separate attribute object. This is most similar to Proposal 1 - but uses fewer edge properties to simplify data creation, and reduce the potential for duplication and inconsistent representations across sources. Loss of 'kp_source' (knowledge_provider_source) is acceptable as it remains clear which resources are Translator KPs. Loss of 'direct_source' is acceptable (and can even be approximated through use of a nested attribute to capture a retrieval index number). Overall, this approach strikes a nice balance between the expressivity and extensibility of Proposal 1, and the simplicity of Proposals 2 and 3a.

original_source: same as above - the most upstream source that can be identified. Can be used only once per edge.
aggregator_source: as above, captures another resource through which the knowledge passed on its way to being expressed in an Edge. The key difference is that only one resource is allowed per attribute (as opposed to Proposal 3a, where this is an ordered list).
supporting_data_source: same as above.

This more extensible approach would allow for additional metadata to be captured about retrieval from each source in the trail, including intermediate aggregators - because each is represented as the sole value of an Attribute object, from which we can nest Attributes holding additional details about them (e.g. a specific API endpoint was used, version info, date of retrieval, etc.).

This approach may be preferred if we think that there will be use cases requiring such additional details about retrieval steps to be captured in the provenance trail. Note that if retrieval order is important, it can still be captured in a nested attribute that holds a retrieval index number (e.g. 1 = direct source, 2 = next upstream source, . . . ).
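
As a rough illustration of the retrieval index idea, the Python sketch below builds one attribute per source, nests an index inside each, and recovers the ordered trail from it. The helper names and the `example:retrieval_index` attribute type are invented for this sketch, the `biolink:original_source` / `biolink:aggregator_source` identifiers simply follow the property names proposed above, and the nested `attributes` key is an assumption about how nesting would be expressed.

```python
from typing import Any, Dict, List

Attribute = Dict[str, Any]

# Made-up attribute type, used only for this illustration.
INDEX_TYPE = "example:retrieval_index"


def source_attribute(type_id: str, source: str, index: int) -> Attribute:
    """Build one source attribute carrying a nested retrieval index."""
    return {
        "attribute_type_id": type_id,
        "value": source,
        "attributes": [{"attribute_type_id": INDEX_TYPE, "value": index}],
    }


def retrieval_index(attribute: Attribute) -> int:
    """Read the nested retrieval index back out of a source attribute."""
    for nested in attribute.get("attributes", []):
        if nested["attribute_type_id"] == INDEX_TYPE:
            return nested["value"]
    raise ValueError("no retrieval index on attribute for %r" % attribute.get("value"))


def ordered_trail(attributes: List[Attribute]) -> List[str]:
    """Return source identifiers ordered from direct (1) to most upstream."""
    return [a["value"] for a in sorted(attributes, key=retrieval_index)]


# Attributes listed in arbitrary order; the nested index preserves the trail.
edge_attributes = [
    source_attribute("biolink:original_source", "infores:clinicaltrials", 3),
    source_attribute("biolink:aggregator_source", "infores:molepro_kp", 1),
    source_attribute("biolink:aggregator_source", "infores:chembl", 2),
]

print(ordered_trail(edge_attributes))
```

Because each source sits in its own attribute, further nested attributes (endpoint, version, retrieval date) could be added alongside the index without changing how the trail is read.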

This ticket is to discuss feedback on and development of these proposals. Note that a gist with data examples demonstrating how each approach represents data for a common scenario can be found here.

The conclusion during the DM meeting on Apr-22 is to go with proposal 3b.

One lingering concern that has been expressed by several folks about 'Proposal 3b' is that the model forces one of two choices for KPs: original vs aggregator source. KPs often cannot tell for sure if an Info Resource they retrieved knowledge from is truly the original source. In practice, KPs might use this 'original_source' property for the most upstream source they can identify, because declaring it an 'aggregator' seems wrong - but it may not truly be the original source.

A slight change to Proposal 3b above that introduces a more general 'knowledge_source' property, and a new 'primary_knowledge_source' property, could help make things a bit more accommodating and precise.

Proposal 3C: A small hierarchy of properties.

knowledge_source (doesn't commit to being aggregator, primary, or original).
- primary_knowledge_source (used for the furthest upstream source in the chain that the data creator can identify)
  - original_knowledge_source (used when KP is confident that the primary source is the original source)
- aggregator_knowledge_source (used when the KP is confident the source is an aggregator)
supporting_data_source (used for sources of data that a KP computes on to generate new knowledge)

This approach gives KPs options to more clearly and precisely indicate the role of each source in the retrieval of knowledge expressed in an edge. Our recommendation would be for data creators to minimally distinguish 'primary' (most upstream) from 'aggregator' (intermediate) sources - as we feel that this can be determined in nearly all cases. If the data creator is confident that the primary source was the original source, they can use the 'original_knowledge_source' property. In practice, for a linear chain of retrieval, one source should be declared 'primary' or 'original', and the rest 'aggregators'.
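
The linear-chain recommendation can be checked mechanically. The sketch below is one possible check in Python, assuming the property names from the Proposal 3C hierarchy with a `biolink:` prefix; the function name `check_linear_chain` and the wording of the messages are hypothetical.

```python
from typing import List

# Property names taken from the Proposal 3C hierarchy above.
UPSTREAM_TYPES = {
    "biolink:primary_knowledge_source",
    "biolink:original_knowledge_source",
}


def check_linear_chain(attribute_type_ids: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the chain looks fine.

    Only the recommendation for a linear chain is checked: exactly one
    source is declared 'primary' or 'original'. All other source
    attributes are assumed to be aggregators.
    """
    problems: List[str] = []
    upstream = [t for t in attribute_type_ids if t in UPSTREAM_TYPES]
    if len(upstream) == 0:
        problems.append("no primary or original knowledge source declared")
    elif len(upstream) > 1:
        problems.append("more than one primary/original knowledge source declared")
    return problems


good_chain = [
    "biolink:original_knowledge_source",
    "biolink:aggregator_knowledge_source",
    "biolink:aggregator_knowledge_source",
]
missing_upstream = ["biolink:aggregator_knowledge_source"]

print(check_linear_chain(good_chain))
print(check_linear_chain(missing_upstream))
```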

Based on this Proposal 3C, our running data example would be represented as below:

# A message sent to Ranking Agent from MolePro, holding a single ChemicalToGene Edge. The retrieval path is as follows:

# RankingAgent  --retrieved_from-->   MolePro  --retrieved_from-->  ChEMBL  --retrieved_from-->  ClinicalTrials.gov

# Below is the edge as represented by the Ranking Agent ARA
"edges": [
  {
  "id": "Association001",   
  "category": "biolink:ChemicalToGeneAssociation",
  "subject": "chebi:3215",        
  "predicate": "biolink:interacts_with",
  "object": "ncbigene:51176",
  "attributes":  [
    {
      "attribute_type_id": "biolink:original_knowledge_source",  # Assumes data creator knows this is the original source. If not, they could use 'primary_knowledge_source'.
      "value":  "infores:clinicaltrials",  
      "value_type_id": "biolink:InformationResource",     
      "value_url":  "https://www.clinicaltrials.gov",
      "description": "ClinicalTrials.gov is...",
      "attribute_source": "infores:chembl"
    },
    {
      "attribute_type_id": "biolink:aggregator_knowledge_source",  # Assumes data creator knows this is an aggregator, and wants to assert this. 
      "value":  "infores:chembl",
      "value_type_id": "biolink:InformationResource",     
      "value_url":  "https://www.ebi.ac.uk/chembl",
      "description": "ChEMBL is a manually curated database of bioactive molecules...",
      "attribute_source": "infores:molepro_kp"
    }, 
    {
      "attribute_type_id": "biolink:aggregator_knowledge_source",
      "value": "infores:molepro_kp",
      "value_type_id": "biolink:InformationResource",
      "value_url":  "https://translator.broadinstitute.org/molepro/trapi/v1.0",
      "description": "The Molecular Data Provider KP from NCATS Translator",
      "attribute_source": "infores:ranking_agent"     # Ranking Agent would add this info when it retrieves the edge
    },

# Additional attributes holding supporting publication and record_url metadata
    {
      "attribute_type_id": "biolink:has_supporting_publication",
      "value": "pmid:2012345",
      "value_type_id": "biolink:Publication",
      "value_url":  "https://pubmed.ncbi.nlm.nih.gov/2012345",
      "description": "Am J Vet Res.,1991 Feb;52(2):328-32",
      "attribute_source": "infores:chembl"
    },
    {
      "attribute_type_id": "biolink:source_record_url",  
      "value": "https://www.ebi.ac.uk/chembl/target_report_card/CHEMBL3217392/",       
      "value_type_id": "biolink:uriorcurie",     
      "description": "The data record reporting the bupivacaine-LEF-1 interaction",
      "attribute_source": "infores:molepro_kp"  
    }
  ]
 }
]
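
One property of the example above is that the retrieval path can be rebuilt from the attributes alone, if we assume that each source attribute's `attribute_source` names the resource that retrieved the edge from the attribute's `value`. The Python sketch below does this for a trimmed copy of the attributes; `retrieval_path` is a hypothetical helper, and the assumption about `attribute_source` is read off this example rather than any stated rule.

```python
from typing import Any, Dict, List

SOURCE_TYPES = {
    "biolink:knowledge_source",
    "biolink:primary_knowledge_source",
    "biolink:original_knowledge_source",
    "biolink:aggregator_knowledge_source",
}


def retrieval_path(attributes: List[Dict[str, Any]]) -> List[str]:
    """Rebuild a linear retrieval chain, most downstream resource first."""
    # Map each asserting resource to the source it retrieved the edge from.
    retrieved_from = {
        a["attribute_source"]: a["value"]
        for a in attributes
        if a["attribute_type_id"] in SOURCE_TYPES
    }
    # The chain starts at the one resource that asserted a source but is
    # never itself listed as a source.
    starts = set(retrieved_from) - set(retrieved_from.values())
    if len(starts) != 1:
        raise ValueError("retrieval chain is not linear")
    path = [starts.pop()]
    while path[-1] in retrieved_from:
        upstream = retrieved_from[path[-1]]
        if upstream in path:
            raise ValueError("retrieval chain contains a cycle")
        path.append(upstream)
    return path


# Trimmed copy of the attributes in the example (only the fields used here).
attributes = [
    {"attribute_type_id": "biolink:original_knowledge_source",
     "value": "infores:clinicaltrials", "attribute_source": "infores:chembl"},
    {"attribute_type_id": "biolink:aggregator_knowledge_source",
     "value": "infores:chembl", "attribute_source": "infores:molepro_kp"},
    {"attribute_type_id": "biolink:aggregator_knowledge_source",
     "value": "infores:molepro_kp", "attribute_source": "infores:ranking_agent"},
    # Non-source attributes are ignored when rebuilding the path.
    {"attribute_type_id": "biolink:has_supporting_publication",
     "value": "pmid:2012345", "attribute_source": "infores:chembl"},
]

print(" --retrieved_from--> ".join(retrieval_path(attributes)))
```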

Tagging @vdancik @edeutsch @patrickkwang @cbizon @cmungall @RichardBruskiewich @sierra-moxon @colleenXu - as these folks have provided feedback to date, including some of the concerns that we feel this latest (and hopefully final) proposal addresses.

I am fine with this. Seems sensible. I don't feel strongly.

It is also important to consider a different scenario where the knowledge expressed in an Edge was generated by a KP, based on data they retrieved from some external Information Resource. e.g. an Association generated by the Exposures Provider ICEES tool by computing on EHR and environmental datasets, and passed to the Ranking Agent ARA.

The retrieval path for this example is simple: RankingAgent --retrieved_from--> Exposures Provider

The edge as represented in an ARA message:

"edges": [
  {
  "id": "Association001",
  "category": "biolink:FeatureVariableAssociation",
  "subject": "PM2.5 Exposure",
  "predicate": "biolink:correlates_with",
  "object": "ED Visits for Asthma",
  "attributes":  [
    {
      "attribute_type_id": "biolink:original_knowledge_source",  
      "value":  "infores:exposures_provider",  
      "value_type_id": "biolink:InformationResource",     
      "value_url":  "https://icees.renci.org:16340/openapi.json",
      "description": "The Exposures Provider ...",
      "attribute_source": "infores:exposures_provider"  # Convention should be for KPs that generate novel associations to capture themselves as the original_source?
    },
    {
      "attribute_type_id": "biolink:supporting_data_source",   # The information resources from which the data used to generate the association were retrieved
      "value":  ["Carolina Data Warehouse for Health API", "US EPA CMAQ Airborne Exposures API", "US DOT Roadway Exposures API", "US Census Bureau ACS API"],
      "value_type_id": "biolink:InformationResource",     
      "value_url":  . . . ,
      "description": . . ., 
      "attribute_source": "infores:exposures_provider"  
    },
    {
      "attribute_type_id": "biolink:supporting_dataset",   # The specific datasets used to generate the association
      "value": ["Carolina Data Warehouse for Health Dataset", "US EPA Community Multiscale Air Quality Modeling Dataset", "US DOT Highway Patrol Monitoring Dataset", "US Census Bureau American Community Survey Dataset"],
      "value_type_id": "biolink:Dataset",     
      "description": . . .,
      "attribute_source": "infores:exposures_provider"  
    }
   ]
  }
 ]

This example raises several other questions to consider, including:

how to identify/reference datasets
how to reference multiple data sources and supporting datasets (list in a single attribute, or split into separate attributes),
if/how to capture aggregators for supporting data sources (as we do for knowledge sources),
the utility of a referenceable bundle of general provenance information that can be pointed at when a message or KG provides many Edges that share the same provenance.
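
On the second question (one list-valued attribute vs separate attributes), the two forms are mechanically interconvertible in one direction. The Python sketch below expands a list-valued attribute into one attribute per value; `split_attribute` is a hypothetical helper, and the input mirrors the 'supporting_data_source' attribute from the ICEES example with only the fields needed here.

```python
import copy
from typing import Any, Dict, List


def split_attribute(attribute: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand a list-valued attribute into one attribute per value.

    Attributes whose value is not a list are returned as a single copy.
    The input is never modified.
    """
    value = attribute.get("value")
    if not isinstance(value, list):
        return [copy.deepcopy(attribute)]
    result = []
    for item in value:
        single = copy.deepcopy(attribute)
        single["value"] = item
        result.append(single)
    return result


bundled = {
    "attribute_type_id": "biolink:supporting_data_source",
    "value": [
        "Carolina Data Warehouse for Health API",
        "US EPA CMAQ Airborne Exposures API",
        "US DOT Roadway Exposures API",
        "US Census Bureau ACS API",
    ],
    "value_type_id": "biolink:InformationResource",
    "attribute_source": "infores:exposures_provider",
}

for attribute in split_attribute(bundled):
    print(attribute["value"])
```

Once split, each attribute could carry its own 'value_url', 'description', or nested attributes, which a single list-valued attribute cannot do per item.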

Proposal 3c is implemented in a branch here, with PR https://github.com/biolink/biolink-model/pull/746

As indicated by Matt above, Proposal 3c is implemented in a branch here, with PR #746

Association slots for 'source provenance' tracking #716