NCATSTranslator / ReasonerAPI

NCATS Biomedical Translator Reasoners Standard API

Representing 'source retrieval provenance' in merged edges #369

Open mbrush opened 2 years ago

mbrush commented 2 years ago

A dedicated model to represent 'source retrieval provenance' has been proposed and discussed in several recent meetings, to better support emerging use cases around edge merging and answer debugging. The key requirement for the edge merging use case is to represent an ordered tree of retrievals that result from edge merging operations, where it is clear which source was primary/original and which were aggregators. Several approaches have been proposed and are discussed in the document here.

The general consensus from recent calls is summarized below:

  1. There is interest in exploring a dedicated structure in the TRAPI model for retrieval provenance (as opposed to using nested Attributes). Minimally, this would require a new type of object to hold retrieval provenance metadata, and a new edge property to point at it.
  2. We should start simple and focus on core requirements for edge merging. Avoid nested objects to the extent possible, and do not worry at this point about provenance metadata concerning each retrieval operation (when, who, how, access URL, etc.). But we may want a model that can be easily expanded to support this in the future (this is a key question that will play into the choice of proposals).

These priorities focused us on two candidate approaches:

Data examples illustrate how these two approaches would represent two retrieval scenarios (see the diagrams below, which are further described in the Google document):

image

image


Finally, note that this is related to the broader question of retaining EPC in merged edges, as discussed in #313.

mbrush commented 1 year ago

Adding a slight twist on Candidate A (let's call it Candidate A.1) that lets us keep using Attribute objects but offers some degree of structural separation between the Attributes that describe retrieval provenance (which is one draw of Candidate B) and the Attribute objects holding other types of edge metadata. It requires only the creation of a dedicated Edge property, separate from attributes, that will hold the Attribute objects used to describe source retrieval provenance (we might call this property retrieval_provenance_attributes, or just retrieval_attributes).

This would begin to address one of the concerns raised about Candidate A - which is that it is hard to find/assemble the Attribute objects describing retrieval provenance amongst the potentially tens of other Attribute objects hanging from a given Edge.
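To make the lookup difference concrete, here is a rough Python sketch rather than a proposed implementation. It assumes that under Candidate A the retrieval provenance Attributes sit in the general attributes list (as the concern above implies) and are recognized by their attribute_type_id. The type IDs chosen as markers, the infores values, the p_value attribute, and both helper function names are illustrative assumptions, not part of any proposal.

```python
# Assumed marker types for retrieval provenance Attributes (illustrative only).
RETRIEVAL_TYPE_IDS = {
    "biolink:primary_knowledge_source",
    "biolink:aggregator_knowledge_source",
}


def retrieval_attrs_candidate_a(edge):
    """Candidate A (as assumed here): scan all Attributes and filter by type."""
    return [
        attr
        for attr in edge.get("attributes", [])
        if attr.get("attribute_type_id") in RETRIEVAL_TYPE_IDS
    ]


def retrieval_attrs_candidate_a1(edge):
    """Candidate A.1: read the dedicated Edge property directly."""
    return edge.get("retrieval_attributes", [])


# Hypothetical Attribute objects; the infores identifiers are placeholders.
primary = {
    "attribute_type_id": "biolink:primary_knowledge_source",
    "value": "infores:example-primary",
}
aggregator = {
    "attribute_type_id": "biolink:aggregator_knowledge_source",
    "value": "infores:example-aggregator",
}
p_value = {"attribute_type_id": "biolink:p_value", "value": 0.01}

# Same edge under the two layouts.
edge_a = {
    "id": "e719491",
    "subject": "RXCUI:1544384",
    "predicate": "biolink:correlated_with",
    "object": "MONDO:0008383",
    "attributes": [p_value, primary, aggregator],
}
edge_a1 = {
    "id": "e719491",
    "subject": "RXCUI:1544384",
    "predicate": "biolink:correlated_with",
    "object": "MONDO:0008383",
    "attributes": [p_value],
    "retrieval_attributes": [primary, aggregator],
}

# Both layouts yield the same provenance Attributes; A.1 needs no type filter.
assert retrieval_attrs_candidate_a(edge_a) == retrieval_attrs_candidate_a1(edge_a1)
```

With A.1 a consumer does not need to know which attribute_type_id values count as retrieval provenance in order to collect them, which is the structural separation described above.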


  "edges": {
    "id": "e719491",
    "subject": "RXCUI:1544384",
    "predicate": "biolink:correlated_with",
    "object": "MONDO:0008383",
    "attributes": [ ],
    "retrieval_attributes": [  ]