Representing different levels of provenance for 'informational entities'

Informational entities such as VA statements can be represented at different 'levels of abstraction' - from an object representing the statement's abstract information content, to an object representing a specific encoding of that information in a particular format, syntax, and location. For our purposes let's call the abstract statement a knowledge-level object and its concrete encodings resource- or record-level objects (these are often referred to as "digital artifacts" or "web resources"). Background reading on these concepts and relevant models/terminologies can be found at [1], [2], [3].

As an example, consider the statement captured in ClinVar SCV000301326. At the abstract level there is a single statement being made here - a claim made by the ENIGMA Consortium on Sep 8, 2016 that "NM_000059.3:c.8969G>A is pathogenic for Breast-ovarian cancer, familial 2". However, there may be multiple concrete forms in which this one knowledge-level statement is encoded - at different times, by different people, and in different formats. For example, it may appear as a sentence in a publication, as a blob of XML data held in ClinVar's systems, or as a blob of JSON created by CellBase after ingesting and transforming the ClinVar record. These concrete representations are considered distinct resource-level artifacts that express the same underlying statement.

Importantly, each of these resource-level artifacts has its own provenance at this concrete level of representation (who created the XML or the JSON, when, and how) that is distinct from the provenance of the knowledge-level statement itself (who asserted the knowledge, when, and how). Different provenance metadata may apply to the knowledge-level and resource-level representations of a statement - and we will likely need to support capturing this provenance metadata at both levels. Provenance at the knowledge level is generally concerned with knowledge creation: who put the original statement forth as true, when, and how. It is agnostic to the concrete form in which the statement may be represented. Provenance at the resource level is generally concerned with data object generation: how a particular concrete form/expression of the statement was created in a specific format and location, what agents/tools were used in doing so, and what external sources were ingested/transformed in the process.

Below we list different types of provenance information that may be relevant to capture for a statement at the knowledge- and/or resource-level. The examples are based on a scenario in which the variant annotation aggregator CellBase ingests a ClinVar SCV and transforms it into a JSON-encoded digital artifact that it provides via its API, and they illustrate the provenance CellBase might want to capture:

1. The Agent (person, organization, computational agent) who originally asserted the statement to be true (i.e. created the information content of the statement) . . . and when/how they did this. e.g. ENIGMA Consortium, using the ENIGMA Consortium 2015 criteria, on 2016-09-08
2. The Agent/Tool that created a specific concrete encoding of the statement (i.e. created a particular digital artifact/resource encoding of this statement - such as a VA-compliant JSON representation, versus the native ClinVar XML representation of this same SCV) . . . and when/how this was done. e.g. the CellBase software tool
3. An Agent who provided the statement to some aggregator to ingest and provide. e.g. the ClinVar Organization
4. A specific external record of the statement from which the CellBase record/encoding was derived. e.g. the ClinVar XML record of the SCV in the ClinVar database
5. The external database/dataset from which the CellBase record/encoding of the statement was ingested/transformed. e.g. ClinVar dataset, release x.x
6. The internal dataset/database in which the CellBase record is currently expressed. e.g. CellBase dataset, release x.x
7. Other information entities/statements that provided evidence supporting the statement being made in the first place. e.g. a prior assertion that a different genomic variant which results in the same protein change was pathogenic for Breast-ovarian cancer.

I would argue that some of these metadata describe provenance of the knowledge-level statement and the information content it carries (specifically no. 1 and 7 above), while other metadata describe provenance of a specific resource-level record of the statement in CellBase as a concrete digital artifact (no. 2, 3, 4, 5, 6 above). Particularly important is distinguishing the two notions of the "creator" of the statement in no. 1 and no. 2. Here we must separately consider the creator of the information content held in the statement - i.e. who put it forth as true (no. 1), and the creator of a particular concrete encoding as a digital artifact/resource (no. 2) - which are typically different Agents.
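
The mapping argued for here can be written down as a small lookup table. This is only a sketch: the short labels and the names `PROVENANCE_LEVEL` and `items_at_level` are invented for illustration and are not part of any VA schema.

```python
# Hypothetical short labels for the seven provenance types listed above,
# keyed by their number in that list, with the level each is argued to describe.
PROVENANCE_LEVEL = {
    1: ("original asserter of the statement", "knowledge"),
    2: ("creator of a concrete encoding", "resource"),
    3: ("provider of the statement to an aggregator", "resource"),
    4: ("external source record", "resource"),
    5: ("external source dataset", "resource"),
    6: ("internal dataset holding the record", "resource"),
    7: ("supporting evidence statements", "knowledge"),
}


def items_at_level(level):
    """Return the list numbers whose provenance applies at the given level."""
    return sorted(n for n, (_, lvl) in PROVENANCE_LEVEL.items() if lvl == level)


print(items_at_level("knowledge"))  # [1, 7]
print(items_at_level("resource"))   # [2, 3, 4, 5, 6]
```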

References:

[1] the FRBR model ('Works', 'Expressions', 'Manifestations', 'Items' as levels at which creative works can be cataloged/described)
[2] the HCLS Dataset Description model ('Summary', 'Version', and 'Distribution' level representations of datasets)
[3] the Basic Formal Ontology (BFO) and Information Artifact Ontology (IAO) (describes 'Information Content Entities' and their 'concretizations')

So how might we model the metadata and distinctions laid out above?

One approach is to create separate objects in the data to represent abstract, knowledge-level statements and their various concrete resource-level encodings. But in practice I suspect many implementations would find this untenable: it adds complexity to support nuance that users are happy to gloss over, and it poses a computational burden for creating and operating on these extra objects in the data. That said, I do think we need to be clear about these distinctions between knowledge- and resource-level provenance metadata in our model/documentation, and provide structure that separates them so they can be consistently captured by data creators and unambiguously understood by users.
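
To make the cost of this first approach concrete, here is a minimal sketch of what separate objects might look like. The class names, field names, and `ex:` identifiers are assumptions made up for this illustration; nothing here is a proposed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnowledgeStatement:
    """The abstract statement, carrying knowledge-level provenance only."""
    id: str
    subject: str
    predicate: str
    descriptor: str
    authored_by: str
    date_authored: str


@dataclass(frozen=True)
class StatementRecord:
    """One concrete encoding of a statement, with resource-level provenance."""
    id: str
    encodes: str                          # id of the KnowledgeStatement expressed
    encoding_format: str
    recorded_by: str
    date_recorded: Optional[str] = None   # None when not known


statement = KnowledgeStatement(
    id="ex:statement-1",
    subject="NM_000059.3:c.8969G>A",
    predicate="is pathogenic for",
    descriptor="Breast-ovarian cancer, familial 2",
    authored_by="ENIGMA Consortium",
    date_authored="2016-09-08",
)

# Two records of the same statement; each needs its own object and identifier.
records = [
    StatementRecord("ex:record-xml", statement.id, "XML", "ClinVar"),
    StatementRecord("ex:record-json", statement.id, "JSON", "CellBase", "2019-09-07"),
]

for record in records:
    print(record.id, "encodes", record.encodes)
```

Every consumer would have to follow the `encodes` link to get from a record to the knowledge it carries, which is the extra burden described above.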

Another approach (based on how SEPIO has proposed to handle this) is to allow knowledge- and resource-level metadata to hang from a single object, but organize them in separate structures so the distinction is clear. Here we would hang the knowledge-level evidence and provenance described in no. 1 and no. 7 above directly from a statement object. The resource-level provenance described in no. 2-6 above, which applies to a particular encoding of the statement, is grouped together under a recordMeta element.

This approach implicitly treats the statement object as a resource-level entity, and provides a clear structural separation between provenance metadata about the knowledge-level statement and that about the concrete digital resource encoding this statement - but does not require separate objects for representing the statement at both of these levels.

Applying this model to the CellBase-ingested ClinVar SCV from above may look something like the following. Note all attribute names are provisional, but many are based on names used by established standards / terminologies for capturing such things (e.g. PROV, PAV, HCLS, DublinCore, etc.).

{
# Base elements for the statement object
"id": "ex:12345",
"type": "va:PathogenicityInterpretation",
"description": "ENIGMA's assertion that NM_000059.3(BRCA2):c.8969G>A is pathogenic for Breast-ovarian cancer, familial 2.",

# Structured semantics of the statement, according to the VA model
"subject": "ga4gh:vr_2411346SAt4aG4tAh6436",
"predicate": "is pathogenic for",
"descriptor": "Breast-ovarian cancer, familial 2",
"variantOriginQualifier": "germline",

# Knowledge-level provenance info (who, when, how the knowledge was created)
"authoredBy": "ENIGMA consortium",   # who created the knowledge
"dateAuthored": "2016-09-08",   # when the knowledge was created
"specifiedBy": "ENIGMA Classification Criteria (2015)",   # method used to create the knowledge
"hasEvidenceItem": "clinvar:SCV4614643",   # a different statement used as evidence

# Record-level provenance metadata grouped under a recordMeta element, describing the provenance of this particular record of the statement and how it came to be in this database
"recordMeta": {
    "recordedBy": "CellBase Organization",   # Agent who created this concrete record of the statement
    "dateRecorded": "2019-09-07",   # when this particular record was created
    "source": "ClinVar June 2019 release",   # database/set from which the source statement was obtained
    "providedBy": "ClinVar Organization",   # Agent who provided or granted access to the database/set from which the statement was obtained
    "derivedFrom": "clinvar:SCV000301326",   # specific record in the source that this record was derived from (with retrieval or transformation)
    "partOf": "CellBase version 1.5.2",   # larger database/set of which this record is a part
    "url": "http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/hsapiens/genomic/variant/19:45411941:T:C/annotation"
    }
}

The idea is that this recordMeta object (or perhaps a complex data type) would be a re-usable structure that could be included and populated as desired for any object in the data - in cases where resource-level provenance is important to describe alongside knowledge-level provenance. At the end of the day this proposal is a very simple solution that does not require users to understand the ontological minutiae/nuance discussed above. It allows for capturing all types of provenance metadata we have found in examples inside the Statement object, but simply nests those relevant for the record-level encoding of the statement. Each provenance attribute/element will be clearly defined so it is clear where a given piece of metadata belongs - making things easy on data creators.
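
As a rough illustration of that re-usability, the sketch below attaches the same kind of nested structure to two different kinds of object. The helper `with_record_meta`, the `Transcript` object, and its `ex:` identifiers are hypothetical and exist only for this example; the attribute names follow the provisional ones used above.

```python
def with_record_meta(obj, record_meta):
    """Return a copy of obj with resource-level provenance nested under 'recordMeta'."""
    out = dict(obj)
    out["recordMeta"] = dict(record_meta)
    return out


# A statement carrying knowledge-level provenance at its top level.
statement = {
    "id": "ex:12345",
    "type": "va:PathogenicityInterpretation",
    "authoredBy": "ENIGMA consortium",
    "dateAuthored": "2016-09-08",
}

# Some other object in the data (hypothetical) that also needs record-level provenance.
transcript = {"id": "ex:transcript-1", "type": "Transcript"}

statement_record = with_record_meta(
    statement,
    {"recordedBy": "CellBase Organization", "derivedFrom": "clinvar:SCV000301326"},
)
transcript_record = with_record_meta(
    transcript,
    {"derivedFrom": "ex:some-source-record"},  # placeholder source identifier
)

print(sorted(statement_record["recordMeta"]))   # ['derivedFrom', 'recordedBy']
print(sorted(transcript_record["recordMeta"]))  # ['derivedFrom']
```

Because the resource-level attributes live in one nested element, a consumer interested only in the knowledge-level content can ignore that key entirely.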

Note there is an analogy between the proposed recordMeta object and the FHIR 'Meta' resource: https://www.hl7.org/fhir/resource.html#Meta . . . not exactly the same thing, but similar in purpose and implementation.

Discussion Questions:

Does the knowledge- vs resource-level distinction we define make sense, and seem a legitimate one to make in our model? (happy to consider other names for these levels)
Are the types of provenance metadata we enumerate clear and comprehensive? Are their mappings to knowledge- and resource-levels correct?
Does the proposal to use a nested element to capture resource-level metadata separate from knowledge-level metadata seem pragmatic and useful?
- Does this strike the right balance w.r.t barrier to adoption and clarity/computability?
- Do we foresee any problems this may cause (use cases not met, confusion among implementers or users, pushback for being too complex/onerous or too simplistic)?

A consequence of the proposed approach is that statement objects in a given dataset would be identified as resource-level entities. If two databases hold records of the same abstract statement, formally these are different instances at the resource level. We can provide a way to assert or infer equivalence at the knowledge-level between such statement records if we need to, but formally they are different entities.
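
One way such equivalence could be inferred is by comparing only the knowledge-level attributes of two records. The sketch below assumes, purely for illustration, that the five fields in `KNOWLEDGE_FIELDS` are enough to identify the abstract statement; which fields should actually define identity is an open question. The second record and its identifier are made up.

```python
# Assumed (not agreed) set of attributes that identify the abstract statement.
KNOWLEDGE_FIELDS = ("subject", "predicate", "descriptor", "authoredBy", "dateAuthored")


def knowledge_key(record):
    """Tuple of knowledge-level values; 'id' and 'recordMeta' are deliberately ignored."""
    return tuple(record.get(field) for field in KNOWLEDGE_FIELDS)


def same_statement(a, b):
    """True when two records express the same knowledge-level statement."""
    return knowledge_key(a) == knowledge_key(b)


shared = {
    "subject": "ga4gh:vr_2411346SAt4aG4tAh6436",
    "predicate": "is pathogenic for",
    "descriptor": "Breast-ovarian cancer, familial 2",
    "authoredBy": "ENIGMA consortium",
    "dateAuthored": "2016-09-08",
}

cellbase_record = {"id": "ex:12345", **shared,
                   "recordMeta": {"recordedBy": "CellBase Organization"}}
other_record = {"id": "ex:other-db-1", **shared,
                "recordMeta": {"recordedBy": "Another Database"}}  # hypothetical

print(same_statement(cellbase_record, other_record))  # True
print(cellbase_record["id"] == other_record["id"])    # False
```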

An alternative approach that keeps the statement object pure as a knowledge-level entity is to capture record-level metadata in the Variant Annotation object that wraps Statements and organizes them with related information to provide context for their interpretation and use. This would mean placing the recordMeta element in the Annotation object - so the Statement object remains free of resource/record-specific information. Implications of this are that:

If the annotation holds several statements (e.g. a primary and several supporting), we would potentially need several recordMeta objects to separately capture resource-level provenance metadata for each statement.
Serializations that aim to provide simple lists of statements could not provide resource-level metadata without wrapping each statement in an annotation object. This is an important and widely applicable use case, where I suspect that data creators want to be able to attach resource-level metadata directly to a statement object so as not to have to wrap each in an annotation. Confirm with javild.
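
Both implications can be seen in a small sketch of this alternative. The layout below (record-level metadata keyed by statement id inside the annotation, under a key called recordMeta as in the earlier example) is just one assumed way to do it, and the second statement and its source identifier are placeholders.

```python
annotation = {
    "id": "ex:annotation-1",
    "statements": [
        {"id": "ex:12345", "predicate": "is pathogenic for"},
        {"id": "ex:67890", "predicate": "is pathogenic for"},  # hypothetical
    ],
    # One record-level metadata entry per wrapped statement.
    "recordMeta": {
        "ex:12345": {"derivedFrom": "clinvar:SCV000301326"},
        "ex:67890": {"derivedFrom": "ex:source-record-2"},  # placeholder
    },
}


def statements_only(annotation):
    """A flat list of statements; record-level metadata is left behind."""
    return [dict(s) for s in annotation["statements"]]


def statements_with_meta(annotation):
    """Re-join each statement with its metadata, which requires the wrapper."""
    return [
        {**s, "recordMeta": annotation["recordMeta"].get(s["id"], {})}
        for s in annotation["statements"]
    ]


print(statements_only(annotation)[0])
print(statements_with_meta(annotation)[0]["recordMeta"])
```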

re point 7 above - for ENIGMA (and B challenge) and ClinVar, you will need to include more layers. Namely, there are multiple ACMG evidence codes, and for each one there might be multiple different sources that together provide sufficient information for a code to be met. Further, even if there is MORE than enough information to meet a code, it is always good to put it all there, since otherwise someone asks why it is missing!

re point 1 above - this might seem a bit circular, but do we need to somehow establish the provenance of the Agent? As described, we are making the assumption that the Agent is an "honest broker" who can be relied on to provide reliable data. Some Agents might be well known and familiar to data users, but some will not. How would I get in contact with ENIGMA if I knew nothing about them?

I did think about this during yesterday's TC, but things moved along so rapidly that I thought it best not to disturb the flow.

@mbrush in general I find the proposal to keep the resourceMetaData outside the annotation compelling. The arguments and examples you laid out above are great IMO. However, I think I would need to see a few more before I truly understand the implications. As you know I am quite familiar with the SEPIO model so I am falling back to comparing this with that approach.

As I look at the following excerpt from your example above

# Knowledge-level provenance info (who, when, how the knowledge was created)
"statedBy": "ENIGMA consortium"   #who created the knowledge
"dateStated":  "2016-09-08"  #when the knowledge was created
"specifiedBy": "ENIGMA Classification Criteria (2015)" #method used to create the knowledge
"hasEvidenceItem": "clinvar:SCV4614643"  #a different statement used as evidence

I am trying to understand the hasEvidenceItem attribute within the provenance for the knowledge statement. I may be confused, but is this where all evidence items that support the knowledge-level assertion would hang? Whether or not it is, it would be nice to see a slightly more complex and real example with "several" evidence items just so I can connect the dots to how we've been doing things with SEPIO.

My perception is that this record-level provenance meta object would be very helpful in solving a problem we have yet to figure out. I don't think we are really capturing this kind of data yet, even though it is something we all agree is important and needs to be done.

Lastly, I will note that HL7 FHIR uses derivedFrom in their Observation resources along with hasMember to enable groupings and associations that may intersect with some of the concepts you are presenting here. And while they've tried to keep their Resources simple by not creating new concepts (i.e. data types or resources), they seem to end up with these fairly denormalized structures that bundle all this kind of information together. And I must say it's challenging to figure out how to use these concepts in a "standard and reusable" way with others.

The current proposal for the RecordMetadata schema for the v0 VA Spec is summarized in the doc here, and the schema captured in the spreadsheet here.

Another potential use case for record-level metadata came up on the April 1 2020 VA call - exploring how to use this model to represent the 'source' of a transcript definition. i.e. where the information in a Transcript object came from. This might fit in the RecordMetadata.derivedFromSource field.

Use this as a test case to evaluate the proposed Record Metadata model.

ga4gh / va-spec

Representing different levels of provenance for 'informational entities' #49