ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

thoughts on data model for tracking object provenance #390

Open dglazer opened 8 years ago

dglazer commented 8 years ago

WARNING: long post below. I'm hoping to help us build a shared conceptual framework to guide ongoing API design, and given the underlying complexity and the number of different priors, don't know how to do it justice in fewer words.

My mental model of the genomics space is that there’s a “provenance chain” connecting objects that are derived from each other. Our API manages both data and metadata for the ‘dry’ objects (e.g. readgroupsets), and could be extended to manage metadata for the ‘wet’ objects (e.g. tissue samples). We currently do very little to represent the relationship between objects -- this issue explores what it would take to do more.

These thoughts are inspired by recent ongoing discussion about data models, including:

  • @jeromekelleher, in #380, pointing out the value of a well-defined data model
  • @diekhans, commenting on that issue (#380 (comment)) to point out the importance of provenance tracking
  • @lh3, in #383, starting a good conversation about how to improve part of our provenance tracking

The Provenance Chain in the Real World

The overall chain starts with an organism, and for our purposes ends with variants. There are several links along the way, in each of which an input object is processed to generate an output object.

Here’s a diagram representing a typical chain. (I’m far from an expert on the upstream ‘wet’ steps; the downstream ‘dry’ steps show the objects currently managed by our API.)

tissue source ⇒ tissue sample ⇒ prepped sample ⇒ rgset ⇒ aligned rgset ⇒ call-column
\-------------- wet stuff (atoms) ------------/    \-------- dry stuff (bits) --------/

Here are some examples of the processing that happens at each step, which our API would need to manage to capture full provenance:

  • tissue source ==> tissue sample: how was this sample gathered? (date, tissue type, collection method, ...)
  • tissue sample ==> prepped sample: how was this sample processed? (date, library prep, amplification, …)
  • prepped sample ==> unaligned readgroupset: how was this readgroupset generated? (date, sequencer, settings, …)
  • unaligned readgroupset ==> aligned readgroupset: how was this readgroupset aligned? (date, tools, versions, cmdline flags, …)
  • aligned readgroupset ==> call-column: how were these variants called? (date, tools, versions, cmdline flags, …)

Note that it’s possible to process each input object in different ways at different times, which means there can be many outputs for each input. That leads to one-to-many relationships, where there can (e.g.) be multiple unaligned readgroupsets for a single prepped sample, or multiple aligned readgroupsets for a single unaligned one.

API Requirements to Represent the Provenance Chain

Object Definitions

To capture the full provenance of each object, we need to represent:

  1. who’s my parent? (i.e. the object from which I was derived)
    • typically represented with some sort of id scheme, such as:
      • ad hoc user-populated text sitting in names or metadata (ugly, but often state of the art)
      • user-assigned ids that must be unique within some scope, such as a variantset
      • server-generated ids that are unique within a server instance
      • globally unique registry ids that refer to external registries of some sort
      • globally unique content-based ids, computed by content checksumming of some sort
  2. how was I generated? (i.e. the processing done on my parent to generate me)
    • typically represented with metadata fields
    • we can use pre-defined first-class records and/or free-form key-value pairs
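
To make the two requirements above concrete, here is a minimal sketch (not a schema proposal; every field name here is hypothetical) of a provenance record that answers both questions -- a reference to the parent object plus a description of the processing step that produced the child:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class ProvenanceInfo:
        """Hypothetical sketch: 'who is my parent' plus 'how was I generated'."""
        parent_id: Optional[str] = None           # server-assigned id of the source object
        parent_external_id: Optional[str] = None  # optional registry id, if one exists
        process: Dict[str, str] = field(default_factory=dict)  # e.g. tool, version, date

    # Example: an aligned readgroupset derived from an unaligned one.
    aligned_provenance = ProvenanceInfo(
        parent_id="rgset-123",
        process={"tool": "bwa mem", "version": "0.7.12", "date": "2015-08-01"},
    )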

Method Definitions

Applications that want to work with provenance might need to know:

  1. what’s the full provenance of this object?
    • one natural way to support this request is to include the provenance in the object definition, so it’s returned by the object Get methods
    • another possible way is to define new GetProvenance(<object id>) methods for some object types
  2. what are all the objects (of a given type) that came from this object?
    • one way to support this request is to include optional “filter by parent” fields in the Search methods that return lists of objects
    • another possible way is to define new GetChildObjects(<object id>) methods for some object types
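
As a purely illustrative sketch of those two access patterns, the following assumes a hypothetical client whose get_object() and search_objects() calls stand in for the Get and Search methods under discussion (neither exists in the current schemas), plus a hypothetical parent_id field on each returned object:

    def full_provenance(client, object_id):
        """Pattern 1: walk parent links back to the root of the chain."""
        chain = []
        obj = client.get_object(object_id)
        while obj is not None:
            chain.append(obj)
            parent_id = obj.get("parent_id")  # hypothetical field
            obj = client.get_object(parent_id) if parent_id else None
        return chain

    def children_of(client, object_type, parent_id):
        """Pattern 2: all objects of a given type derived from this object."""
        return client.search_objects(type=object_type, parent_id=parent_id)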

Drilling into API Design: A Sketch

Notes:

  • This section contains early, only partially fleshed out, personal opinions, included mostly to give a concrete example of how we could shape our API to reflect the data model discussed in this issue. We need to first reach agreement on the data model; then we can discuss the tradeoffs of different specific representations.
  • Many of the details here have been discussed at length by the Metadata TT. I’ve included a few concepts I first saw there, but have almost certainly left out many good ideas in this sketch.

Object Definitions

  1. Every object needs an id, so its children can unambiguously refer to it.
    • every object should have a server-generated id, that’s unique within a server instance. We should use these ids as the primary means of joining children to parents. (We can also have query methods that optionally join by other attributes for convenience.)
    • we should use external registry ids whenever well-supported registries exist, but we can’t count on them existing for all object types, and we don’t want to force users to register every object. Therefore many object types should have an optional ‘external id’ field, but it should never be required.
    • content-based ids are awesome, but they’re tricky to get right, and may take some time to nail down for all objects. Therefore we should adopt them wherever we’re comfortable there’s a good answer (as there is for references), but not force them into use elsewhere.
  2. We should continue to use a mix of first-class fields and key-value arrays to describe how each object was derived from its parent.
    • the details of which fields need to be first-class will be different for each object type
    • in an ideal world the first-class fields would always be sufficient, but I don’t think we’ll get to that world quickly enough, and therefore should continue to support key-value arrays as an escape valve

Method Definitions

  1. We should include full “how was I derived” info in the object, so it’s returned by every Get method.
  2. We should add parent-id fields to most Search methods, so you can limit the results to only the children of a specific parent.
    • for example, SearchReadGroupSetsRequest should optionally include the id of the prepped sample from which the reads were sequenced
    • in some cases we’ll want to include grandparents and great-grandparents, so for example we might want to get a list of all the calls in a particular VariantSet that, via several intermediate steps, came from the same living organism
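
To make the grandparent case concrete, here is a hedged sketch (same hypothetical client and parent_id field as the earlier sketch) of narrowing the CallSets in a VariantSet down to those that trace back, through any number of intermediate objects, to one organism:

    def traces_back_to(client, object_id, ancestor_id):
        """True if ancestor_id appears anywhere in this object's parent chain."""
        current = object_id
        while current is not None:
            if current == ancestor_id:
                return True
            current = client.get_object(current).get("parent_id")  # hypothetical field
        return False

    def callsets_from_organism(client, variant_set_id, organism_id):
        callsets = client.search_objects(type="CallSet", variant_set_id=variant_set_id)
        return [cs for cs in callsets if traces_back_to(client, cs["id"], organism_id)]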

Note on derived from vs. contains

In addition to the derived from relationships discussed above, there are also contains relationships where some objects logically include sets of other objects. For example:

  • a dataset logically contains one or more readgroupsets and variantsets
  • a readgroupset logically contains one or more readgroups, each of which logically contains many reads -- see the header comment in reads.avdl for details
  • a variantset logically contains one or more columns of calls

Those relationships are also important, but they’re distinct from the provenance chain. If we do want to discuss them, I suggest doing so in a separate thread. (Personally, I don’t think they need revisiting, at least for now -- they seem to be doing their job well.)

lh3 commented 8 years ago

One quick comment. What is missing here is the "sample" computational people talk about daily. It at times includes multiple prep samples/libraries, so it is different from a prepped sample/library. The name of this "sample" is the one we see in BAM and VCF headers. It is unique in a project and links all types of data in the project.
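
For readers less familiar with where that name lives, here is a small illustration (synthetic header fragments, plain string handling) of the same sample name appearing as the SM tag of a BAM @RG line and as a genotype column in a VCF header; this shared name is what links reads and calls today:

    # Synthetic header fragments, for illustration only.
    bam_rg_line = "@RG\tID:run42.lane3\tSM:NA12878\tLB:lib1\tPL:ILLUMINA"
    vcf_header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878"

    rg_tags = dict(f.split(":", 1) for f in bam_rg_line.split("\t")[1:])
    bam_sample = rg_tags["SM"]                     # "NA12878"
    vcf_samples = vcf_header_line.split("\t")[9:]  # ["NA12878"]

    assert bam_sample in vcf_samples  # the ad hoc join key between reads and calls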

mlin commented 8 years ago

I can strongly recommend looking at the recent work of the ENCODE Data Coordination Center for a tour de force on this topic. Example: https://www.encodeproject.org/experiments/ENCSR897KTO/ . Notice, it documents both the wet lab protocol and the bioinformatics. Their system is informed by lots of experience in biocuration databases now applied to NGS data processing, so it should provide a very suitable conceptual framework.

DNAnexus also has an automatic provenance tracking system which is distinct from ENCODE's (although we work together) and not as sophisticated but slightly more generic. I will provide some illustrations when I get a chance.

diekhans commented 8 years ago

This thing called `sample' is really the display name of variant calls on an alignment of read groups.

Calling this `sample' in a display to a user is a perfect way to summarize it.

Calling this `sample' in an API that is designed to accurately exchange data is a confusing misnomer. If it becomes truly necessary to have data structures to store display names to short-cut multiple queries to the API, that is reasonable. However, we need to get an accurate representation of the data first.

I always interpreted `name' as a display name and `id' as a unique id of the data. However, since this is not documented, it turns out I am wrong.


diekhans commented 8 years ago

@mlin my personal litmus test for the design of the GA4GH API is whether we can cleanly and accurately represent TCGA and ICGC data.

Hopefully far more cleanly than the current two-dimensional data files used by the DCC allow.


dglazer commented 8 years ago

@lh3 -- I don't know which specific wet objects we want to manage with our API. Today we don't formally support any, my original diagram shows a possible world where we choose to support three, and the metadata team is working on a richer hierarchy. We could choose to collapse it all into supporting exactly one for now, leading to something like:

 "informatics sample"   ⇒ rgset ⇒ aligned rgset ⇒ call-column
\- wet stuff (atoms) -/    \-------- dry stuff (bits) --------/

I think deciding which objects to support is important, and largely resolvable in parallel with the main thrust of this issue, which is how to represent provenance between those objects. Whichever objects we decide to support, I believe:

dglazer commented 8 years ago

@mlin -- thanks for the ENCODE pointers. At first glance, that experiment page you link to suggests that ENCODE is using a model that fits the framework discussed here -- several objects of different types, with each object having a way to say "here's my parent" and "here's how I was generated from my parent". Good if so, and I expect the metadata TT is familiar with their work already.

A quick search on their site didn't turn up any doc on the details of their object model (e.g. specifically how object ids are defined, and used to let objects refer to their parents). I found https://www.encodeproject.org/help/rest-api/, but it mostly says "GET on any object returns a bunch of JSON" and "here's how to invoke our free-text site-wide search". Are there more details somewhere else?
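
For anyone who wants to poke at it, the pattern their REST help page describes is plain JSON over GET; a minimal sketch (assuming the experiment URL above still resolves and honors an Accept: application/json header) looks like:

    import requests

    url = "https://www.encodeproject.org/experiments/ENCSR897KTO/"
    doc = requests.get(url, headers={"accept": "application/json"}).json()

    # ENCODE objects carry links to related objects (files, replicates, libraries, ...),
    # which is how a client can walk their provenance graph.
    print(doc.get("@type"), len(doc.get("files", [])))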

dglazer commented 8 years ago

@diekhans -- I agree we need clear and agreed-on names and definitions for the objects we choose to support, and I agree that "sample" doesn't pass that test, because of the many different priors people have about what it means. (I'm not necessarily pushing back on the concept itself; I just want to understand it better before I weigh in on how useful it is to our API.)

I think you're right (i.e. you're wrong that you're wrong :) about:

I always interpreted `name' as a display name and `id' as a unique id of the data. However, since this is not documented, it turns out I am wrong.

I believe our API should use "id" for unique data ids, and "name" for display names. Today's API is almost fully consistent with that principle, modulo a few points:

I hope the discussion in this issue will lead to agreement and documentation of the principles, which we can then use to clean up those API rough edges.

mbaudis commented 8 years ago

@dglazer On the "wet side of things" there will be a possibly intricate object hierarchy, which will be incompletely represented for most use cases. We currently have something like Individual => Sample => Experiment (meaning library prep here, though I have problems with the technical limitation there) => Analysis (interpreted result). This would obviously be a simplified representation, and we'll try to make it suitable for other cases (pooled samples and such).

I am a strong supporter of object id references irrespective of a strict object type (though there will have to be implementation checks). In principle, you just have to define a derivedFrom attribute (list), which then points to one or several parental objects (individual, sample, sample preparation).

This would be the most viable solution schema-wise. However, for GA4GH-compatible data management systems, one should point out an "ideal" structure (i.e. representing complete hierarchies).
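
A minimal sketch of that idea (field and id values invented here for illustration): every object carries a derivedFrom list of ids, the ids are type-agnostic, and a record can therefore point at one or several upstream objects of any kind:

    prepped_sample = {
        "id": "obj-7c2f",
        "type": "SamplePreparation",
        "derivedFrom": ["obj-19ab"],              # the tissue sample
    }
    pooled_library = {
        "id": "obj-90de",
        "type": "SamplePreparation",
        "derivedFrom": ["obj-19ab", "obj-22cd"],  # pooled from two samples
    }

    def parents(store, obj):
        """Resolve derivedFrom ids against whatever store the server keeps."""
        return [store[pid] for pid in obj.get("derivedFrom", [])]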

lh3 commented 8 years ago

@mlin The Encode model you showed is closer to the process of data analysis, but our API mainly aims at accessing processed results. Related of course, but different in focus.

@diekhans I am not familiar with TCGA/ICGC data. Could you show a concrete use case where the model in #383 falls short? I find such challenges effective most of the time.

@dglazer My general view is that the word "sample" computational people use daily (i.e. informatics sample -- a bad name, I admit) has abstracted most biological complications away but is sufficient for most data access. This is the right type of object to model in the API. This "sample", like many other container types, is subjectively defined by submitters within the scope of their own projects. Note that BioSample is not the right thing to model. For example, 1000g pilot 1 sequenced NA12878 with Illumina/454/solid. These are three SRA samples. We can find the information in the BAM header, but when we say "NA12878" in the context of pilot 1, it means the ensemble of the three biological samples.

EDIT: I realize the primary role of Sample in #383 is to match the "sample" in the mind of submitters.

diekhans commented 8 years ago

@dglazer, this is a very key comment:

`My mental model of the genomics space is that there’s a “provenance chain” connecting objects that are derived from each other.'

One of the biggest failures of bioinformatics is our tendency to draw scientific conclusions based on data of very nebulous origins. GenBank is a canonical example. Data is deposited with very loose provenance tracking and with metadata of varying and limited information.

Still, GenBank is one of our primary resources for understanding the genome. Using GenBank is closer to archaeology than experimental science. Horror stories abound. GenBank used to have more than 30,000 mRNA sequences that were actually naive gene predictions based on ESTs. Once we finally figured this out, it took years to get the entries removed.


diekhans commented 8 years ago

Heng Li notifications@github.com writes:

@mlin The Encode model you showed is closer to the process of data analysis, but our API mainly aims at accessing processed results. Related of course, but different in focus.

@lh3, you have brought up the root of all of this confusion!!

Many of us don't see the goal of the API as accessing processed results. We feel the goal is to be able to accurately exchange genomic data in an interoperable way. This includes the full spectrum of metadata that describes each data object and provides provenance information on how it was generated.

If the goal is accessing results, BAM already works great and VCF is tolerable. Why spend so much time on conference calls?

@diekhans I am not familiar with TCGA/ICGC data. Could you show a concrete use case where the model in #383 falls short? I find such challenges effective most of the time.

Here is the current TCGA data primer. A lot to dig through. https://wiki.nci.nih.gov/display/TCGA/TCGA+Data+Primer

The sample-through-variant-call data model more or less follows the old SRA data model of

Sample -> Experiment -> Run -> Analysis(mapping) -> Analysis(variant calling)

[note that SRA has since collapsed Run and Analysis and in the process lost a lot of ReadGroup metadata. Broad and UCSC were ignored on this concern]

#383 fails to address data provenance by collapsing a complicated chain of events. This leads to naive interpretation of the data. Batch effects are very important at each level of analysis.

One can always, and usually should, collapse a rich data view for display purposes.


diekhans commented 8 years ago

This PR is an excellent start at a data model that accurately represents the data. Some of the terminology differs from common practice, but the ideas are the same. Will try to dig up some docs on other systems representing this.

Much of this is already covered in the metadata schema, although the disconnection of the metadata task team from the other task teams has not led to an integrated model.

mbaudis commented 8 years ago

And for a provenance chain, possibly across repositories, [gu]uid implementations are IMHO the most promising way to go. If a sample is tagged as derived from an object with a given uuid, one (should) know about the object's type. And if having pointers to multiple upstream objects, one can pick XOR merge data (e.g. based on object generation timestamp ...).

But nothing will fully prevent duplicated data objects based on the same "wet" object (besides a registry, which obviously will be implemented by someone but should definitely not be required).
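
For what it's worth, here is a tiny sketch of that tagging scheme using standard-library UUIDs (purely illustrative; none of these field names come from a GA4GH schema):

    import uuid
    from datetime import datetime, timezone

    def new_object(obj_type, derived_from=()):
        """Mint a globally unique id and record upstream pointers plus a timestamp."""
        return {
            "id": str(uuid.uuid4()),
            "type": obj_type,
            "derivedFrom": list(derived_from),
            "created": datetime.now(timezone.utc).isoformat(),
        }

    sample = new_object("Sample")
    rgset = new_object("ReadGroupSet", derived_from=[sample["id"]])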

mbaudis commented 8 years ago

And re: @diekhans: YES! There has been a lot of preparation through metadata, but the MTT is a bit lost (sometimes in very specific discussions) if the data schema developers don't have a look / reach out / comment & criticise ...

lh3 commented 8 years ago

Several points:

  1. We have always been keeping a form of the center-sample-library-run hierarchy in the BAM header and thus as ReadGroup properties. You have the option to detect batch effects if you want to. The information is not lost. It is just not of the same importance as Sample in #383. Think about 1000g. How often does this hierarchy really matter? We are at risk of degrading user experiences if we expose too many experimental details most people don't care about.
  2. I see our mission is to find the lowest common denominator of major projects in a domain (e.g. DNA variants for now), abstract it into a data model and then to make the data of these projects available through this model such that users can access them in a uniform way. We have sequenced >100k non-tumor genomes/exomes so far (I will touch TCGA a bit later). Their lowest common denominator is very simple: a bunch of samples, reads sequenced from the samples and variants called from the reads. Making these available alone would tremendously benefit the community.
  3. In particular, I see we are trying to fit ourselves to the existing projects and to make the common parts available. We are not trying to, at least not trying hard to, advise projects to change their ways of doing things. Of course, this is my personal view only. It would be good to clarify the GA4GH view.
  4. On a refVar call, we discussed why we should develop a ref server when it has the same model as VCF. The consensus at the time was authorization. I don't fully agree with this consensus, but I think it is an important factor at least.
  5. For TCGA, I would model it the following way. We create a "TCGA-WG-somatic" dataset. We name samples like "HCC1187", "HCC1187-BL" (BL means blood sample), "HCC2218", "HCC2218-BL", etc. and link normal-tumor pairs as a property of Sample in #383 (it does not have this field right now, but it is not hard to add). We store the alignments for each sample with center, library, machine etc. as properties of ReadGroup -- 1000G does this as well. We keep merged somatic calls in one VariantSet composed of multiple tumor samples. Will this work for TCGA small somatic variants? (A rough sketch follows after this list.)
  6. At last, any specific suggestions on #383?
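
A rough sketch of the layout described in point 5 (all names illustrative; as noted, the pairing property does not exist in #383 yet):

    dataset = {"id": "TCGA-WG-somatic", "samples": []}

    def add_pair(tumor_name, normal_name):
        """Register a tumor/normal pair, linked via a hypothetical info property."""
        tumor = {"name": tumor_name, "info": {"matched_normal": normal_name}}
        normal = {"name": normal_name, "info": {"matched_tumor": tumor_name}}
        dataset["samples"] += [tumor, normal]

    add_pair("HCC1187", "HCC1187-BL")
    add_pair("HCC2218", "HCC2218-BL")
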
mlin commented 8 years ago

@dglazer re: ENCODE, their JSON schemas are here. I am cautious about the "parent" concept -- I personally would not say a readgroupset is a "parent" of variant calls derived from it, but rather that it was an input to the analysis which generated the variant calls. Also, it's not uncommon to need metadata about a genetic research subject's actual parents!

@diekhans @lh3 I strongly agree there's a big problem in the lack of shared understanding of the envisioned uses for these schemas/APIs. In general I like the direction Heng is going, to simplify and focus on core data models, but use cases are even more important than data models to my mind. Getting >100K genomes behind these APIs would be awesome, but I feel we ought to think more critically about how they'll then be used -- e.g. in genome browsers, variant callers, analytics like association studies, and the driver projects (ICGC, Matchmaker, BRCA Challenge) -- and whether they serve these applications well. Mark mentioned, and I have also questioned, the need for a newfangled wire format for the actual reads and variants -- compare to ENCODE's server that lets you richly surf metadata but for the bulk data just dispenses URLs to files in common formats, a very practical and engineering-cost-effective approach IMHO. Another thread perhaps.

jeromekelleher commented 8 years ago

@dglazer This is a great start on writing down our data model. I agree with the majority of what you're saying. Recording the provenance of the steps taken as we move from primary data to the inferences we make is essential. There's one thing that I disagree with though:

a variantset logically contains one or more columns of calls

I don't think this matrix metaphor is especially helpful. Why say the VariantSet contains columns of calls rather than that a VariantSet is a set of Variants? If we extract the 'column' of Calls from a VariantSet, it is largely meaningless: we don't know what position the Call is for, and we don't know what the substitution is. The aggregation of all the calls for a particular 'sample' (I'm using this in the sense of @lh3's #383 PR; 'NA12878' is an element of this set, whatever it is) isn't a useful thing to model separately in my view; it's just a property of the VariantSet.

dglazer commented 8 years ago

I just created PR #391, which maps out some of the schema changes we would need to adopt a provenance model like the one described here. It's not meant to be committed as is; the idea is to add concrete code to the abstract discussion.

There are obvious similarities to #383, as well as a few intentional differences -- I hope that having both of these examples will help us come up with an even better proposal.

dglazer commented 8 years ago

@jeromekelleher , re the matrix metaphor -- I agree that a naked column of all the Calls from a Callset isn't meaningful (since we don't know the row-specific position and substitution info), but I'd say the same about a naked row of Calls from a Variant (since we don't know the column-specific origin and processing info). That's why I find the matrix metaphor so apt -- it lets us work with the properties that all Calls in a row share (i.e. the Variant data), and the properties that all Calls in a column share (i.e. the Callset data).

Similarly, there are some use cases (e.g. how is this variant distributed in my population of interest?) that drive row-centric access, and others (e.g. does my patient have any interesting variants in this region?) that drive column-centric access. That's why I think both are needed.
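
To illustrate the two access patterns with toy data (a minimal sketch, not tied to the actual schema types):

    # calls[(variant_id, callset_id)] -> genotype: the 'matrix' of Calls.
    calls = {
        ("var1", "NA12878"): "0/1",
        ("var1", "NA12891"): "0/0",
        ("var2", "NA12878"): "1/1",
    }

    def row(variant_id):       # row-centric: how is this variant distributed?
        return {cs: gt for (v, cs), gt in calls.items() if v == variant_id}

    def column(callset_id):    # column-centric: what does this callset carry?
        return {v: gt for (v, cs), gt in calls.items() if cs == callset_id}

    print(row("var1"))         # {'NA12878': '0/1', 'NA12891': '0/0'}
    print(column("NA12878"))   # {'var1': '0/1', 'var2': '1/1'}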

dglazer commented 8 years ago

@mlin , re "parent" -- I think we agree on the core concept, which is that

a readgroupset is ... an input to the analysis which generated the variant calls.

We can describe that in several synonymous ways; if we agree they're synonymous, let's find which one(s) communicate best and use them. E.g.:

  • a column of calls is derived from a readgroupset
  • a readgroupset is the basis used to generate a column of calls
  • a readgroupset is the input; a column of calls is the output
  • a readgroupset is the parent; a column of calls is the child

dglazer commented 8 years ago

@lh3 , re a few of your points:

1) We have always been keeping a form of the center-sample-library-run hierarchy in the BAM header and thus as ReadGroup properties. You have the option to detect batch effects if you want to. The information is not lost. It is just not of the same importance as Sample in #383. Think about 1000g. How often does this hierarchy really matter? We are at risk of degrading user experiences if we expose too many experimental details most people don't care about.

If we're happy with the ad hoc provenance in today's BAM and VCF files, then I agree we don't need to make changes. If we want to add a formal provenance model, then we need to at least better document the intended behavior for handling today's ad hoc sample tags, and maybe go as far as the full model discussed here, incorporating whatever ideas we think best match that from #383 and #391.

2) I see our mission is to find the lowest common denominator of major projects in a domain (e.g. DNA variants for now), abstract it into a data model and then to make the data of these projects available through this model such that users can access them in a uniform way. We have sequenced >100k non-tumor genomes/exomes so far (I will touch TCGA a bit later). Their lowest common denominator is very simple: a bunch of samples, reads sequenced from the samples and variants called from the reads. Making these available alone would tremendously benefit the community.

I agree that would be useful. I think it's possible to represent that lowest denominator today, but only if we're okay with the ad hoc way that provenance is represented in source data, and we're okay documenting similar ad hoc conventions for the use of the sampleId field. I haven't checked, but I believe that Google's public copy of 1000g is indeed logically joinable by sampleId; it's just far from robust. (Although it's probably as robust as the source data.) If we want to do better, we need to agree on a model and then a schema.

3) In particular, I see we are trying to fit ourselves to the existing projects and to make the common parts available. We are not trying to, at least not trying hard to, advise projects to change their ways of doing things. Of course, this is my personal view only. It would be good to clarify the GA4GH view.

I completely agree that we need to strike a balance between over-fitting to the status quo (and therefore not advancing the state of the art) and over-requiring radical change (and therefore blocking adoption). These conversations are how we can find that balance.

jeromekelleher commented 8 years ago

@dglazer --- Fair enough. However, there are plenty of cases where we want to get information about a variant without any Call data. If we just want to know about the variants that exist and their base compositions, then we don't want Calls at all. There is a strong asymmetry between the rows and columns of this metaphorical matrix... Anyway, this is off-topic, and getting away from the main point.

jjfarrell commented 8 years ago

@dglazer The core concept described earlier does capture the data flow found in joint genotyping to create a multi-sample, analysis-ready VCF (as found in the GATK pipeline https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2).

In the most recent GATK pipeline, a joint-genotyped VCF is actually called from a set of GVCFs https://www.broadinstitute.org/gatk/guide/article?id=4017 (not directly from bam files or readgroupsets). This helps solve both the scalability and the n+1 issue. When a new sample is sequenced, a GVCF is generated for the sample. A new joint-genotyped VCF is then relatively quickly called on the n+1 GVCFs instead of using the much slower method on n+1 bam files.

So in this case, a callset contains multiple samples. This multi-sample callset undergoes further filtering and annotation to create an analysis-ready call set.

The quality of the genotypes in a joint-genotyped VCF will be much better than in a VCF based on one sample. So it is important to track that a set of calls across samples was made jointly instead of individually.
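
In the derived-from terms used earlier in this thread, that pipeline gives the joint call set many parents -- one GVCF per sample -- rather than a single readgroupset. A hedged sketch, reusing the hypothetical derivedFrom-list idea from earlier comments:

    gvcfs = [
        {"id": f"gvcf-{name}", "type": "GVCF", "derivedFrom": [f"bam-{name}"]}
        for name in ("NA12878", "NA12891", "NA12892")
    ]

    joint_calls = {
        "id": "vcf-joint-trio",
        "type": "CallSet",
        "derivedFrom": [g["id"] for g in gvcfs],  # many parents, called jointly
        "process": {"tool": "GATK GenotypeGVCFs"},
    }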


mlin commented 8 years ago

@dglazer Indeed my reservations about "parent" are largely terminological. I see one second-order distinction among these suggestions. Calling the relationship "input" and "output" seems to demand some sort of model of the analytic steps in between. "Derived from" is more neutral and vague, which perhaps is appropriate at this point in time.

mlin commented 8 years ago

Here are a few illustrations of provenance tracking in DNAnexus, presented for informational purposes.

This is the BAM file for my genome. Let me click Info to see where it came from.

It looks like this BAM file was the product of Picard MarkDuplicates in March 2014. I can click through here to find out more about that job.

Here we see that Picard MarkDuplicates was the final step in an analysis (workflow execution) that had begun with four read groups (FASTQ pairs). The Gantt chart shows the execution history, followed by a flat list of inputs and outputs -- there are a couple of different ways to see the connectivity of the individual jobs without showing a big hairball. All the objects are linked by unique ID and will have their own "created by", so an API client can follow the chain all the way back to the original data uploaded. The information is all recorded automatically.

Observations:

  1. We don't have an explicit, required notion of "bioinformatics sample" or "SequencerInput". There are a variety of things our customers do in practice (not trying to say what's ideal, just what we see):
    • Organize related data through old-fashioned folders and filenames.
    • Set identifiers in properties/tags of individual data objects or the top-level analysis.
    • Set metadata links to a separate generic data object used to represent the sample/experiment/SequencerInput/whatever -- quite similar to the proposals in #383 and #391.
    • Keep sample-level information in an external LIMS, with consequently limited interest in representing it in DNAnexus.
  2. There are intermediate products here in between "unaligned readgroups" and "aligned readgroupset", and people make different decisions about what to toss or archive. Similarly there can be important intermediates between "aligned readgroupset" and "variantset", e.g. gVCF.
  3. There are input objects to the analysis, besides the unaligned readgroupset for the aligned readgroupset and the aligned readgroupset for the variantset, that are necessary for provenance and reproducibility. For example, the variant calling workflow I set up for my UYG data filters out variants in low-complexity regions, and a BED file listing these regions is an input. Others like to use GATK VQSR, which will take other inputs like a dbSNP flat file. Still others will insist on their snowflake perl script! @jjfarrell also mentions the good point that the analysis generating a variantset might have >10,000 inputs.

Trying to accommodate all of the above with a finite eng team has led us to the very execution-oriented provenance metadata system we have.

diekhans commented 8 years ago

Variants with no supporting Call data will come from databases like ClinVar. Sadly, a lot of those provenance trails will reach null.


dglazer commented 8 years ago

@mlin, re terminology -- glad we're in sync on the concepts; I'm not attached to any particular wording; we can nail that down when we get to final comments for any schema changes.

@mlin -- thanks for the DNAnexus provenance example; always helpful to see lessons from real-world implementations.

@jjfarrell -- good point re joint variant calling. I think that would be easy to fit into this conceptual framework (by saying that one column-of-calls can be derived from several upstream columns-of-calls), but the details could get messy.

@diekhans, @jeromekelleher -- if we want to discuss the variant / call relationship further, I suggest spinning up a new issue.