ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

thoughts on data model for tracking object provenance #390

Open dglazer opened 9 years ago

dglazer commented 9 years ago

WARNING: long post below. I'm hoping to help us build a shared conceptual framework to guide ongoing API design, and given the underlying complexity and the number of different priors, I don't know how to do it justice in fewer words.

My mental model of the genomics space is that there’s a “provenance chain” connecting objects that are derived from each other. Our API manages both data and metadata for the ‘dry’ objects (e.g. readgroupsets), and could be extended to manage metadata for the ‘wet’ objects (e.g. tissue samples). We currently do very little to represent the relationship between objects -- this issue explores what it would take to do more.

These thoughts are inspired by recent ongoing discussion about data models, including:

  • @jeromekelleher, in #380, pointing out the value of a well-defined data model
  • @diekhans, commenting on that issue (#380 (comment)) to point out the importance of provenance tracking
  • @lh3, in #383, starting a good conversation about how to improve part of our provenance tracking

The Provenance Chain in the Real World

The overall chain starts with an organism, and for our purposes ends with variants. There are several links along the way, in each of which an input object is processed to generate an output object.

Here’s a diagram representing a typical chain. (I’m far from an expert on the upstream ‘wet’ steps; the downstream ‘dry’ steps show the objects currently managed by our API.)

tissue source ⇒ tissue sample ⇒ prepped sample ⇒ rgset ⇒ aligned rgset ⇒ call-column
\-------------- wet stuff (atoms) ------------/    \-------- dry stuff (bits) --------/

Here are some examples of the processing that happens at each step, which our API would need to manage to capture full provenance:

  • tissue source ==> tissue sample
    • how was this sample gathered? (date, tissue type, collection method, ...)
  • tissue sample ==> prepped sample
    • how was this sample processed? (date, library prep, amplification, …)
  • prepped sample ==> unaligned readgroupset
    • how was this readgroupset generated? (date, sequencer, settings, …)
  • unaligned readgroupset ==> aligned readgroupset
    • how was this readgroupset aligned? (date, tools, versions, cmdline flags, …)
  • aligned readgroupset ==> call-column
    • how were these variants called? (date, tools, versions, cmdline flags, …)

Note that it’s possible to process each input object in different ways at different times, which means there can be many outputs for each input. That leads to one-to-many relationships, where there can (e.g.) be multiple unaligned readgroupsets for a single prepped sample, or multiple aligned readgroupsets for a single unaligned one.

API Requirements to Represent the Provenance Chain

Object Definitions

To capture the full provenance of each object, we need to represent (a schema sketch follows this list):

  1. who’s my parent? (i.e. the object from which I was derived)
    • typically represented with some sort of id scheme, such as:
      • ad hoc user-populated text sitting in names or metadata (ugly, but often state of the art)
      • user-assigned ids that must be unique within some scope, such as a variantset
      • server-generated ids that are unique within a server instance
      • globally unique registry ids that refer to external registries of some sort
      • globally unique content-based ids, computed by content checksumming of some sort
  2. how was I generated? (i.e. the processing done on my parent to generate me)
    • typically represented with metadata fields
    • we can use pre-defined first-class records and/or free-form key-value pairs
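
To make this concrete, here is a minimal illustrative sketch in Avro IDL (matching our .avdl schemas); the record and field names are hypothetical, not part of the current schemas:

  @namespace("org.ga4gh.models")
  protocol ProvenanceSketch {
    record ProvenanceInfo {
      // server-generated id of the object this one was derived from
      union { null, string } parentId = null;
      // optional globally unique ids (external registry, content checksum)
      array<string> externalIds = [];
      // how this object was generated from its parent; free-form key-value
      // pairs act as an escape valve alongside any first-class fields
      map<string> processing = {};
    }
  }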

Method Definitions

Applications that want to work with provenance might need to know (a sketch of the "new methods" option follows this list):

  1. what’s the full provenance of this object?
    • one natural way to support this request is to include the provenance in the object definition, so it’s returned by the object Get methods
    • another possible way is to define new GetProvenance(<object id>) methods for some object types
  2. what are all the objects (of a given type) that came from this object?
    • one way to support this request is to include optional “filter by parent” fields in the Search methods that return lists of objects
    • another possible way is to define new GetChildObjects(<object id>) methods for some object types
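
If we went the "new methods" route, the request/response records might look roughly like this (Avro IDL again; names purely illustrative, not a concrete proposal):

  @namespace("org.ga4gh.methods")
  protocol ProvenanceMethodsSketch {
    record GetProvenanceRequest {
      // server id of the object whose provenance chain is wanted
      string objectId;
    }
    record ProvenanceStep {
      // one link in the chain: the parent object plus a description
      // of the processing that produced the child
      string parentId;
      map<string> processing = {};
    }
    record GetProvenanceResponse {
      // ordered from the queried object back toward the original source
      array<ProvenanceStep> chain = [];
    }
    record GetChildObjectsRequest {
      // return ids of all objects directly derived from this one
      string objectId;
    }
    record GetChildObjectsResponse {
      array<string> childIds = [];
    }
  }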

Drilling into API Design: A Sketch

Notes:

  • This section contains early, only partially fleshed out, personal opinions, included mostly to give a concrete example of how we could shape our API to reflect the data model discussed in this issue. We need to first reach agreement on the data model; then we can discuss the tradeoffs of different specific representations.
  • Many of the details here have been discussed at length by the Metadata TT. I’ve included a few concepts I first saw there, but have almost certainly left out many good ideas in this sketch.

Object Definitions

  1. Every object needs an id, so its children can unambiguously refer to it.
    • every object should have a server-generated id that’s unique within a server instance. We should use these ids as the primary means of joining children to parents. (We can also have query methods that optionally join by other attributes for convenience.)
    • we should use external registry ids whenever well-supported registries exist, but we can’t count on them existing for all object types, and we don’t want to force users to register every object. Therefore many object types should have an optional ‘external id’ field, but it should never be required.
    • content-based ids are awesome, but they’re tricky to get right, and may take some time to nail down for all objects. Therefore we should adopt them wherever we’re comfortable there’s a good answer (as there is for references), but not force them into use elsewhere.
  2. We should continue to use a mix of first-class fields and key-value arrays to describe how each object was derived from its parent.
    • the details of which fields need to be first-class will be different for each object type
    • in an ideal world the first-class fields would always be sufficient, but I don’t think we’ll get to that world quickly enough, and therefore should continue to support key-value arrays as an escape valve

Method Definitions

  1. We should include full “how was I derived” info in the object, so it’s returned by every Get method.
  2. We should add parent-id fields to most Search methods, so you can limit the results to only the children of a specific parent (one possible shape is sketched after this list).
    • for example, SearchReadGroupSetsRequest should optionally include the id of the prepped sample from which the reads were sequenced
    • in some cases we’ll want to include grandparents and great-grandparents, so for example we might want to get a list of all the calls in a particular VariantSet that, via several intermediate steps, came from the same living organism
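
As one illustration of adding such a filter (field name hypothetical; the existing request fields are elided):

  @namespace("org.ga4gh.methods")
  protocol SearchFilterSketch {
    record SearchReadGroupSetsRequest {
      // ... existing fields (dataset, paging, etc.) elided ...
      // hypothetical addition: only return readgroupsets derived from
      // this prepped sample
      union { null, string } sampleId = null;
    }
  }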

Note on derived from vs. contains

In addition to the derived from relationships discussed above, there are also contains relationships where some objects logically include sets of other objects. For example:

  • a dataset logically contains one or more readgroupsets and variantsets
  • a readgroupset logically contains one or more readgroups, each of which logically contains many reads -- see the header comment in reads.avdl for details
  • a variantset logically contains one or more columns of calls

Those relationships are also important, but they’re distinct from the provenance chain. If we do want to discuss them, I suggest doing so in a separate thread. (Personally, I don’t think they need revisiting, at least for now -- they seem to be doing their job well.)

lh3 commented 9 years ago

One quick comment. What is missing here is the "sample" computational people talk about daily. It at times includes multiple prep samples/libraries, so it is different from a prepped sample/library. The name of this "sample" is the one we see in BAM and VCF headers. It is unique in a project and links all types of data in the project.

mlin commented 9 years ago

I can strongly recommend looking at the recent work of the ENCODE Data Coordination Center for a tour de force on this topic. Example: https://www.encodeproject.org/experiments/ENCSR897KTO/ . Notice that it documents both the wet lab protocol and the bioinformatics. Their system is informed by lots of experience in biocuration databases now applied to NGS data processing, so it should provide a very suitable conceptual framework.

DNAnexus also has an automatic provenance tracking system which is distinct from ENCODE's (although we work together) and not as sophisticated, but slightly more generic. I will provide some illustrations when I get a chance.

diekhans commented 9 years ago

This thing called `sample' is really the display name of variant calls on an alignment of read groups.

Calling this `sample' in a display to a user is a perfect way to summarize it.

Calling this `sample' in an API that is designed to accurately exchange data is a confusing misnomer. If it becomes truly necessary to have data structures to store display names to short-cut multiple queries to the API, that is reasonable. However, we need to get an accurate representation of the data first.

I always interpreted `name' as display name and id as a unique id of the data. However, since this is not documented, it turns out I am wrong.


diekhans commented 9 years ago

@mlin my personal litmus test for the design of the GA4GH API is whether we can cleanly and accurately represent TCGA and ICGC data.

Hopefully far more cleanly than the current two-dimensional data files used by the DCC allow.


dglazer commented 9 years ago

@lh3 -- I don't know which specific wet objects we want to manage with our API. Today we don't formally support any; my original diagram shows a possible world where we choose to support three, and the metadata team is working on a richer hierarchy. We could choose to collapse it all into supporting exactly one for now, leading to something like:

 "informatics sample"   ⇒ rgset ⇒ aligned rgset ⇒ call-column
\- wet stuff (atoms) -/    \-------- dry stuff (bits) --------/

I think deciding which objects to support is important, and largely resolvable in parallel with the main thrust of this issue, which is how to represent provenance between those objects. Whichever objects we decide to support, I believe:

dglazer commented 9 years ago

@mlin -- thanks for the ENCODE pointers. At first glance, the experiment page you link to suggests that ENCODE is using a model that fits the framework discussed here -- several objects of different types, with each object having a way to say "here's my parent" and "here's how I was generated from my parent". Good if so, and I expect the metadata TT is familiar with their work already.

A quick search on their site didn't turn up any doc on the details of their object model (e.g. specifically how object ids are defined, and used to let objects refer to their parents). I found https://www.encodeproject.org/help/rest-api/, but it mostly says "GET on any object returns a bunch of JSON" and "here's how to invoke our free-text site-wide search". Are there more details somewhere else?

dglazer commented 9 years ago

@diekhans -- I agree we need clear and agreed-on names and definitions for the objects we choose to support, and I agree that "sample" doesn't pass that test, because of the many different priors people have about what it means. (I'm not necessarily pushing back on the concept itself; I just want to understand it better before I weigh in on how useful it is to our API.)

I think you're right (i.e. you're wrong that you're wrong :) about:

I always interpreted `name' as display name and id as a unique id of the data. However, since this is not documented, it turns out I am wrong.

I believe our API should use "id" for unique data ids, and "name" for display names. Today's API is almost fully consistent with that principle, modulo a few points:

I hope the discussion in this issue will lead to agreement and documentation of the principles, which we can then use to clean up those API rough edges.

mbaudis commented 9 years ago

@dglazer On the "wet side of things" there will be a possibly intricate object hierarchy, which will be incompletely represented for most use cases. We currently have something like Individual => Sample => Experiment (meaning here library prep, though I have problems with the technical limitation there) => Analysis (interpreted result). This would obviously be a simple representation & we'll try to make it suitable for other cases (pooled samples and such).

I am a strong supporter of object id references irrespective of a strict object type (though there will have to be implementation checks). In principle, you just have to define a derivedFrom attribute (list), which then points to one or several parental objects (individual, sample, sample preparation).

This would be the most viable solution schema-wise. However, for GA4GH-compatible data management systems, one should point out an "ideal" structure (i.e. representing complete hierarchies).
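
Schema-wise, a derivedFrom list could look something like this sketch (Avro IDL; record and field names are illustrative only, not a concrete proposal):

  @namespace("org.ga4gh.models")
  protocol DerivedFromSketch {
    record ProvenanceRef {
      // [gu]uid of an upstream object; its type is resolved on dereference
      string id;
      // optional generation timestamp, e.g. to pick among multiple parents
      union { null, long } generatedAtMillis = null;
    }
    record AnyObject {
      string id;
      // zero or more parental objects (individual, sample, sample prep, ...)
      array<ProvenanceRef> derivedFrom = [];
    }
  }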

lh3 commented 9 years ago

@mlin The Encode model you showed is closer to the process of data analysis, but our API mainly aims at accessing processed results. Related of course, but different in focus.

@diekhans I am not familiar with TCGA/ICGC data. Could you show a concrete use case where the model in #383 falls short? I find such challenges effective most of the time.

@dglazer My general view is that the word "sample" computational people use daily (i.e. informatics sample -- a bad name I admit) has abstracted most biological complications away but is sufficient for most data access. This is the right type of object to model in the API. This "sample", like many other container types, is subjectively defined by submitters within the scope of their own projects. Note that BioSample is not the right thing to model. For example, 1000g pilot 1 sequenced NA12878 with Illumina/454/solid. These are three SRA samples. We can find the information in the BAM header, but when we say "NA12878" in the context of pilot 1, it means the ensemble of the three biological samples.

EDIT: I realize the primary role of Sample in #383 is to match the "sample" in the mind of submitters.

diekhans commented 9 years ago

@dglazer, this is a very key comment:

`My mental model of the genomics space is that there’s a “provenance chain” connecting objects that are derived from each other.'

One of the biggest failures of bioinformatics is our tendency to draw scientific conclusions based on data of very nebulous origins. GenBank is a canonical example. Data is deposited with very loose provenance tracking and metadata with varying and limited information.

Still, GenBank is one of our primary resources for understanding the genome. Using GenBank is closer to archaeology than experimental science. Horror stories abound. GenBank used to have more than 30,000 mRNA sequences that were actually naive gene predictions based on ESTs. Once we finally figured this out, it took years to get the entries removed.


diekhans commented 9 years ago

Heng Li notifications@github.com writes:

@mlin The Encode model you showed is closer to the process of data analysis, but our API mainly aims at accessing processed results. Related of course, but different in focus.

@lh3, you have brought up the root of all of this confusion!!

Many of us don't see the goal of the API as accessing processed results. We feel the goal is to be able to accurately exchange genomic data in an interoperable way. This includes the full spectrum of metadata that describes a data object and provides provenance information for it.

If the goal is accessing results, BAM already works great and VCF is tolerable. Why spend so much time in conference calls?

@diekhans I am not familiar with TCGA/ICGC data. Could you show a concrete use case where the model in #383 falls short? I find such challenges effective most of the time.

Here is the current TCGA data primer. A lot to dig through. https://wiki.nci.nih.gov/display/TCGA/TCGA+Data+Primer

The sample-through-variant-call data model more or less follows the old SRA data model of

Sample -> Experiment -> Run -> Analysis(mapping) -> Analysis(variant calling)

[note that SRA has since collapsed Run and Analysis and in the process lost a lot of ReadGroup metadata. Broad and UCSC were ignored on this concern]

#383 fails to address data provenance by collapsing a complicated chain of events. This leads to naive interpretation of the data. Batch effects are very important at each level of analysis.

One can always, and usually should, collapse a rich data view for display purposes.


diekhans commented 9 years ago

This PR is an excellent start at a data model that accurately represents the data. Some of the terminology differs from common practice, but the ideas are the same. I will try to dig up some docs on other systems representing this.

Much of this is already covered in the metadata schema, although the disconnection of the metadata task team from the other task teams has not led to an integrated model.

mbaudis commented 9 years ago

And for a provenance chain, possibly across repositories, [gu]uid implementations are IMHO the most promising way to go. If a sample is tagged as derived from an object with a given uuid, one (should) know the object's type. And if one has pointers to multiple upstream objects, one can pick XOR merge data (e.g. based on object generation timestamp ...).

But nothing will fully prevent duplicated data objects based on the same "wet" object (besides a registry, which obviously will be implemented by someone but should definitely not be required).

mbaudis commented 9 years ago

And re: @diekhans: YES! There has been a lot of preparation through metadata, but the MTT is a bit lost (sometimes in very specific discussions) if the data schema developers don't have a look / reach out / comment & criticise ...

lh3 commented 9 years ago

Several points:

  1. We have always been keeping a form of the center-sample-library-run hierarchy in the BAM header and thus as ReadGroup properties. You have the option to detect batch effects if you want to. The information is not lost. It is just not of the same importance as Sample in #383. Think about 1000g. How often does this hierarchy really matter? We are at the risk of degrading user experiences if we expose too many experimental details most people don't care about.
  2. I see our mission is to find the lowest denominator of major projects in a domain (e.g. DNA variants for now), abstract it into a data model and then to make the data of these projects available through this model such that users can access them in a uniform way. We have sequenced >100k non-tumor genomes/exomes so far (I will touch TCGA a bit later). Their lowest denominator is very simple: a bunch of samples, reads sequenced from the samples and variants called from the reads. Making these available alone would tremendously benefit the community.
  3. In particular, I see we are trying to fit ourselves to the existing projects and to make the common parts available. We are not trying to, at least not trying hard to, advise projects to change their ways of doing things. Of course, this is my personal view only. It would be good to clarify the GA4GH view.
  4. In a refVar call, we discussed why we are developing a ref server when it has the same model as VCF. The consensus at that time was authorization. I don't fully agree with this consensus, but I think it is an important factor at least.
  5. For TCGA, I would model it the following way. We create a "TCGA-WG-somatic" dataset. We name samples like "HCC1187", "HCC1187-BL" (BL means blood sample), "HCC2218", "HCC2218-BL", etc. and link normal-tumor pairs as a property of Sample in #383 (it does not have this field right now, but it is not hard to add; see the sketch after this list). We store the alignments for each sample with center, library, machine etc. as properties of ReadGroup -- 1000G does this as well. We keep merged somatic calls in one VariantSet composed of multiple tumor samples. Will this work for TCGA small somatic variants?
  6. At last, any specific suggestions on #383?
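
For point 5, the pairing could be a small addition to the Sample record in #383, along these lines (field name hypothetical; existing fields elided):

  @namespace("org.ga4gh.models")
  protocol TumorNormalSketch {
    record Sample {
      string id;
      // ... existing #383 fields elided ...
      // hypothetical addition: id of the matched sample, e.g. "HCC1187-BL"
      // as the matched normal for "HCC1187"
      union { null, string } pairedSampleId = null;
    }
  }
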
mlin commented 9 years ago

@dglazer re: ENCODE, their JSON schemas are here. I am cautious about the "parent" concept -- I personally would not say a readgroupset is a "parent" of variant calls derived from it, but rather that it was an input to the analysis which generated the variant calls. Also, it's not uncommon to need metadata about a genetic research subject's actual parents!

@diekhans @lh3 I strongly agree there's a big problem in the lack of shared understanding of the envisioned uses for these schemas/APIs. In general I like the direction Heng is going, to simplify and focus on core data models, but use cases are even more important than data models to my mind. Getting >100K genomes behind these APIs would be awesome, but I feel we ought to think more critically about how they'll then be used -- e.g. in genome browsers, variant callers, analytics like association studies, and the driver projects (ICGC, Matchmaker, BRCA Challenge) -- and whether they serve these applications well. Mark mentions, and I have also questioned, the need for a newfangled wire format for the actual reads and variants -- compare to ENCODE's server that lets you richly surf metadata but for the bulk data just dispenses URLs to files in common formats, a very practical and engineering-cost-effective approach IMHO. Another thread perhaps.

jeromekelleher commented 9 years ago

@dglazer This is a great start on writing down our data model. I agree with the majority of what you're saying. Recording the provenance of the steps taken as we move from primary data to the inferences we make is essential. There's one thing that I disagree with though:

a variantset logically contains one or more columns of calls

I don't think this matrix metaphor is especially helpful. Why say the VariantSet contains columns of calls rather than that a VariantSet is a set of Variants? If we extract the 'column' of Calls from a VariantSet, it is largely meaningless: we don't know what position the Call is for, and we don't know what the substitution is. The aggregation of all the calls for a particular 'sample' (I'm using this in the sense of @lh3's #383 PR; 'NA12878' is an element of this set, whatever it is) isn't a useful thing to model separately in my view; it's just a property of the VariantSet.

dglazer commented 9 years ago

I just created PR #391, which maps out some of the schema changes we would need to adopt a provenance model like the one described here. It's not meant to be committed as is; the idea is to add concrete code to the abstract discussion.

There are obvious similarities to #383, as well as a few intentional differences -- I hope that having both of these examples will help us come up with an even better proposal.

dglazer commented 9 years ago

@jeromekelleher , re the matrix metaphor -- I agree that a naked column of all the Calls from a Callset isn't meaningful (since we don't know the row-specific position and substitution info), but I'd say the same about a naked row of Calls from a Variant (since we don't know the column-specific origin and processing info). That's why I find the matrix metaphor so apt -- it lets us work with the properties that all Calls in a row share (i.e. the Variant data), and the properties that all Calls in a column share (i.e. the Callset data).

Similarly, there are some use cases (e.g. how is this variant distributed in my population of interest?) that drive row-centric access, and others (e.g. does my patient have any interesting variants in this region?) that drive column-centric access. That's why I think both are needed.

dglazer commented 9 years ago

@mlin , re "parent" -- I think we agree on the core concept, which is that

a readgroupset is ... an input to the analysis which generated the variant calls.

We can describe that in several synonymous ways; if we agree they're synonymous, let's find which one(s) communicate best and use them. E.g.:

  • a column of calls is derived from a readgroupset
  • a readgroupset is the basis used to generate a column of calls
  • a readgroupset is the input; a column of calls is the output
  • a readgroupset is the parent; a column of calls is the child

dglazer commented 9 years ago

@lh3 , re a few of your points:

1) We have always been keeping a form of the center-sample-library-run hierarchy in the BAM header and thus as ReadGroup properties. You have the option to detect batch effects if you want to. The information is not lost. It is just not of the same importance as Sample in #383. Think about 1000g. How often does this hierarchy really matter? We are at the risk of degrading user experiences if we expose too many experimental details most people don't care about.

If we're happy with the ad hoc provenance in today's BAM and VCF files, then I agree we don't need to make changes. If we want to add a formal provenance model, then we need to at least better document the intended behavior for handling today's ad hoc sample tags, and maybe go as far as the full model discussed here, incorporating whatever ideas we think best match that from #383 and #391.

2) I see our mission is to find the lowest denominator of major projects in a domain (e.g. DNA variants for now), abstract it into a data model and then to make the data of these projects available through this model such that users can access them in a uniform way. We have sequenced >100k non-tumor genomes/exomes so far (I will touch TCGA a bit later). Their lowest denominator is very simple: a bunch of samples, reads sequenced from the samples and variants called from the reads. Making these available alone would tremendously benefit the community.

I agree that would be useful. I think it's possible to represent that lowest denominator today, but only if we're okay with the ad hoc way that provenance is represented in source data, and we're okay documenting similar ad hoc conventions for the use of the sampleId field. I haven't checked, but I believe that Google's public copy of 1000g is indeed logically joinable by sampleId; it's just far from robust. (Although it's probably as robust as the source data.) If we want to do better, we need to agree on a model and then a schema.

3) In particular, I see we are trying to fit ourselves to the existing projects and to make the common parts available. We are not trying to, at least not trying hard to, advise projects to change their ways of doing things. Of course, this is my personal view only. It would be good to clarify the GA4GH view.

I completely agree that we need to strike a balance between over-fitting to the status quo (and therefore not advancing the state of the art) and over-requiring radical change (and therefore blocking adoption). These conversations are how we can find that balance.

jeromekelleher commented 9 years ago

@dglazer --- Fair enough. However, there are plenty of cases where we want to get information about a variant without any Call data. If we just want to know about the variants that exist and their base compositions, then we don't want Calls at all. There is a strong asymmetry between the rows and columns of this metaphorical matrix... Anyway, this is off-topic, and getting away from the main point.

jjfarrell commented 9 years ago

@dglazer The core concept described earlier does capture the data flow found in joint genotyping to create a multi-sample analysis-ready VCF (as found in the GATK pipeline https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2).

In the most recent GATK pipeline, a joint genotyped VCF is actually called from a set of GVCFs https://www.broadinstitute.org/gatk/guide/article?id=4017 (not directly from BAM files or readgroupsets). This helps solve both the scalability and the n+1 issue. When a new sample is sequenced, a GVCF is generated for the sample. A new joint genotyped VCF is then relatively quickly called on the n+1 GVCFs instead of the much slower method on n+1 BAM files.

So in this case, a callset contains multiple samples. This multi-sample callset undergoes further filtering and annotation to create an analysis-ready call set.

The quality of the genotypes in a joint genotyped VCF will be much better than in a VCF file based on one sample. So it is important to track that a set of calls across samples were made jointly instead of individually.


mlin commented 9 years ago

@dglazer Indeed my reservations about "parent" are largely terminological. I see one second-order distinction among these suggestions. Calling the relationship "input" and "output" seems to demand some sort of model of the analytic steps in between. "Derived from" is more neutral and vague, which is perhaps appropriate at this point in time.

mlin commented 9 years ago

Here are a few illustrations of provenance tracking in DNAnexus, presented for informational purposes.

This is the BAM file for my genome. Let me click Info to see where it came from.

It looks like this BAM file was the product of Picard MarkDuplicates in March, 2014. I can click through here to find out more about that job.

Here we see that Picard MarkDuplicates was the final step in an analysis (workflow execution) that had begun with four read groups (FASTQ pairs). The Gantt chart shows the execution history, followed by a flat list of inputs and outputs -- there are a couple of different ways to see the connectivity of the individual jobs without showing a big hairball. All the objects are linked by unique ID and will have their own "created by", so an API client can follow the chain all the way back to the original data uploaded. The information is all recorded automatically.

Observations:

  1. We don't have an explicit, required notion of "bioinformatics sample" or "SequencerInput". There are a variety of things our customers do in practice (not trying to say what's ideal, just what we see):
    • Organize related data through old-fashioned folders and filenames.
    • Set identifiers in properties/tags of individual data objects or the top-level analysis.
    • Set metadata links to a separate generic data object used to represent the sample/experiment/SequencerInput/whatever -- quite similar to the proposals in #383 and #391.
    • Keep sample-level information in an external LIMS, with consequently limited interest in representing it in DNAnexus.
  2. There are intermediate products here in between "unaligned readgroups" and "aligned readgroupset", and people make different decisions about what to toss or archive. Similarly there can be important intermediates between "aligned readgroupset" and "variantset", e.g. gVCF.
  3. There are input objects to the analysis, besides the unaligned readgroupset for the aligned readgroupset and the aligned readgroupset for the variantset, that are necessary for provenance and reproducibility. For example, the variant calling workflow I set up for my UYG data filters out variants in low-complexity regions, and a BED file listing these regions is an input. Others like to use GATK VQSR, which will take other inputs like a dbSNP flat file. Still others will insist on their snowflake perl script! @jjfarrell also mentions the good point that the analysis generating a variantset might have >10,000 inputs.

Trying to accommodate all of the above with a finite eng team has led us to the very execution-oriented provenance metadata system we have.

diekhans commented 9 years ago

Variants lacking any supporting Call data will come from databases like ClinVar. Sadly, a lot of the provenance trails will reach null.


dglazer commented 9 years ago

@mlin, re terminology -- glad we're in sync on the concepts; I'm not attached to any particular wording; we can nail that down when we get to final comments for any schema changes.

@mlin -- thanks for the DNAnexus provenance example; always helpful to see lessons from real-world implementations.

@jjfarrell -- good point re joint variant calling. I think that would be easy to fit into this conceptual framework (by saying that one column-of-calls can be derived from several upstream columns-of-calls), but the details could get messy.

@diekhans, @jeromekelleher -- if we want to discuss the variant / call relationship further, I suggest spinning up a new issue.