ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Provenance #240

Open jacmarjorie opened 9 years ago

jacmarjorie commented 9 years ago

The idea of provenance came up in the G2P meeting today, and I thought it would be a good idea to open a thread for conversation around it. Standardizing provenance deserves a lot of thought and touches several task teams. Should we consider forming a provenance task team?

One direct example would be querying on provenance to gain insight into commonly used pipelines. If a data management system (e.g. Synapse, LabKey) is storing provenance, and doing so in a GA4GH-compliant manner, API calls could be used across multiple data management systems to generate statistics about the most commonly used tools in an analysis pipeline. This could provide insight into the best analysis workflows out there, and work towards standardizing them.
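
As a rough illustration of that kind of cross-system query, here is a minimal Python sketch. The record layout (`tool`, `version`) and the in-memory lists standing in for API responses are assumptions made up for illustration; no GA4GH provenance API has been defined in this thread.

```python
from collections import Counter

# Hypothetical placeholder records, standing in for what each data
# management system might return from a provenance query.
records_by_system = {
    "system_a": [
        {"tool": "bwa", "version": "0.7.12"},
        {"tool": "samtools", "version": "1.2"},
    ],
    "system_b": [
        {"tool": "bwa", "version": "0.7.12"},
        {"tool": "gatk", "version": "3.4"},
    ],
}


def tool_usage(records_by_system):
    """Count how often each (tool, version) pair appears across all systems."""
    counts = Counter()
    for records in records_by_system.values():
        for rec in records:
            counts[(rec["tool"], rec["version"])] += 1
    return counts


usage = tool_usage(records_by_system)
for (tool, version), n in usage.most_common():
    print(f"{tool} {version}: used in {n} record(s)")
```

This only works if every system reports tools and versions under the same field names, which is the part a shared standard would have to settle.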

max-biodatomics commented 9 years ago

Hi Jacmarjorie,

There is a discussion of data provenance underway in the Containers and Execution group. I think provenance should be part of that group's work.

Max

jacmarjorie commented 9 years ago

Max, will you point me to this discussion? Thanks.

diekhans commented 9 years ago

Provenance and identification are GA4GH-wide topics. While the work the containers group is doing is important, particularly digests, tracking must extend beyond what is or can be done in any recommended container.

Provenance tracking must cover all data and metadata. A change in the metadata can drastically change the interpretation of the data.

Also, GA4GH's mission is data sharing. Any provenance information has to be independent of particular software implementations.

Perhaps you can present on a metadata call and we can start coordinating? We very much need to pay attention to provenance tracking across all of the task teams.

Mark

max-biodatomics commented 9 years ago

I am sorry, I missed the most recent e-mail. The containers group is still at an early stage of developing standards. The discussions we have had so far concern information about tools, parameters, and binaries. For example, if you are adding metadata about a tool, you need to specify its version, and sometimes even the build information for the tool. The main consensus reached is that we will use Docker containers to distribute binaries where possible (unfortunately, that is not everywhere). Docker assigns each container image a unique hash.

So, the metadata for each file should contain information on the tool, its version, all parameters, and the Docker image.
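
A minimal Python sketch of what such a per-file record could look like follows. The class name, field names, and example values are all hypothetical, and the image hash is a placeholder rather than a real digest; nothing here is an agreed GA4GH schema. The sketch also derives a digest of the record itself from a canonical JSON form, so that two records with the same content compare equal regardless of parameter order.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ProvenanceRecord:
    """Hypothetical per-file provenance metadata (not a GA4GH schema)."""
    tool: str
    version: str
    parameters: dict
    docker_image: str  # hash identifying the Docker image that was run


def record_digest(record: ProvenanceRecord) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
example = ProvenanceRecord(
    tool="bwa",
    version="0.7.12",
    parameters={"command": "mem", "threads": 4},
    docker_image="sha256:<image-hash-placeholder>",
)
same_params_reordered = ProvenanceRecord(
    tool="bwa",
    version="0.7.12",
    parameters={"threads": 4, "command": "mem"},
    docker_image="sha256:<image-hash-placeholder>",
)

print(record_digest(example) == record_digest(same_params_reordered))
```

Keeping the record as plain fields plus a content digest keeps it independent of any particular software implementation, which matters for sharing across systems.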