Pipeline Provenance Schema

To determine with confidence how a genomics analysis result was obtained, we need to record information about the processes that generated the analysis outputs. Most genomics analyses are performed using pipelines that consist of several discrete analysis steps. Each step may use a specific version of an analysis tool and produce outputs that are carried forward into later analysis steps.

The metadata that is collected about the process of generating a dataset is called data provenance information. The schema provided in this repo is intended to specify and validate the format of data provenance information for genomic analysis pipelines.
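As a rough illustration of the idea, the sketch below records one provenance entry per pipeline step and runs a simple structural check on each entry. It is not the schema defined in this repo: the field names (`step_name`, `tool_name`, `tool_version`, `outputs`), the `validate_step` helper, and the example tool versions and file names are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """One pipeline step. All field names here are hypothetical."""
    step_name: str
    tool_name: str
    tool_version: str
    outputs: list = field(default_factory=list)


def validate_step(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    # Each step must name itself, the tool it ran, and that tool's version.
    for key in ("step_name", "tool_name", "tool_version"):
        value = record.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key} must be a non-empty string")
    # Outputs are optional, but if present must be a list of strings.
    outputs = record.get("outputs", [])
    if not isinstance(outputs, list) or not all(isinstance(o, str) for o in outputs):
        problems.append("outputs must be a list of strings")
    return problems


# Illustrative values only; not taken from any real pipeline run.
steps = [
    StepRecord("trim_reads", "fastp", "0.23.2", ["sample_R1.trimmed.fastq.gz"]),
    StepRecord("assemble", "shovill", "1.1.0", ["sample.contigs.fa"]),
]
provenance = [asdict(step) for step in steps]

for record in provenance:
    assert validate_step(record) == []

print(json.dumps(provenance, indent=2))
```

Keeping each step as a flat record with an explicit tool version is what lets a downstream parser answer "which version of which tool produced this output" without inspecting the pipeline code itself.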

Other existing projects aim to specify this sort of data provenance information, notably Research Object and cwlprov. Those projects are likely more robust and complete than this schema, but that completeness comes with added complexity.

The purpose of this schema is to define a fairly simple standardized format for the provenance data generated by genomic analysis pipelines produced by the BCCDC-PHL. It is intended to serve as a standardized interface between pipeline designers/maintainers and the developers who parse and collect outputs from BCCDC-PHL genomics analysis pipelines.

Development Status

This schema is currently in draft status. It may change at any time. Releases will be tagged once some stability has been achieved.
