Why? As a user of the DCP, I expect the data in the DCP to be as standardized as possible so I can easily analyze it, even though it is produced by pipelines that may or may not use the same tools or conform to the same output standards.
This implies details like:
The QC metrics output by our pipelines are named identically when they represent the same metric, and are organized and stored in expected ways.
Our BAMs are consistent in their header metadata and naming conventions, and
Our cell matrices are consistent in the metadata they include and in their column/row headers, to name a few.
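As a purely illustrative sketch of the first point (the metric and pipeline names below are hypothetical, not the official DCP schema), a shared mapping could rename each pipeline's QC metric keys onto one common vocabulary:

```python
# Hypothetical sketch: normalize per-pipeline QC metric names onto a
# shared vocabulary so downstream consumers see one set of keys.
# The metric names here are illustrative, not the actual DCP schema.

COMMON_METRIC_NAMES = {
    # (pipeline, pipeline-specific name) -> common name
    ("optimus", "n_reads"): "total_reads",
    ("ss2", "total_reads"): "total_reads",
    ("optimus", "pct_mapped"): "percent_mapped",
    ("ss2", "mapping_rate"): "percent_mapped",
}

def normalize_metrics(pipeline: str, metrics: dict) -> dict:
    """Rename pipeline-specific QC metric keys to the common schema.

    Keys with no mapping are kept as-is so nothing is silently dropped.
    """
    return {
        COMMON_METRIC_NAMES.get((pipeline, key), key): value
        for key, value in metrics.items()
    }

# Both pipelines now report the same 'percent_mapped' key:
normalize_metrics("ss2", {"mapping_rate": 0.97})
normalize_metrics("optimus", {"pct_mapped": 0.95})
```

A lookup table like this keeps the mapping declarative, so adding a pipeline or metric is a data change rather than a code change.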
The matrix service, Unity, the data browser, and users all depend on us to provide consistent outputs.
Where to start: See an initial spike into this work on our QC outputs here, official documentation on our QC metrics here, and this spreadsheet documenting all the outputs from Optimus and SS2. Work with the matrix service team (Marcus Kinsella) for feedback and testing.
ACs:
Users have quick and consistent access to the most useful metrics output by all of our pipelines.
A list of requirements for all outputs of all pipelines that run in the HCA DCP, with small examples (e.g. all BAMs will be named xyz.zyx.bam)
Our pipelines produce identically formatted outputs wherever possible and meet those requirements
Validation tools that confirm outputs meet that spec. These should work for us in development, be easy to expand, be usable by others in EE, and be runnable by ingestion on the outputs coming out of our pipelines. They can also be used in CI compliance tests.
Documentation on all of these validation tools
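A minimal sketch of what one such validation check could look like, assuming a hypothetical xyz.zyx.bam-style naming spec (the real rules would come from the agreed requirements list above):

```python
import re

# Hypothetical naming spec: <project>.<sample>.bam -- purely illustrative;
# the actual spec would come from the agreed requirements list.
BAM_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.bam$")

def validate_bam_name(filename: str) -> list:
    """Return a list of spec violations; an empty list means the name passes."""
    errors = []
    if not BAM_NAME_PATTERN.match(filename):
        errors.append(f"{filename!r} does not match <project>.<sample>.bam")
    return errors
```

Returning a list of violations rather than a boolean makes the checks easy to aggregate across many files in a CI compliance run.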
Some ideas that came out of the initial spike:
Create a metrics package that, through modularity and extensibility, supports mapping future pipelines onto a common schema
Allow exchange of output format types, to support changes in downstream file-type requests
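One way such a modular metrics package could be shaped (all names here are invented for illustration): each pipeline registers an adapter that maps its raw QC output onto the common schema, so supporting a future pipeline means adding one adapter rather than changing core code.

```python
# Illustrative sketch of an extensible metrics package; names are hypothetical.
# Each pipeline contributes an adapter; the core only knows the registry.

ADAPTERS = {}

def register_adapter(pipeline_name):
    """Decorator that registers a metric adapter for a pipeline."""
    def wrap(func):
        ADAPTERS[pipeline_name] = func
        return func
    return wrap

@register_adapter("ss2")
def ss2_adapter(raw: dict) -> dict:
    return {"total_reads": raw["total_reads"],
            "percent_mapped": raw["mapping_rate"]}

@register_adapter("optimus")
def optimus_adapter(raw: dict) -> dict:
    return {"total_reads": raw["n_reads"],
            "percent_mapped": raw["pct_mapped"]}

def to_common_schema(pipeline: str, raw: dict) -> dict:
    """Convert raw pipeline metrics to the common schema via its adapter."""
    return ADAPTERS[pipeline](raw)
```

The same registry pattern could carry output-format converters, covering the second idea (swapping downstream file types) without touching the pipelines themselves.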