ACED-IDP / devops

Playbooks and scripts

data release conventions #4

Open bwalsh opened 1 year ago


data releases

use case

As an engineer, I need to know the sources, provenance, and locations of all data in a predictable manner. I need to store all of the above in a cold storage archive. The archive should be discoverable, identify all related data, and describe how to parse and load it into an active database.

MUSTS

SHOULDS

* File listing including:
    * provenance metadata, see https://github.com/DLR-SC/gitlab2prov
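
A file listing like this could be generated by walking the release directory. The following is only a minimal sketch: the function names `md5_of` and `list_release_files`, the `*.ndjson.gz` pattern, and the `name`/`md5`/`tags` entry shape are assumptions for illustration, not an agreed convention.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_release_files(root: Path, pattern: str = "*.ndjson.gz") -> list:
    """Build one listing entry per matching file under root (hypothetical shape)."""
    entries = []
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            entries.append({
                # Path relative to the release root, always with forward slashes
                "name": path.relative_to(root).as_posix(),
                "md5": md5_of(path),
                "tags": [],
            })
    return entries
```

Paths are recorded relative to the release root so that the listing stays valid wherever the archive is restored.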

EXAMPLE

├── README.md
├── file-Patient.ndjson.gz
├── file-Specimen.ndjson.gz
├── file-Task.ndjson.gz
└── sub-dir
    ├── file-DocumentReference.ndjson.gz
    ├── file-Observation.ndjson.gz
    └── file-Compound.ndjson.gz

"An iceberg's calf"

Would have a manifest.yaml


id: unique
name: 
author: email
version: semantic
related-to:

tags: []
schema:
    - url: http://some-publically-readable-url
      # embedded copy
      data: {}
source:
    # all files extracted from this source
    - url:
    # with this provenance
    - provenance: {}
code:
    # all files created with this provenance
    - provenance: {}
files:
    - name: file-Patient.ndjson.gz
      md5: XXXX
      # except this one, which overrides the provenance above
      code_provenance: {}
      source_provenance: {}
    - name: file-Specimen.ndjson.gz
      md5: XXXX
      tags: []
    - name: file-Task.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-DocumentReference.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-Observation.ndjson.gz
      md5: XXXX
      tags: []
    - name: sub-dir/file-Compound.ndjson.gz
      md5: XXXX
      tags: []
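
One way a loader might interpret this manifest, sketched under assumptions: the manifest has already been parsed into a plain dict (for example by a YAML library), a per-file `code_provenance` or `source_provenance` overrides the manifest-level `code` / `source` provenance, and file names are expected to be unique. The helper names `default_provenance`, `effective_provenance`, and `duplicate_names` are hypothetical.

```python
from collections import Counter


def default_provenance(section) -> dict:
    """Return the first `provenance` mapping in a manifest section, or {}."""
    for item in section or []:
        if isinstance(item, dict) and "provenance" in item:
            return item["provenance"] or {}
    return {}


def effective_provenance(manifest: dict, file_entry: dict) -> dict:
    """Resolve provenance for one file: per-file keys win over manifest defaults.

    Assumption: this override rule is one reading of the "except this one"
    comment in the manifest sketch, not a confirmed convention.
    """
    return {
        "code": file_entry.get(
            "code_provenance", default_provenance(manifest.get("code"))
        ),
        "source": file_entry.get(
            "source_provenance", default_provenance(manifest.get("source"))
        ),
    }


def duplicate_names(manifest: dict) -> list:
    """Return file names that appear more than once in `files`, sorted."""
    counts = Counter(entry["name"] for entry in manifest.get("files", []))
    return sorted(name for name, count in counts.items() if count > 1)
```

Running `duplicate_names` before archiving would catch a file that is listed twice or a copy-paste slip in the `files` section.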