billingross commented 1 month ago

fastq-to-ubam naming convention for output:

/${sample}/HAS_READ_GROUP/${readGroup}/HAS_FASTQ/${jobRequestId}_${sample}_${readGroup}.ubam
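
A minimal sketch of how a job might assemble that output name. The function name `ubam_output_path` and the example values are hypothetical; only the path template comes from the convention above.

```python
def ubam_output_path(sample: str, read_group: str, job_request_id: str) -> str:
    """Build the uBAM object name following the fastq-to-ubam template."""
    # File name: ${jobRequestId}_${sample}_${readGroup}.ubam
    filename = f"{job_request_id}_{sample}_{read_group}.ubam"
    # Directory part: /${sample}/HAS_READ_GROUP/${readGroup}/HAS_FASTQ/
    return f"/{sample}/HAS_READ_GROUP/{read_group}/HAS_FASTQ/{filename}"


# Hypothetical values, for illustration only.
example = ubam_output_path("SHIP123", "1", "job42")
```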

1000 Genomes Fastq name:

ERR3239279_1.fastq.gz
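
To map a name like this onto the convention, the run accession and mate number have to be pulled out of the file name. A rough sketch, assuming names always look like `<run accession>_<1|2>.fastq.gz` and that run accessions start with ERR, SRR, or DRR; the regex and the function name `parse_ena_fastq` are my own assumptions, not anything from the 1000 Genomes tooling.

```python
import re

# Assumed shape: run accession, underscore, mate number, ".fastq.gz".
ENA_FASTQ = re.compile(r"^(?P<run>[EDS]RR\d+)_(?P<mate>[12])\.fastq\.gz$")


def parse_ena_fastq(name: str) -> tuple[str, int]:
    """Return (run accession, mate number) for a FASTQ file name."""
    match = ENA_FASTQ.match(name)
    if match is None:
        raise ValueError(f"unexpected FASTQ name: {name}")
    return match.group("run"), int(match.group("mate"))


run, mate = parse_ena_fastq("ERR3239279_1.fastq.gz")
```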

billingross commented 1 month ago

The European Nucleotide Archive organizes sequencing data according to the following concepts:

Study (1000 Genomes Phase 3)
Sample (Biological sample)
Experiment (Library preparation)
Run (Sequencing)

billingross commented 1 month ago

I want to organize data in a way that maximizes functionality. Example storage name:

mvp_phase3/HAS_SAMPLE/SHIP123/HAS_READ_GROUP/1/WAS_USED_BY/PROTOCOL_7/GENERATED/SHIP123_R1.fastq.gz

It's kind of long (the maximum GCS object name length is 1024 bytes), and we aren't going to use the protocol to differentiate samples. I could just drop the protocol and the project, frankly. We can use separate buckets for separate projects.

SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R1.fastq.gz
SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R2.fastq.gz

SHIP123/sample.json
SHIP123/HAS_READ_GROUP
SHIP123/HAS_READ_GROUP/1
SHIP123/HAS_READ_GROUP/2
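
A small sketch of building the shortened names and checking them against the 1024-byte limit mentioned above. The helper names are hypothetical, and the limit is measured on the UTF-8 encoding of the object name.

```python
GCS_MAX_OBJECT_NAME_BYTES = 1024


def fastq_object_name(sample: str, read_group: str, filename: str) -> str:
    """Build {sample}/HAS_READ_GROUP/{read_group}/HAS_FASTQ/{filename}."""
    return f"{sample}/HAS_READ_GROUP/{read_group}/HAS_FASTQ/{filename}"


def fits_gcs_limit(object_name: str) -> bool:
    """Check the name length in bytes, not in characters."""
    return len(object_name.encode("utf-8")) <= GCS_MAX_OBJECT_NAME_BYTES


name = fastq_object_name("SHIP123", "1", "SHIP123_R1.fastq.gz")
```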

billingross commented 1 month ago

How would I organize European Nucleotide Archive data?

Provenance model:

{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}
SAMN00797054/WAS_USED_BY/ERX3266654/GENERATED/NA07000/WAS_USED_BY/ERR4186945/GENERATED/ERR4186945_1.fastq.gz

Functional model:

{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_FASTQ/{fastq}
SAMN00797054/HAS_READ_GROUP/ERR4186945/HAS_FASTQ/ERR4186945_1.fastq.gz

What is the relationship between these two models? How are they similar; how are they different?

The functional model is more concise because it ignores the analyses and focuses only on the data.
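
One way to see the relationship: the functional path can be derived from the provenance path by dropping the experiment and library steps. A sketch, assuming the provenance path always has exactly the nine segments shown in the template above; `provenance_to_functional` is a hypothetical helper.

```python
def provenance_to_functional(path: str) -> str:
    """Collapse a provenance FASTQ path into its functional equivalent."""
    parts = path.split("/")
    # Segments 1, 3, 5, 7 are the relationship names.
    expected = ["WAS_USED_BY", "GENERATED", "WAS_USED_BY", "GENERATED"]
    if len(parts) != 9 or parts[1::2] != expected:
        raise ValueError(f"not a provenance FASTQ path: {path}")
    # Keep sample, sequencing run, and file; drop experiment and library.
    sample, sequencing_run, fastq = parts[0], parts[6], parts[8]
    return f"{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_FASTQ/{fastq}"


functional = provenance_to_functional(
    "SAMN00797054/WAS_USED_BY/ERX3266654/GENERATED/NA07000"
    "/WAS_USED_BY/ERR4186945/GENERATED/ERR4186945_1.fastq.gz"
)
```

The reverse direction is not possible from the path alone, since the experiment and library are discarded.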

Modelling uBAMs

Provenance model:

{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}/WAS_USED_BY/{job-id}/GENERATED/{ubam}

Functional model:

{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_UBAM/{ubam}

The functional model is easier to program jobs around because it is more consistent and less dependent on how the data was generated. The provenance of a thing does indicate its function, but the functional model is also just simpler.
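
To illustrate the consistency point: under the functional model, a single helper can cover both FASTQs and uBAMs, because only the relationship name changes. This is a sketch; `functional_path` and the uBAM file name are hypothetical.

```python
def functional_path(sample: str, read_group: str, relation: str, filename: str) -> str:
    """Build {sample}/HAS_READ_GROUP/{read_group}/{relation}/{filename}."""
    return f"{sample}/HAS_READ_GROUP/{read_group}/{relation}/{filename}"


# Same function, different relationship; no job id or experiment needed.
fastq_path = functional_path("SAMN00797054", "ERR4186945", "HAS_FASTQ", "ERR4186945_1.fastq.gz")
ubam_path = functional_path("SAMN00797054", "ERR4186945", "HAS_UBAM", "ERR4186945.ubam")
```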

billingross commented 1 month ago

Principles of the functional model

Organized around the data.
Designed so that job inputs are specified by complete object paths or path patterns, not by other data structures such as lists.
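
A rough sketch of what "path patterns" could look like in practice, using shell-style wildcards from Python's standard `fnmatch` module. The object names are hypothetical, and `select_inputs` is not part of any existing job code. Note that `*` in `fnmatch` also matches `/`, so the pattern is looser than a true per-segment glob.

```python
from fnmatch import fnmatchcase


def select_inputs(object_names: list[str], pattern: str) -> list[str]:
    """Return the object names matching a shell-style path pattern."""
    # fnmatchcase is case-sensitive on every platform.
    return [name for name in object_names if fnmatchcase(name, pattern)]


# Hypothetical bucket listing.
objects = [
    "SHIP123/sample.json",
    "SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R1.fastq.gz",
    "SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R2.fastq.gz",
    "SHIP123/HAS_READ_GROUP/1/HAS_UBAM/SHIP123_1.ubam",
]

fastqs = select_inputs(objects, "SHIP123/HAS_READ_GROUP/*/HAS_FASTQ/*.fastq.gz")
```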

billingross / trellis-v2

Transition to standardized naming conventions for cloud storage #39
