billingross / trellis-v2

MIT License

Transition to standardized naming conventions for cloud storage #39

Open billingross opened 1 month ago

billingross commented 1 month ago

fastq-to-ubam naming convention for output:

/${sample}/HAS_READ_GROUP/${readGroup}/HAS_FASTQ/${jobRequestId}_${sample}_${readGroup}.ubam

1000 Genomes Fastq name:

ERR3239279_1.fastq.gz
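
A minimal sketch of how a job might apply this convention. It assumes that the part of the 1000 Genomes fastq name before the underscore is the run accession and the digit after it is the mate number, and that the run accession can stand in for the read group. The function names and the regex are hypothetical, not taken from Trellis.

```python
import re

# Assumed shape of a 1000 Genomes / ENA fastq name: <run accession>_<mate>.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<run>[A-Z]+\d+)_(?P<mate>[12])\.fastq\.gz$")


def parse_fastq_name(name: str) -> tuple[str, int]:
    """Split a fastq file name into (run accession, mate number)."""
    match = FASTQ_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized fastq name: {name!r}")
    return match.group("run"), int(match.group("mate"))


def ubam_object_name(sample: str, read_group: str, job_request_id: str) -> str:
    """Build the fastq-to-ubam output name from the convention above."""
    return (
        f"/{sample}/HAS_READ_GROUP/{read_group}/HAS_FASTQ/"
        f"{job_request_id}_{sample}_{read_group}.ubam"
    )
```

Both mates of a read group resolve to the same run accession, so they map to the same ubam name for a given job request.
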
billingross commented 1 month ago

European Nucleotide Archive organizes sequencing data according to the following concepts:

billingross commented 1 month ago

I want to organize data in a way that maximizes functionality. Example storage name:

mvp_phase3/HAS_SAMPLE/SHIP123/HAS_READ_GROUP/1/WAS_USED_BY/PROTOCOL_7/GENERATED/SHIP123_R1.fastq.gz

It's kind of long (the maximum GCS object name length is 1024 bytes), and we aren't going to use the protocol to differentiate samples. I could just drop the protocol and the project, frankly. We can use separate buckets for separate projects.

SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R1.fastq.gz
SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R2.fastq.gz
SHIP123/sample.json
SHIP123/HAS_READ_GROUP
SHIP123/HAS_READ_GROUP/1
SHIP123/HAS_READ_GROUP/2
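
A small sketch of building these shortened names while guarding against the 1024-byte limit. The function and constant names are hypothetical, and the limit is checked on the UTF-8 encoding of the name.

```python
MAX_GCS_OBJECT_NAME_BYTES = 1024  # limit on the UTF-8 encoded object name


def fastq_object_name(sample: str, read_group: str, fastq: str) -> str:
    """Build {sample}/HAS_READ_GROUP/{read_group}/HAS_FASTQ/{fastq}."""
    name = f"{sample}/HAS_READ_GROUP/{read_group}/HAS_FASTQ/{fastq}"
    size = len(name.encode("utf-8"))
    if size > MAX_GCS_OBJECT_NAME_BYTES:
        raise ValueError(f"Object name is {size} bytes; the limit is 1024")
    return name
```

Called with `("SHIP123", "1", "SHIP123_R1.fastq.gz")`, this yields the first name in the listing above.
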
billingross commented 1 month ago

How would I organize European Nucleotide Archive data?

Provenance model:

{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}
SAMN00797054/WAS_USED_BY/ERX3266654/GENERATED/NA07000/WAS_USED_BY/ERR4186945/GENERATED/ERR4186945_1.fastq.gz

Functional model:

{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_FASTQ/{fastq}
SAMN00797054/HAS_READ_GROUP/ERR4186945/HAS_FASTQ/ERR4186945_1.fastq.gz

What is the relationship between these two models? How are they similar; how are they different?
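
One way to look at it, as a sketch rather than a settled answer: both models are paths that alternate node / relationship / node, so the same parser reads either one, and the functional path can be derived from the provenance path by keeping only the sample, sequencing run, and fastq. The function names below are hypothetical, and the fixed positions assume the exact provenance template shown above.

```python
def path_to_triples(path: str) -> list[tuple[str, str, str]]:
    """Read an alternating node/relationship path as (subject, relationship, object) triples."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError(f"Expected node/REL/node/... segments, got: {path!r}")
    return [(parts[i], parts[i + 1], parts[i + 2]) for i in range(0, len(parts) - 2, 2)]


def provenance_to_functional(path: str) -> str:
    """Collapse a provenance fastq path into the functional form.

    Assumes the template:
    sample/WAS_USED_BY/experiment/GENERATED/library/WAS_USED_BY/run/GENERATED/fastq
    """
    parts = path.strip("/").split("/")
    if len(parts) != 9:
        raise ValueError(f"Expected 9 segments, got {len(parts)}")
    sample, sequencing_run, fastq = parts[0], parts[6], parts[8]
    return f"{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_FASTQ/{fastq}"
```

The conversion only goes one way: the experiment and library are dropped, so the provenance path cannot be rebuilt from the functional one.
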

Modelling Ubams

Provenance model:

{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}/WAS_USED_BY/{job-id}/GENERATED/{ubam}

Functional model:

{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_UBAM/{ubam}

The functional model is easier to program jobs around because it is more consistent and less dependent on how the data was generated. The provenance of a thing does indicate its function, but the functional model is also just simpler.
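
As a rough illustration of why the functional layout is easy to program against: a job that needs every ubam for a sample can group object names by read group from fixed path positions, with no knowledge of which experiment, library, or job produced them. The function name is hypothetical, and the input is assumed to be a plain list of object names such as a bucket listing would return.

```python
def ubams_by_read_group(object_names: list[str], sample: str) -> dict[str, list[str]]:
    """Group a sample's ubam object names by read group.

    Expects names shaped like {sample}/HAS_READ_GROUP/{read_group}/HAS_UBAM/{ubam}.
    """
    prefix = f"{sample}/HAS_READ_GROUP/"
    grouped: dict[str, list[str]] = {}
    for name in object_names:
        parts = name.split("/")
        # Functional ubam paths always have five segments with HAS_UBAM in position 3.
        if name.startswith(prefix) and len(parts) == 5 and parts[3] == "HAS_UBAM":
            grouped.setdefault(parts[2], []).append(name)
    return grouped
```

The same grouping over provenance paths would need to handle a path whose depth grows with every processing step.
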

billingross commented 1 month ago

Principles of the functional model