Open billingross opened 1 month ago
The European Nucleotide Archive (ENA) organizes sequencing data according to the following concepts:
I want to organize data in a way that maximizes functionality. Example storage name:
mvp_phase3/HAS_SAMPLE/SHIP123/HAS_READ_GROUP/1/WAS_USED_BY/PROTOCOL_7/GENERATED/SHIP123_R1.fastq.gz
It's kind of long (the maximum GCS object name length is 1024 bytes), and we aren't going to use the protocol to differentiate samples. Frankly, I could just drop the protocol and the project. We can use separate buckets for separate projects.
SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R1.fastq.gz
SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R2.fastq.gz
SHIP123/sample.json
SHIP123/HAS_READ_GROUP
SHIP123/HAS_READ_GROUP/1
SHIP123/HAS_READ_GROUP/2
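To keep an eye on the length concern, here is a minimal sketch that measures an object name in UTF-8 bytes and compares it against the 1024-byte limit mentioned above. The helper names (`object_name_bytes`, `fits_gcs_limit`) are made up for illustration and are not part of any GCS client library.

```python
# Limit quoted in this issue for GCS object names, in UTF-8 encoded bytes.
GCS_MAX_OBJECT_NAME_BYTES = 1024


def object_name_bytes(name: str) -> int:
    """Return the length of an object name once encoded as UTF-8."""
    return len(name.encode("utf-8"))


def fits_gcs_limit(name: str) -> bool:
    """True if the name is within the byte limit (hypothetical helper)."""
    return object_name_bytes(name) <= GCS_MAX_OBJECT_NAME_BYTES


long_name = (
    "mvp_phase3/HAS_SAMPLE/SHIP123/HAS_READ_GROUP/1/"
    "WAS_USED_BY/PROTOCOL_7/GENERATED/SHIP123_R1.fastq.gz"
)
short_name = "SHIP123/HAS_READ_GROUP/1/HAS_FASTQ/SHIP123_R1.fastq.gz"

for name in (long_name, short_name):
    print(object_name_bytes(name), fits_gcs_limit(name), name)
```

Both example names fit comfortably, so the limit only becomes a real constraint if the relationship chain keeps growing.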
{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}
SAMN00797054/WAS_USED_BY/ERX3266654/GENERATED/NA07000/WAS_USED_BY/ERR4186945/GENERATED/ERR4186945_1.fastq.gz
{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_FASTQ/{fastq}
SAMN00797054/HAS_READ_GROUP/ERR4186945/HAS_FASTQ/ERR4186945_1.fastq.gz
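One way to compare the two models is to render both names from the same metadata record. The sketch below uses the two templates above as Python format strings. The `record` dict and the variable names are assumptions for illustration; the values are taken from the example paths above.

```python
# Provenance model: every step that produced the file appears in the name.
PROVENANCE_TEMPLATE = (
    "{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/"
    "WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}"
)
# Functional model: only what a downstream job needs to locate the file.
FUNCTIONAL_TEMPLATE = "{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_FASTQ/{fastq}"

# Hypothetical metadata record built from the example above.
record = {
    "sample": "SAMN00797054",
    "experiment": "ERX3266654",
    "library": "NA07000",
    "sequencing_run": "ERR4186945",
    "fastq": "ERR4186945_1.fastq.gz",
}

# str.format ignores keyword arguments a template does not use, so the
# same record can fill both templates.
provenance_name = PROVENANCE_TEMPLATE.format(**record)
functional_name = FUNCTIONAL_TEMPLATE.format(**record)

print(provenance_name)
print(functional_name)
```

The functional name needs only three of the five fields, which is the main practical difference between the two models.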
What is the relationship between these two models? How are they similar; how are they different?
{sample}/WAS_USED_BY/{experiment}/GENERATED/{library}/WAS_USED_BY/{sequencing_run}/GENERATED/{fastq}/WAS_USED_BY/{job-id}/GENERATED/{ubam}
{sample}/HAS_READ_GROUP/{sequencing_run}/HAS_UBAM/{ubam}
The functional model is easier to program jobs around because it is more consistent and less dependent on how the data was generated. On the other hand, the provenance of a thing indicates its function, so the provenance model does carry useful information. Still, the functional model is just simpler.
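As a rough illustration of why the functional model is easier to program against: every name has the same five segments regardless of file type, so a single parser covers FASTQ and uBAM objects alike. This is a sketch under that assumption. `parse_functional_name` and its returned keys are hypothetical and not an existing API.

```python
def parse_functional_name(name: str) -> dict:
    """Split a functional-model object name into its parts.

    Assumes the layout {sample}/HAS_READ_GROUP/{read_group}/{relation}/{file},
    where relation is something like HAS_FASTQ or HAS_UBAM.
    """
    parts = name.split("/")
    if len(parts) != 5 or parts[1] != "HAS_READ_GROUP":
        raise ValueError(f"not a functional-model name: {name!r}")
    sample, _, read_group, relation, filename = parts
    return {
        "sample": sample,
        "read_group": read_group,
        "relation": relation,
        "file": filename,
    }


parsed = parse_functional_name(
    "SAMN00797054/HAS_READ_GROUP/ERR4186945/HAS_FASTQ/ERR4186945_1.fastq.gz"
)
print(parsed["sample"], parsed["read_group"], parsed["relation"], parsed["file"])
```

A provenance-model name would need a different parser for each chain length, since the number of segments grows with every processing step.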
Principles of the functional model
fastq-to-ubam naming convention for output:
1000 Genomes Fastq name: