chris-s-friedman closed 1 year ago
I've been performing some QC on these files:
75,584 files are in the bucket that are not in the manifest.
For item 2, this needs to be resolved but is also explainable:
1 of those files is an .sb.access file in the root of the bucket.
5,565 of those files are in the source prefix of the bucket. One of these files is the terra manifest, s3://cds-306-phs002517-x01/source/terra_manifest.tsv. The other 5,564 files are .md5 files.
70,018 files are in the harmonized file directory.
Of these files:
Count of keys in harmonized-data/copy-number-variations: 3093
Count of keys in harmonized-data/gene-expressions: 3513
Count of keys in harmonized-data/workflow-outputs: 3
Count of files in harmonized-data/workflow-outputs/rnaseq-analysis: 14052
Count of files in harmonized-data/workflow-outputs/somatic-mutations: 11136
Count of files in harmonized-data/workflow-outputs/alignment: 2
Count of keys in harmonized-data/simple-variants: 29496
Count of keys in harmonized-data/structural-variations: 2870
Count of keys in harmonized-data/aligned-reads: 2342
Count of keys in harmonized-data/gene-fusions: 3514
TODO: see if these harmonized files are in the harmonization manifest from the bioinformatics unit or not and get counts on file types.
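For reference, per-directory counts like the ones above could be produced with something along these lines. This is only a sketch, not the script that was actually used for this QC: the function name count_keys_by_subdir and the example keys are made up for illustration, and it assumes the bucket listing has already been pulled into a plain list of key strings.

```python
from collections import Counter
from typing import Iterable


def count_keys_by_subdir(keys: Iterable[str], root: str = "harmonized-data/") -> Counter:
    """Count object keys under each first-level subdirectory of `root`."""
    counts: Counter = Counter()
    for key in keys:
        if not key.startswith(root):
            continue  # ignore keys outside the harmonized prefix
        remainder = key[len(root):]
        subdir, sep, _ = remainder.partition("/")
        if sep:  # skip objects that sit directly in `root`
            counts[root + subdir] += 1
    return counts


# Hypothetical keys, for illustration only (not real bucket contents).
example_keys = [
    "harmonized-data/gene-fusions/a.tsv",
    "harmonized-data/gene-fusions/b.tsv",
    "harmonized-data/aligned-reads/c.cram",
    "source/c.cram.md5",
]
example_counts = count_keys_by_subdir(example_keys)
```

Passing root="harmonized-data/workflow-outputs/" would give the same kind of breakdown one level down (rnaseq-analysis, somatic-mutations, alignment).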
All files in the bucket are now accounted for. Note that there are 6902 harmonized files that need to be deleted from the bucket, as well as two files in the source directory, .sb.access and s3://cds-306-phs002517-x01/source/terra_manifest.tsv.
Tickets for these deletes are here:
I've approved 1 of the 4 tickets. For the other 3, I need some more documentation from bix, so I will send them over to them.
🌱 Add file-sample-participant mapping for the CBTN X01
Adds file-sample-participant map for the CBTN X01. This has every file connected to a sample in the CBTN X01 bucket. This manifest is generated with the query in scripts/v1.x.x/build_fsp.py.
TODO: add validation that every file in the CDS X01 bucket is in this manifest and that every file in this manifest is actually in the CDS X01 bucket.
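One possible shape for that validation is a two-way set comparison. The sketch below is an assumption about how it could look, not something that exists in build_fsp.py: compare_manifest_to_bucket is a hypothetical helper, the example keys are invented, and it assumes both the manifest's file paths and the bucket listing are already available as lists of key strings.

```python
from typing import Iterable, List, Tuple


def compare_manifest_to_bucket(
    manifest_keys: Iterable[str],
    bucket_keys: Iterable[str],
    ignore: Iterable[str] = (),
) -> Tuple[List[str], List[str]]:
    """Return (keys in bucket but not in manifest, keys in manifest but not in bucket).

    `ignore` holds bucket keys that are known and expected to be absent
    from the manifest, so they are not reported as problems.
    """
    manifest = set(manifest_keys)
    bucket = set(bucket_keys) - set(ignore)
    not_in_manifest = sorted(bucket - manifest)
    not_in_bucket = sorted(manifest - bucket)
    return not_in_manifest, not_in_bucket


# Hypothetical keys, for illustration only.
manifest_example = ["source/a.cram", "source/b.cram"]
bucket_example = ["source/a.cram", "source/a.cram.md5", ".sb.access"]

extra_in_bucket, missing_in_bucket = compare_manifest_to_bucket(
    manifest_example, bucket_example, ignore=[".sb.access"]
)
```

The validation passes only when both returned lists are empty. Returning the offending keys rather than a bare boolean makes it easier to see what to chase down when it fails.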