d3b-center / d3b-cds-manifest-prep

scripts to prep manifests for cds
Apache License 2.0

🌱 Add file-sample-participant mapping for the CBTN X01 #133

Closed chris-s-friedman closed 1 year ago

chris-s-friedman commented 1 year ago


Adds a file-sample-participant map for the CBTN X01. It contains every file in the CBTN X01 bucket that is connected to a sample. The manifest is generated with the query in scripts/v1.x.x/build_fsp.py.

TODO: add validation that every file in the CDS X01 bucket is in this manifest and that every file in this manifest is actually in the CDS X01 bucket
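A minimal sketch of what that two-way validation could look like, assuming the manifest keys and the bucket keys have already been loaded into plain Python collections of strings. The function name `compare_manifest_to_bucket` and the result keys are hypothetical and are not taken from this repo:

```python
def compare_manifest_to_bucket(manifest_keys, bucket_keys):
    """Return the keys that appear on only one side of the comparison.

    Both arguments are iterables of object keys (strings). How they are
    loaded (manifest file, bucket listing) is left to the caller.
    """
    manifest = set(manifest_keys)
    bucket = set(bucket_keys)
    return {
        # In the manifest but not found in the bucket
        "missing_from_bucket": sorted(manifest - bucket),
        # In the bucket but not listed in the manifest
        "missing_from_manifest": sorted(bucket - manifest),
    }


if __name__ == "__main__":
    # Toy keys for illustration only, not real bucket contents
    result = compare_manifest_to_bucket(
        ["source/a.bam", "source/b.bam"],
        ["source/a.bam", "source/b.bam", "source/a.bam.md5"],
    )
    for name, keys in result.items():
        print(name, len(keys))
```

Both lists being empty would mean the manifest and the bucket agree exactly.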

chris-s-friedman commented 1 year ago

I've been performing some QC on these files:

  1. 🟩 Every file in the manifest is in the bucket
  2. 🛑 Not every file in the bucket is in the manifest. 75,584 files are in the bucket that are not in the manifest.

For item 2, this needs to be resolved but is also explainable:

  1. 1 of those files is a .sb.access file in the root of the bucket.
  2. 5,565 of those files are in the source prefix of the bucket. One of these files is the terra manifest, s3://cds-306-phs002517-x01/source/terra_manifest.tsv. The other 5,564 files are .md5 files.
  3. 70,018 files are in the harmonized file directory. Of these files:
       - Count of keys in harmonized-data/copy-number-variations: 3093
       - Count of keys in harmonized-data/gene-expressions: 3513
       - Count of keys in harmonized-data/workflow-outputs: 3
       - Count of files in harmonized-data/workflow-outputs/rnaseq-analysis: 14052
       - Count of files in harmonized-data/workflow-outputs/somatic-mutations: 11136
       - Count of files in harmonized-data/workflow-outputs/alignment: 2
       - Count of keys in harmonized-data/simple-variants: 29496
       - Count of keys in harmonized-data/structural-variations: 2870
       - Count of keys in harmonized-data/aligned-reads: 2342
       - Count of keys in harmonized-data/gene-fusions: 3514
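For reference, per-prefix counts like the ones above can be produced by grouping the unmatched keys on their leading path components. This is only a sketch, assuming the keys are bucket-relative strings (no `s3://bucket/` part); `count_by_prefix` and its `depth` parameter are hypothetical names, not the code that produced the numbers in this comment:

```python
from collections import Counter


def count_by_prefix(keys, depth=2):
    """Count keys by the first `depth` directory components of each key."""
    counts = Counter()
    for key in keys:
        dirs = key.split("/")[:-1]  # drop the file name, keep directories
        # Keys with no directory part are grouped under "(root)"
        prefix = "/".join(dirs[:depth]) or "(root)"
        counts[prefix] += 1
    return counts


if __name__ == "__main__":
    # Toy keys for illustration only, not real bucket contents
    example = [
        ".sb.access",
        "source/terra_manifest.tsv",
        "harmonized-data/gene-fusions/sample1.tsv",
        "harmonized-data/gene-fusions/sample2.tsv",
    ]
    for prefix, n in sorted(count_by_prefix(example).items()):
        print(f"Count of keys in {prefix}: {n}")
```

Raising `depth` to 3 would split a prefix such as harmonized-data/workflow-outputs into its subdirectories.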

TODO: see if these harmonized files are in the harmonization manifest from the bioinformatics unit or not and get counts on file types.

chris-s-friedman commented 1 year ago

All files in the bucket are now accounted for. Note that there are 6,902 harmonized files that need to be deleted from the bucket, as well as two other files: .sb.access and s3://cds-306-phs002517-x01/source/terra_manifest.tsv.

Tickets for these deletes are here:

DEVOPS-1156 DEVOPS-1251 DEVOPS-1290 DEVOPS-1374

baileyckelly commented 1 year ago

I've approved 1 of the 4 tickets. The other 3, I need some more documentation from bix. Will send over to them.