d3b-center / d3b-cds-manifest-prep

scripts to prep manifests for cds
Apache License 2.0

Submission Package bug - Sequencing File Information is Missing #134

Closed chris-s-friedman closed 1 year ago

chris-s-friedman commented 1 year ago

Describe the bug

Sequencing Information is incomplete for some files to be submitted in the second CDS release.

There are 8756 unique sequencing experiments associated with files being submitted.

The export from the dataservice with information about each of these experiments is here.

Platform

Accepted values:

AB Capillary, ABI Solid, BGISEQ, Complete Genomics, Helicos, Illumina, Ion Torrent, LS 454, Oxford Nanopore, PacBio SMRT

Actual Values

platform count
Illumina 8503
Not Reported 242
Other 11

The issue is with the last two platforms. We need to decide what platform these experiments were performed on.
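The comparison above can be sketched as a small check. This is only an illustration, not the script used for the export: the helper name `count_unaccepted` is hypothetical, the way the accepted list is split into ten values is an assumption, and the platform column is rebuilt from the counts reported in this issue.

```python
from collections import Counter

# Accepted platform values as listed in this issue.
# How the flattened list splits into separate values is an assumption.
ACCEPTED_PLATFORMS = {
    "AB Capillary", "ABI Solid", "BGISEQ", "Complete Genomics", "Helicos",
    "Illumina", "Ion Torrent", "LS 454", "Oxford Nanopore", "PacBio SMRT",
}


def count_unaccepted(values, accepted):
    """Count each value that is not in the accepted set (hypothetical helper)."""
    return Counter(v for v in values if v not in accepted)


# Rebuild the platform column from the counts reported above.
platforms = ["Illumina"] * 8503 + ["Not Reported"] * 242 + ["Other"] * 11

unaccepted = count_unaccepted(platforms, ACCEPTED_PLATFORMS)
for platform, count in unaccepted.items():
    print(platform, count)
```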

The 11 experiments where the platform is Other are all RNA-Seq samples that were sequenced at BGI, with an instrument model of DNBSeq.

@chris-s-friedman to get the platform for the above from bix

For the 242 experiments where the platform is Not Reported, the breakdown by library strategy, instrument model, and sequencing center is below. Note that none of these experiments have a value for instrument model.

library_strategy  instrument_model  sequencing_center_id  sequencing center name  count
RNA-Seq           Not Reported      SC_2ZBAMKK0           Novogene                81
WGS               Not Reported      SC_2ZBAMKK0           Novogene                131
WGS               Not Reported      SC_FAD4KCQG           BGI                     15
WGS               Not Reported      SC_N1EVHSME           NantOmics               10
WGS               Not Reported      SC_WWEQ9HFY           BGI@CHOP Genome Center  5

@chris-s-friedman to look through past files to get previously investigated platform
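To see where the 242 experiments concentrate, the table above can be rolled up by sequencing center and by library strategy. The following is a minimal sketch using only the rows from this issue; the variable names are illustrative and nothing here reflects how the dataservice export was actually queried.

```python
from collections import Counter

# (library_strategy, sequencing_center_id, sequencing center name, count)
# copied from the table in this issue; instrument model is Not Reported for all.
rows = [
    ("RNA-Seq", "SC_2ZBAMKK0", "Novogene", 81),
    ("WGS", "SC_2ZBAMKK0", "Novogene", 131),
    ("WGS", "SC_FAD4KCQG", "BGI", 15),
    ("WGS", "SC_N1EVHSME", "NantOmics", 10),
    ("WGS", "SC_WWEQ9HFY", "BGI@CHOP Genome Center", 5),
]

by_center = Counter()
by_strategy = Counter()
for strategy, center_id, center_name, count in rows:
    by_center[center_name] += count
    by_strategy[strategy] += count

total = sum(by_center.values())
print("total:", total)
print("by center:", by_center.most_common())
print("by strategy:", by_strategy.most_common())
```

Most of these experiments come from a single center, so one answer about that center's platform would likely resolve the majority of the gap.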

Instrument Model

Actual Values

Instrument Model Count
Not Reported 5838
HiSeq 1809
HiSeq X 1007
Novaseq 6000 91
DNBSeq 11

None of these instrument models are accepted values in their data model.

Neither HiSeq nor HiSeq X is an accepted value, but the data model does have values for HiSeq X Five and HiSeq X Ten.

There is no Novaseq instrument model in their enumerated values.

There is no DNBSeq instrument model in their enumerated values.

@baileyckelly to ask ccdi if these values above are acceptable
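One way to shortlist candidate replacements while waiting on that answer is fuzzy matching against the enumerated values. The sketch below uses Python's `difflib.get_close_matches`; the accepted list is deliberately partial, since HiSeq X Five and HiSeq X Ten are the only accepted instrument models named in this issue, and any suggestion still needs a human decision.

```python
from difflib import get_close_matches

# Partial list: the only accepted instrument models named in this issue.
# The full enumeration in the data model is not reproduced here.
KNOWN_ACCEPTED_MODELS = ["HiSeq X Five", "HiSeq X Ten"]

# Reported instrument models from the table above.
reported_models = ["HiSeq", "HiSeq X", "Novaseq 6000", "DNBSeq"]

# Map each reported value to the accepted values it resembles, if any.
suggestions = {
    model: get_close_matches(model, KNOWN_ACCEPTED_MODELS, n=3, cutoff=0.6)
    for model in reported_models
}

for model, candidates in suggestions.items():
    print(model, "->", candidates or "no close match")
```

A close match only says the strings look alike. It cannot tell whether a given HiSeq X run was on a Five or a Ten system.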

Of the Not Reported instrument models:

  1. 199 experiments are cbtn experiments from pre-x01
  2. 76 experiments are pnoc 003/008 experiments created before February 2023
  3. 5449 experiments are from cbtn x01
  4. 40 experiments are pnoc 003/008 experiments created on 2/6/2023 and 2/8/2023 that look to be associated with cbtn x01
  5. 74 experiments are associated with cbtn x01 under the study ID SD_8C478S85, High Incidence of Pediatric CNS Tumors, D3B-PCNST.

Items 1 and 2 will need some further investigation.

Items 3, 4, and 5 are all from the cbtn x01 and should all have similar instrument models.
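As a sanity check, the counts reported in this issue can be reconciled against each other. The snippet below is a sketch that only uses numbers from this issue; the dictionary keys are shorthand labels, not dataservice fields.

```python
# Breakdown of the Not Reported instrument models (items 1 to 5 above).
not_reported_breakdown = {
    "cbtn pre-x01": 199,
    "pnoc 003/008 before February 2023": 76,
    "cbtn x01": 5449,
    "pnoc 003/008 on 2/6/2023 and 2/8/2023": 40,
    "cbtn x01 under SD_8C478S85": 74,
}

# Instrument model and platform counts from the tables in this issue.
instrument_counts = {
    "Not Reported": 5838, "HiSeq": 1809, "HiSeq X": 1007,
    "Novaseq 6000": 91, "DNBSeq": 11,
}
platform_counts = {"Illumina": 8503, "Not Reported": 242, "Other": 11}

TOTAL_EXPERIMENTS = 8756

breakdown_total = sum(not_reported_breakdown.values())
print(breakdown_total == instrument_counts["Not Reported"])
print(sum(instrument_counts.values()) == TOTAL_EXPERIMENTS)
print(sum(platform_counts.values()) == TOTAL_EXPERIMENTS)

# Items 3, 4, and 5 are the ones tied to cbtn x01.
x01_total = 5449 + 40 + 74
print("x01-related Not Reported experiments:", x01_total)
```

All three checks hold, so the five categories account for every Not Reported instrument model and both tables cover all 8756 experiments.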

Library Selection

For RNA-Seq samples, this is missing for all pre-x01 data.

For WGS, WXS, and Targeted Capture, this is missing for both pre-x01 and x01 data.


From the metadata template:

For sequencing files, please try to provide all metadata, if applicable, for the following properties: avg_read_length, number_of_reads, number_of_bp, coverage

Number of Reads

Missing for 3192 experiments, all pre-x01.

Mean read length

Missing for 3192 experiments, all pre-x01.

Coverage

Missing for all experiments

Number of bp

Missing for all experiments
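The per-property counts above could be produced with a check along these lines. This is a sketch under assumptions: the four property names come from the metadata template quoted in this issue, but the `count_missing` helper, the set of placeholder values treated as missing, and the example records are all hypothetical.

```python
# Property names taken from the metadata template quoted in this issue.
REQUIRED_FIELDS = ["avg_read_length", "number_of_reads", "number_of_bp", "coverage"]

# Values treated as missing (an assumption about how the export encodes gaps).
MISSING_MARKERS = {None, "", "Not Reported"}


def count_missing(records, fields=REQUIRED_FIELDS):
    """Return, per field, how many records lack a usable value."""
    counts = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            # An absent key and a placeholder value both count as missing.
            if record.get(field) in MISSING_MARKERS:
                counts[field] += 1
    return counts


# Hypothetical experiments, for illustration only.
experiments = [
    {"avg_read_length": 150, "number_of_reads": 1000000},
    {"avg_read_length": "Not Reported", "number_of_reads": None},
]

missing = count_missing(experiments)
for field, count in missing.items():
    print(field, count)
```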

Expected behavior

No response

Version ID

None

Affected file(s)

chris-s-friedman commented 1 year ago

Closing in favor of a Jira ticket.