Closed achave11-ucsc closed 1 year ago
Spike to determine if we are justified (based on schema) to assume that read_index is present in every FASTQ file.
HCA files are described by one of the following schemas:
FASTQ is a sequence data format, so we would expect all FASTQ files to use the sequence_file schema. read_index is indeed a mandatory field in that schema.
However, I found a snapshot that contains FASTQ files which use the analysis_file schema instead. The read_index field is not mentioned in that schema, and it is missing from these files.
select analysis_file_id, content
from `datarepo-2917ceb6.hca_prod_5b3285614a9740acb7ad6a90fc59d374__20220117_dcp2_20230314_dcp25.analysis_file`
where json_extract_scalar(content, '$.file_core.format') = 'fastq.gz'
[{
"analysis_file_id": "4187f801-a030-45e7-9186-0162fb052253",
"content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6698_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"4187f801-a030-45e7-9186-0162fb052253\",\"submission_date\":\"2021-11-22T11:13:37.613Z\",\"update_date\":\"2021-11-30T13:52:33.834Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
"analysis_file_id": "53a287f3-4662-402a-8822-b8c4050a5a62",
"content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1864_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"53a287f3-4662-402a-8822-b8c4050a5a62\",\"submission_date\":\"2021-11-22T11:13:37.671Z\",\"update_date\":\"2021-11-30T13:52:33.859Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
"analysis_file_id": "5f3614cd-074b-4bfc-a7c8-1982c574150e",
"content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1865_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"5f3614cd-074b-4bfc-a7c8-1982c574150e\",\"submission_date\":\"2021-11-22T11:13:37.691Z\",\"update_date\":\"2021-11-30T13:52:33.864Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
"analysis_file_id": "bf9e2f20-c74c-47d5-8d47-a9b01e0302ec",
"content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6700_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"bf9e2f20-c74c-47d5-8d47-a9b01e0302ec\",\"submission_date\":\"2021-11-22T11:13:37.641Z\",\"update_date\":\"2021-11-30T13:52:33.845Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
"analysis_file_id": "e30fa256-e9e3-4265-a7ba-8a2ab05caa34",
"content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6699_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"e30fa256-e9e3-4265-a7ba-8a2ab05caa34\",\"submission_date\":\"2021-11-22T11:13:37.627Z\",\"update_date\":\"2021-11-30T13:52:33.839Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
"analysis_file_id": "ee268446-370f-4584-948e-c07869165db2",
"content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1863_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"ee268446-370f-4584-948e-c07869165db2\",\"submission_date\":\"2021-11-22T11:13:37.656Z\",\"update_date\":\"2021-11-30T13:52:33.852Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}]
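The check behind this query can also be sketched in Python. This is a minimal sketch, not Azul's code: the helper names schema_name and lacks_read_index are hypothetical, and it assumes that read_index, when present, sits at the top level of the metadata document. The sample row is abbreviated from the first result above.

```python
import json


def schema_name(doc: dict) -> str:
    # The schema name is the last path segment of the describedBy URL.
    return doc["describedBy"].rsplit("/", 1)[-1]


def lacks_read_index(content: str) -> bool:
    # `content` is the JSON string from the `content` column.
    doc = json.loads(content)
    is_fastq = doc["file_core"]["format"] == "fastq.gz"
    # Assumption: read_index is a top-level property of the document.
    return is_fastq and "read_index" not in doc


# Abbreviated from the first row of the query result.
row_content = json.dumps({
    "describedBy": "https://schema.humancellatlas.org/type/file/6.3.0/analysis_file",
    "schema_type": "file",
    "file_core": {"file_name": "AN6698_UMI2.fastq.gz", "format": "fastq.gz"},
})

print(schema_name(json.loads(row_content)), lacks_read_index(row_content))
```

For the abbreviated row this reports the analysis_file schema and a missing read_index, matching the finding above.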
This is the only case in dcp31 of FASTQ files that are not sequence_files.
This is project ID 5b328561-4a97-40ac-b7ad-6a90fc59d374, added to dcp25 in March of this year.
Assignee to contact wranglers about this.
Spike to 1) narrow the reproduction (reduce the filters) and 2) see if other manifest formats are also impacted.
Reproduce by running:
http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?filters={"fileName":{"is":["AP1865_UMI2.fastq.gz"]}}&format=terra.bdbag'
The formats compact, curl and terra.pfb are not impacted by this issue.
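To compare formats side by side, the reproduction URL can be built for each one. The sketch below only constructs the URLs and sends nothing; the helper manifest_url is hypothetical, and the endpoint, filter, and format names are taken from the reproduction above.

```python
import json
from urllib.parse import urlencode

BASE = "https://service.azul.data.humancellatlas.org/fetch/manifest/files"


def manifest_url(file_name: str, manifest_format: str) -> str:
    # Same filter as in the reproduction: match a single file by name.
    filters = {"fileName": {"is": [file_name]}}
    query = urlencode({
        "filters": json.dumps(filters, separators=(",", ":")),
        "format": manifest_format,
    })
    return f"{BASE}?{query}"


# terra.bdbag is the impacted format; the other three are not.
for fmt in ("compact", "curl", "terra.pfb", "terra.bdbag"):
    print(manifest_url("AP1865_UMI2.fastq.gz", fmt))
```

The filters value is percent-encoded here, so each printed URL can be passed to an HTTP client without extra shell quoting.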
Spike for next steps.
@hannes-ucsc: "Wranglers got back to me, reporting that this was an error. They intend to prepare a replacement. Until then, we will remove the offending project from dcp32 using pop. Since we will soon switch to dcp32 as the default, we don't need to worry about dcp31."
For demo, attempt to reproduce with dcp31 (if present) and dcp32. It should be reproducible in the former but not in the latter. Use the original reproduction.
… for the following request:
This caused an alarm notification. CW logs indicate that this is due to a KeyError.
Original StepF exec: azul-manifest-prod:3dcb2e4a-c485-4476-be4b-5fe3839e7a97
Repro StepF exec: azul-manifest-prod:6480e84b-c2ea-4a42-be9e-372368ff9775
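The CW logs show only that a KeyError occurred. As an illustration of how a missing read_index could produce one, and not a copy of the actual Azul code path, the sketch below contrasts strict and lenient access. Both function names are hypothetical, and the document is abbreviated from the query result above.

```python
from typing import Optional


def read_index_strict(doc: dict) -> str:
    # Direct indexing assumes the field is always present.
    return doc["read_index"]


def read_index_lenient(doc: dict) -> Optional[str]:
    # Returns None instead of raising when the field is absent.
    return doc.get("read_index")


# An analysis_file document for a FASTQ file, with no read_index field.
doc = {
    "describedBy": "https://schema.humancellatlas.org/type/file/6.3.0/analysis_file",
    "file_core": {"file_name": "AP1865_UMI2.fastq.gz", "format": "fastq.gz"},
}

try:
    read_index_strict(doc)
except KeyError as e:
    print(f"KeyError: {e}")

print(read_index_lenient(doc))
```

Whether lenient access is the right fix depends on whether such files are considered valid; in this case the wranglers confirmed the data was in error.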