DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

FASTQ files described as analysis_file instead of sequence_file #5606

Closed achave11-ucsc closed 1 year ago

achave11-ucsc commented 1 year ago

… for the following request:

http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?filters={"genusSpecies":{"is":["Homo sapiens"]},"fileFormat":{"is":[".bed.gz",".csv",".h5",".h5ad",".mtx",".mtx.gz",".rds",".tsv",".tsv.gz",".txt",".txt.gz","asc","bai","bam","bed.gz","cloupe","csv","csv.gz","fastq","fastq.gz","fq","fq.gz","gz","h5","h5.gz","h5ad","h5ad.gz","h5ad.tar.gz","h5ad.zip","jpg","jpg.gz","json","json.gz","loom","loom.gz","md5","mtx","mtx.gz","nd2","ndpi","pkc.gz","png","png.gz","RAW.tar","RData","RData.gz","Rdata.gz","rds","RDS","Rds","rds.gz","Rds.gz","RDS.gz","Robj","Robj.gz","son.gz","tab","tar","tar.gz","tif","tif.gz","tiff","tsv","tsv.gz","txt","txt.gz","xls.gz","xlsx","zip","zip.gz"]}}&catalog=dcp31&format=terra.bdbag'
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 697
Content-Type: application/json
Date: Tue, 10 Oct 2023 16:19:25 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains
Via: 1.1 538a08eba98551a196e344df4d0dda06.cloudfront.net (CloudFront)
X-Amz-Cf-Id: N3RPPgGlWt8axcrqB3JjR3rryDC8O0SfVIqO0SgsI9AmXG2Z3fSS8A==
X-Amz-Cf-Pop: LAX50-C1
X-Amzn-Trace-Id: Root=1-6525797e-2aade9383dd1abec0183182a;Sampled=0;lineage=2808a99a:0
X-Cache: Miss from cloudfront
x-amz-apigw-id: Ml_r2EoGoAMFkTg=
x-amzn-RequestId: 660cb271-b58a-4339-aa63-5d7c46204dc1

{
    "CommandLine": {
        "bash": "curl 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D'",
        "cmd.exe": "curl.exe \"https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D\""
    },
    "Location": "https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D",
    "Retry-After": 1,
    "Status": 301
}
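The body above reports a Status of 301 together with a Location to request next and a Retry-After delay. A minimal sketch of a client that follows such bodies could look like the following. It assumes that any status other than 301 means polling is finished and that the final Location points at the manifest. The `fetch` callable and the function name are hypothetical and are not part of Azul.

```python
import time
from typing import Callable


def poll_manifest(fetch: Callable[[str], dict],
                  url: str,
                  max_attempts: int = 10,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """
    Request `url` repeatedly, following the Location in each JSON body
    for as long as the body reports a Status of 301.

    `fetch` is a hypothetical callable that performs the HTTP request and
    returns the decoded JSON body. A server-side failure such as the 500
    shown in this issue is expected to surface as an exception from `fetch`.
    """
    for _ in range(max_attempts):
        body = fetch(url)
        if body['Status'] != 301:
            # Assumption: a status other than 301 ends the polling
            return body['Location']
        # Honor the delay requested by the service before trying again
        sleep(body.get('Retry-After', 1))
        url = body['Location']
    raise TimeoutError(f'No final response after {max_attempts} attempts')
```

Injecting `fetch` and `sleep` keeps the loop testable without network access or real delays.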
…
http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D'
HTTP/1.1 500 Internal Server Error
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 782
Content-Type: text/plain
Date: Tue, 10 Oct 2023 16:22:52 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains
Via: 1.1 ff59c1cd74c841ab9a3ebd5370e3b24a.cloudfront.net (CloudFront)
X-Amz-Cf-Id: KleT1ulXtmFQdiMMUNi2SGUqti6VYB6rM-pcxioTdKr86kAClxBvoA==
X-Amz-Cf-Pop: LAX50-C1
X-Amzn-Trace-Id: Root=1-65257a5c-701b2d5b4ed8a9955963012c;Sampled=0;lineage=2808a99a:0
X-Cache: Error from cloudfront
x-amz-apigw-id: MmAOgHCfIAMF6FA=
x-amzn-RequestId: 83945c47-be64-4b55-8385-c19b998f0980

Traceback (most recent call last):
  File "/var/task/chalice/app.py", line 1913, in _get_view_function_response
    response = view_function(**function_args)
  File "/var/task/app.py", line 1306, in fetch_file_manifest
    return _file_manifest(fetch=True)
  File "/var/task/app.py", line 1328, in _file_manifest
    return app.manifest_controller.get_manifest_async(self_url=app.self_url,
  File "/var/task/azul/service/manifest_controller.py", line 148, in get_manifest_async
    token_or_state = self.async_service.inspect_generation(token)
  File "/var/task/azul/service/async_manifest_service.py", line 100, in inspect_generation
    raise StateMachineError(status, output)
azul.service.async_manifest_service.StateMachineError: ('Failed to generate manifest', 'FAILED', None)

This triggered an alarm notification. The CloudWatch (CW) logs indicate that the failure is due to a KeyError.

[ERROR] KeyError: 'read_index'
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1752, in __call__
    return self.handler(event_obj)
  File "/var/task/chalice/app.py", line 1708, in __call__
    return self.handler(request, self.next_handler)
  File "/var/task/azul/chalice.py", line 191, in _lambda_context_middleware
    return get_response(event)
  File "/var/task/chalice/app.py", line 1698, in __call__
    return self._original_func(event.to_dict(), event.context)
  File "/var/task/app.py", line 1339, in generate_manifest
    return app.manifest_controller.get_manifest(event)
  File "/var/task/azul/service/manifest_controller.py", line 80, in get_manifest
    result = self.service.get_manifest(format_=ManifestFormat(state['format_']),
  File "/var/task/azul/service/manifest_service.py", line 378, in get_manifest
    partition = generator.write(object_key, partition)
  File "/var/task/azul/service/manifest_service.py", line 1068, in write
    file_path, base_name = self.create_file()
  File "/var/task/azul/service/manifest_service.py", line 1523, in create_file
    self._samples_tsv(samples_tsv)
  File "/var/task/azul/service/manifest_service.py", line 1616, in _samples_tsv
    qualifier = f"fastq_{file['read_index']}"

Original StepF exec: azul-manifest-prod:3dcb2e4a-c485-4476-be4b-5fe3839e7a97
Repro StepF exec: azul-manifest-prod:6480e84b-c2ea-4a42-be9e-372368ff9775
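The last frame of the traceback subscripts the file with 'read_index', which raises a KeyError for any file that lacks that key. The sketch below isolates that expression and contrasts it with a lookup that tolerates the missing key. The function names and the sample dictionaries are hypothetical, and the tolerant variant only illustrates the failure mode; it is not a proposed fix.

```python
from typing import Optional


def fastq_qualifier(file: dict) -> str:
    # Same expression as in the last frame of the traceback:
    # raises KeyError if the file has no 'read_index' entry
    return f"fastq_{file['read_index']}"


def fastq_qualifier_or_none(file: dict) -> Optional[str]:
    # Tolerant variant: returns None instead of raising when the
    # entry is absent, leaving the decision to the caller
    read_index = file.get('read_index')
    return None if read_index is None else f'fastq_{read_index}'


# Hypothetical file entries, for illustration only
with_index = {'name': 'a.fastq.gz', 'read_index': 'read1'}
without_index = {'name': 'b.fastq.gz'}

print(fastq_qualifier(with_index))
print(fastq_qualifier_or_none(without_index))
```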

achave11-ucsc commented 1 year ago

Spike to determine whether we are justified (based on the schema) in assuming that read_index is present in every FASTQ file.

nadove-ucsc commented 1 year ago

HCA files are described by one of the following schemas:

FASTQ is a sequence data format, so we would expect all FASTQ files to use the sequence_file schema. read_index is indeed a mandatory field in that schema.

However, I found a snapshot that contains FASTQ files which use the analysis_file schema instead. The read_index field is not mentioned in that schema, and it is missing from these files.

select analysis_file_id, content
from `datarepo-2917ceb6.hca_prod_5b3285614a9740acb7ad6a90fc59d374__20220117_dcp2_20230314_dcp25.analysis_file`
where json_extract_scalar(content, '$.file_core.format') = 'fastq.gz'

[{
  "analysis_file_id": "4187f801-a030-45e7-9186-0162fb052253",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6698_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"4187f801-a030-45e7-9186-0162fb052253\",\"submission_date\":\"2021-11-22T11:13:37.613Z\",\"update_date\":\"2021-11-30T13:52:33.834Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "53a287f3-4662-402a-8822-b8c4050a5a62",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1864_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"53a287f3-4662-402a-8822-b8c4050a5a62\",\"submission_date\":\"2021-11-22T11:13:37.671Z\",\"update_date\":\"2021-11-30T13:52:33.859Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "5f3614cd-074b-4bfc-a7c8-1982c574150e",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1865_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"5f3614cd-074b-4bfc-a7c8-1982c574150e\",\"submission_date\":\"2021-11-22T11:13:37.691Z\",\"update_date\":\"2021-11-30T13:52:33.864Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "bf9e2f20-c74c-47d5-8d47-a9b01e0302ec",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6700_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"bf9e2f20-c74c-47d5-8d47-a9b01e0302ec\",\"submission_date\":\"2021-11-22T11:13:37.641Z\",\"update_date\":\"2021-11-30T13:52:33.845Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "e30fa256-e9e3-4265-a7ba-8a2ab05caa34",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6699_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"e30fa256-e9e3-4265-a7ba-8a2ab05caa34\",\"submission_date\":\"2021-11-22T11:13:37.627Z\",\"update_date\":\"2021-11-30T13:52:33.839Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "ee268446-370f-4584-948e-c07869165db2",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1863_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"ee268446-370f-4584-948e-c07869165db2\",\"submission_date\":\"2021-11-22T11:13:37.656Z\",\"update_date\":\"2021-11-30T13:52:33.852Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}]
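The rows above can also be checked programmatically. The sketch below parses the content column of one row, trimmed here to the relevant fields, and reports which schema describes it and whether read_index is present. It assumes that read_index, when present, is a top-level property of the document. The helper names are hypothetical.

```python
import json


def schema_name(content: str) -> str:
    # The last path segment of describedBy names the schema
    doc = json.loads(content)
    return doc['describedBy'].rsplit('/', 1)[-1]


def has_read_index(content: str) -> bool:
    # Assumption: read_index is a top-level property of the document
    return 'read_index' in json.loads(content)


# Content of the first row above, trimmed to the relevant fields
content = json.dumps({
    'describedBy': 'https://schema.humancellatlas.org/type/file/6.3.0/analysis_file',
    'schema_type': 'file',
    'file_core': {
        'file_name': 'AN6698_UMI2.fastq.gz',
        'format': 'fastq.gz'
    }
})

print(schema_name(content), has_read_index(content))
```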

nadove-ucsc commented 1 year ago

This is the only case in dcp31 of FASTQ files that are not sequence_files.

achave11-ucsc commented 1 year ago

This is project ID 5b328561-4a97-40ac-b7ad-6a90fc59d374, added to dcp25 in March of this year.

achave11-ucsc commented 1 year ago

Assignee to contact wranglers about this.

hannes-ucsc commented 1 year ago

https://humancellatlas.slack.com/archives/C9XD6L0AD/p1697650811114159

hannes-ucsc commented 1 year ago

Spike to 1) narrow the reproduction (reduce the filters) and 2) see whether other manifest formats are also impacted.

achave11-ucsc commented 1 year ago

Reproduce by running:

http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?filters={"fileName":{"is":["AP1865_UMI2.fastq.gz"]}}&format=terra.bdbag'

The formats compact, curl and terra.pfb are not impacted by this issue.
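To repeat the narrowed reproduction for each format, and optionally for a specific catalog, the request URLs can be generated rather than typed by hand. The sketch below only builds the URLs and does not send any requests. The endpoint, the filter, the format names and the catalog parameter are taken from the requests in this issue; the function name is hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode

BASE_URL = 'https://service.azul.data.humancellatlas.org/fetch/manifest/files'

FORMATS = ['compact', 'curl', 'terra.pfb', 'terra.bdbag']


def manifest_url(file_name: str,
                 format_: str,
                 catalog: Optional[str] = None) -> str:
    # The filters parameter is a JSON document; urlencode() takes care
    # of percent-encoding it for use in the query string
    params = {
        'filters': json.dumps({'fileName': {'is': [file_name]}}),
        'format': format_
    }
    if catalog is not None:
        params['catalog'] = catalog
    return f'{BASE_URL}?{urlencode(params)}'


for format_ in FORMATS:
    print(manifest_url('AP1865_UMI2.fastq.gz', format_))
```

Passing `catalog='dcp31'` or `catalog='dcp32'` yields the same request against a specific catalog.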

achave11-ucsc commented 1 year ago

Spike for next steps.

dsotirho-ucsc commented 1 year ago

@hannes-ucsc: "Wranglers got back to me, reporting that this was an error. They intend to prepare a replacement. Until then, we will remove the offending project from dcp32 using pop. Since we will soon switch to dcp32 as the default, we don't need to worry about dcp31."

hannes-ucsc commented 1 year ago

For demo, attempt to reproduce with dcp31 (if present) and dcp32. It should be reproducible in the former but not in the latter. Use the original reproduction.