DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

FASTQ files described as analysis_file instead of sequence_file #5606

Closed achave11-ucsc closed 11 months ago

achave11-ucsc commented 1 year ago

… for the following request:

http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?filters={"genusSpecies":{"is":["Homo sapiens"]},"fileFormat":{"is":[".bed.gz",".csv",".h5",".h5ad",".mtx",".mtx.gz",".rds",".tsv",".tsv.gz",".txt",".txt.gz","asc","bai","bam","bed.gz","cloupe","csv","csv.gz","fastq","fastq.gz","fq","fq.gz","gz","h5","h5.gz","h5ad","h5ad.gz","h5ad.tar.gz","h5ad.zip","jpg","jpg.gz","json","json.gz","loom","loom.gz","md5","mtx","mtx.gz","nd2","ndpi","pkc.gz","png","png.gz","RAW.tar","RData","RData.gz","Rdata.gz","rds","RDS","Rds","rds.gz","Rds.gz","RDS.gz","Robj","Robj.gz","son.gz","tab","tar","tar.gz","tif","tif.gz","tiff","tsv","tsv.gz","txt","txt.gz","xls.gz","xlsx","zip","zip.gz"]}}&catalog=dcp31&format=terra.bdbag'
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 697
Content-Type: application/json
Date: Tue, 10 Oct 2023 16:19:25 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains
Via: 1.1 538a08eba98551a196e344df4d0dda06.cloudfront.net (CloudFront)
X-Amz-Cf-Id: N3RPPgGlWt8axcrqB3JjR3rryDC8O0SfVIqO0SgsI9AmXG2Z3fSS8A==
X-Amz-Cf-Pop: LAX50-C1
X-Amzn-Trace-Id: Root=1-6525797e-2aade9383dd1abec0183182a;Sampled=0;lineage=2808a99a:0
X-Cache: Miss from cloudfront
x-amz-apigw-id: Ml_r2EoGoAMFkTg=
x-amzn-RequestId: 660cb271-b58a-4339-aa63-5d7c46204dc1

{
    "CommandLine": {
        "bash": "curl 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D'",
        "cmd.exe": "curl.exe \"https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D\""
    },
    "Location": "https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D",
    "Retry-After": 1,
    "Status": 301
}
…
http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?token=eyJleGVjdXRpb25faWQiOiAiNjQ4MGU4NGItYzJlYS00YTQyLWJlOWUtMzcyMzY4ZmY5Nzc1IiwgInJlcXVlc3RfaW5kZXgiOiAwLCAid2FpdF90aW1lIjogMX0%3D'
HTTP/1.1 500 Internal Server Error
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 782
Content-Type: text/plain
Date: Tue, 10 Oct 2023 16:22:52 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains
Via: 1.1 ff59c1cd74c841ab9a3ebd5370e3b24a.cloudfront.net (CloudFront)
X-Amz-Cf-Id: KleT1ulXtmFQdiMMUNi2SGUqti6VYB6rM-pcxioTdKr86kAClxBvoA==
X-Amz-Cf-Pop: LAX50-C1
X-Amzn-Trace-Id: Root=1-65257a5c-701b2d5b4ed8a9955963012c;Sampled=0;lineage=2808a99a:0
X-Cache: Error from cloudfront
x-amz-apigw-id: MmAOgHCfIAMF6FA=
x-amzn-RequestId: 83945c47-be64-4b55-8385-c19b998f0980

Traceback (most recent call last):
  File "/var/task/chalice/app.py", line 1913, in _get_view_function_response
    response = view_function(**function_args)
  File "/var/task/app.py", line 1306, in fetch_file_manifest
    return _file_manifest(fetch=True)
  File "/var/task/app.py", line 1328, in _file_manifest
    return app.manifest_controller.get_manifest_async(self_url=app.self_url,
  File "/var/task/azul/service/manifest_controller.py", line 148, in get_manifest_async
    token_or_state = self.async_service.inspect_generation(token)
  File "/var/task/azul/service/async_manifest_service.py", line 100, in inspect_generation
    raise StateMachineError(status, output)
azul.service.async_manifest_service.StateMachineError: ('Failed to generate manifest', 'FAILED', None)

This caused an alarm notification. CloudWatch (CW) logs indicate that the failure is due to a KeyError:

[ERROR] KeyError: 'read_index'
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1752, in __call__
    return self.handler(event_obj)
  File "/var/task/chalice/app.py", line 1708, in __call__
    return self.handler(request, self.next_handler)
  File "/var/task/azul/chalice.py", line 191, in _lambda_context_middleware
    return get_response(event)
  File "/var/task/chalice/app.py", line 1698, in __call__
    return self._original_func(event.to_dict(), event.context)
  File "/var/task/app.py", line 1339, in generate_manifest
    return app.manifest_controller.get_manifest(event)
  File "/var/task/azul/service/manifest_controller.py", line 80, in get_manifest
    result = self.service.get_manifest(format_=ManifestFormat(state['format_']),
  File "/var/task/azul/service/manifest_service.py", line 378, in get_manifest
    partition = generator.write(object_key, partition)
  File "/var/task/azul/service/manifest_service.py", line 1068, in write
    file_path, base_name = self.create_file()
  File "/var/task/azul/service/manifest_service.py", line 1523, in create_file
    self._samples_tsv(samples_tsv)
  File "/var/task/azul/service/manifest_service.py", line 1616, in _samples_tsv
    qualifier = f"fastq_{file['read_index']}"

Original StepF exec: azul-manifest-prod:3dcb2e4a-c485-4476-be4b-5fe3839e7a97
Repro StepF exec: azul-manifest-prod:6480e84b-c2ea-4a42-be9e-372368ff9775
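
The last frame of the CloudWatch traceback indexes `file['read_index']` unconditionally. Below is a minimal sketch of how that lookup fails for a file document that lacks the field, next to one possible defensive variant. The function names, the shape of the `file` dictionaries, the file name `example_R1.fastq.gz` and the value `read1` are assumptions made for illustration; this is not Azul's actual code.

```python
from typing import Any, Mapping, Optional


def fastq_qualifier_strict(file: Mapping[str, Any]) -> str:
    # Mirrors the failing line in the traceback: raises KeyError when the
    # file document has no 'read_index' entry.
    return f"fastq_{file['read_index']}"


def fastq_qualifier_lenient(file: Mapping[str, Any]) -> Optional[str]:
    # Hypothetical alternative: return None instead of raising, so that the
    # caller can decide what to do with a FASTQ that has no read index.
    read_index = file.get('read_index')
    return None if read_index is None else f"fastq_{read_index}"


# Hypothetical file documents; only the second file name comes from the issue.
sequence_file = {'name': 'example_R1.fastq.gz', 'read_index': 'read1'}
analysis_file = {'name': 'AP1865_UMI2.fastq.gz'}  # no 'read_index'

print(fastq_qualifier_strict(sequence_file))
print(fastq_qualifier_lenient(analysis_file))
try:
    fastq_qualifier_strict(analysis_file)
except KeyError as e:
    print(f'KeyError: {e}')
```

Whether a lenient lookup is appropriate depends on whether the metadata or the code is considered to be at fault, which is what the rest of this thread tries to establish.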

achave11-ucsc commented 1 year ago

Spike to determine whether the schema justifies the assumption that read_index is present in every FASTQ file.

nadove-ucsc commented 1 year ago

HCA files are described by one of the following schemas:

FASTQ is a sequence data format, so we would expect all FASTQ files to use the sequence_file schema. read_index is indeed a mandatory field in that schema.

However, I found a snapshot that contains FASTQ files which use the analysis_file schema instead. The read_index field is not mentioned in that schema, and it is missing from these files.

select analysis_file_id, content
from `datarepo-2917ceb6.hca_prod_5b3285614a9740acb7ad6a90fc59d374__20220117_dcp2_20230314_dcp25.analysis_file`
where json_extract_scalar(content, '$.file_core.format') = 'fastq.gz'

[{
  "analysis_file_id": "4187f801-a030-45e7-9186-0162fb052253",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6698_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"4187f801-a030-45e7-9186-0162fb052253\",\"submission_date\":\"2021-11-22T11:13:37.613Z\",\"update_date\":\"2021-11-30T13:52:33.834Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "53a287f3-4662-402a-8822-b8c4050a5a62",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1864_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"53a287f3-4662-402a-8822-b8c4050a5a62\",\"submission_date\":\"2021-11-22T11:13:37.671Z\",\"update_date\":\"2021-11-30T13:52:33.859Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "5f3614cd-074b-4bfc-a7c8-1982c574150e",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1865_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"5f3614cd-074b-4bfc-a7c8-1982c574150e\",\"submission_date\":\"2021-11-22T11:13:37.691Z\",\"update_date\":\"2021-11-30T13:52:33.864Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "bf9e2f20-c74c-47d5-8d47-a9b01e0302ec",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6700_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"bf9e2f20-c74c-47d5-8d47-a9b01e0302ec\",\"submission_date\":\"2021-11-22T11:13:37.641Z\",\"update_date\":\"2021-11-30T13:52:33.845Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "e30fa256-e9e3-4265-a7ba-8a2ab05caa34",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AN6699_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"e30fa256-e9e3-4265-a7ba-8a2ab05caa34\",\"submission_date\":\"2021-11-22T11:13:37.627Z\",\"update_date\":\"2021-11-30T13:52:33.839Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}, {
  "analysis_file_id": "ee268446-370f-4584-948e-c07869165db2",
  "content": "{\"describedBy\":\"https://schema.humancellatlas.org/type/file/6.3.0/analysis_file\",\"schema_type\":\"file\",\"file_core\":{\"file_name\":\"AP1863_UMI2.fastq.gz\",\"format\":\"fastq.gz\",\"content_description\":[{\"text\":\"DNA sequence (raw)\",\"ontology\":\"data:3494\",\"ontology_label\":\"DNA sequence\"}],\"file_source\":\"Publication\"},\"provenance\":{\"document_id\":\"ee268446-370f-4584-948e-c07869165db2\",\"submission_date\":\"2021-11-22T11:13:37.656Z\",\"update_date\":\"2021-11-30T13:52:33.852Z\",\"schema_major_version\":6,\"schema_minor_version\":3}}"
}]

nadove-ucsc commented 1 year ago

This is the only case in dcp31 of FASTQ files that are not sequence_files.
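
A check like the one behind this finding can be sketched in Python over rows shaped like the query result above. The helper names and the set of FASTQ format strings (taken from the `fileFormat` filter in the original request) are assumptions for illustration, and the sample row is abbreviated from the first result:

```python
import json
from typing import Any, Mapping

# Assumed set of format strings that denote FASTQ data
FASTQ_FORMATS = {'fastq', 'fastq.gz', 'fq', 'fq.gz'}


def schema_name(content: Mapping[str, Any]) -> str:
    # The last path component of 'describedBy' names the schema,
    # e.g. '.../type/file/6.3.0/analysis_file' -> 'analysis_file'
    return content['describedBy'].rsplit('/', 1)[-1]


def is_misdescribed_fastq(row: Mapping[str, str]) -> bool:
    # 'content' is a JSON string, as in the BigQuery result above
    content = json.loads(row['content'])
    is_fastq = content['file_core']['format'] in FASTQ_FORMATS
    return is_fastq and schema_name(content) != 'sequence_file'


row = {
    'analysis_file_id': '4187f801-a030-45e7-9186-0162fb052253',
    'content': json.dumps({
        'describedBy': 'https://schema.humancellatlas.org/type/file/6.3.0/analysis_file',
        'schema_type': 'file',
        'file_core': {
            'file_name': 'AN6698_UMI2.fastq.gz',
            'format': 'fastq.gz'
        }
    })
}

print(is_misdescribed_fastq(row))
```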

achave11-ucsc commented 1 year ago

This is project ID 5b328561-4a97-40ac-b7ad-6a90fc59d374, added to dcp25 in March 2023.

achave11-ucsc commented 1 year ago

Assignee to contact wranglers about this.

hannes-ucsc commented 12 months ago

https://humancellatlas.slack.com/archives/C9XD6L0AD/p1697650811114159

hannes-ucsc commented 12 months ago

Spike to 1) narrow the reproduction (reduce the filters) and 2) see if other manifest formats are also impacted.

achave11-ucsc commented 12 months ago

Reproduce by running:

http 'https://service.azul.data.humancellatlas.org/fetch/manifest/files?filters={"fileName":{"is":["AP1865_UMI2.fastq.gz"]}}&format=terra.bdbag'

The formats compact, curl and terra.pfb are not impacted by this issue.
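
For repeated checks, the reproduction can be scripted. The sketch below builds the request URL and follows the `Status: 301` / `Location` / `Retry-After` bodies shown in the original report until a different status comes back. The function names, the injected `get_json` callable, the `max_polls` limit, and the final `Status: 302` body in the canned responses are assumptions for illustration; the demo runs against canned responses rather than the live service.

```python
import json
import time
from typing import Any, Callable, Iterator, List, Mapping
from urllib.parse import urlencode

BASE_URL = 'https://service.azul.data.humancellatlas.org/fetch/manifest/files'


def manifest_url(filters: Mapping[str, Any], format_: str) -> str:
    # Encode the filters as compact JSON inside the query string
    query = urlencode({
        'filters': json.dumps(filters, separators=(',', ':')),
        'format': format_
    })
    return f'{BASE_URL}?{query}'


def poll_manifest(url: str,
                  get_json: Callable[[str], Mapping[str, Any]],
                  sleep: Callable[[float], None] = time.sleep,
                  max_polls: int = 100) -> Mapping[str, Any]:
    # get_json is supplied by the caller; it is expected to raise if the
    # service answers with an error such as the 500 shown in this issue.
    for _ in range(max_polls):
        body = get_json(url)
        if body['Status'] != 301:
            return body
        # Generation is still running: wait, then follow the token URL
        sleep(body['Retry-After'])
        url = body['Location']
    raise TimeoutError(f'No result after {max_polls} polls')


url = manifest_url({'fileName': {'is': ['AP1865_UMI2.fastq.gz']}}, 'terra.bdbag')

# Canned responses standing in for the service (assumed shapes)
canned: Iterator[Mapping[str, Any]] = iter([
    {'Status': 301, 'Location': BASE_URL + '?token=abc', 'Retry-After': 1},
    {'Status': 302, 'Location': 'https://example.org/manifest'}
])
visited: List[str] = []


def fake_get_json(u: str) -> Mapping[str, Any]:
    visited.append(u)
    return next(canned)


result = poll_manifest(url, fake_get_json, sleep=lambda seconds: None)
print(result['Status'], len(visited))
```

Passing `get_json` and `sleep` as parameters keeps the polling logic independent of any particular HTTP client and lets it run without network access or real delays.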

achave11-ucsc commented 11 months ago

Spike for next steps.

dsotirho-ucsc commented 11 months ago

@hannes-ucsc: "Wranglers got back to me, reporting that this was an error. They intend to prepare a replacement. Until then, we will remove the offending project from dcp32 using pop. Since we will soon switch to dcp32 as the default, we don't need to worry about dcp31."

hannes-ucsc commented 11 months ago

For demo, attempt to reproduce with dcp31 (if present) and dcp32. It should be reproducible in the former but not in the latter. Use the original reproduction.