DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Integration test does not cover file downloads for AnVIL #6581

Open hannes-ucsc opened 2 months ago

hannes-ucsc commented 2 months ago

It uses ['fastq', 'fastq.gz'] for the filter but the AnVIL file formats have a leading dot, as in ['.fastq', '.fastq.gz'].
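
A minimal sketch of how the two filter variants end up in the request URL, for illustration only. `files_url` is a hypothetical helper, not the IT's actual code, and the service URL and catalog name are the ones that appear in the logs further down this thread.

```python
import json
from urllib.parse import urlencode


def files_url(base_url, catalog, file_formats, size=1):
    """Build a /index/files URL that filters on files.file_format.

    Hypothetical helper for illustration; not taken from the Azul code base.
    """
    filters = {'files.file_format': {'is': list(file_formats)}}
    query = urlencode({
        'catalog': catalog,
        'filters': json.dumps(filters),
        'size': size,
    })
    return f'{base_url}/index/files?{query}'


service = 'https://service.explore.anvilproject.org'
# Formats without the leading dot, as described in this issue
print(files_url(service, 'anvil7-it', ['fastq', 'fastq.gz']))
# Formats with the leading dot, as AnVIL stores them
print(files_url(service, 'anvil7-it', ['.fastq', '.fastq.gz']))
```

The two URLs differ only in the encoded `.` before each format, which makes the mismatch easy to overlook in logs.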

hannes-ucsc commented 2 months ago

Spike to confirm and provide evidence from IT logs.

nadove-ucsc commented 2 months ago

The IT logs were inconclusive since there legitimately were no fastq files indexed during the most recent IT on `anvilprod`. We can observe the request using the wrong filter and getting no hits:

2024-09-17 19:22:25,473    INFO MainThread test.integration_test: Beginning sub-test [single_entity] {'entity_type': 'files', 'catalog': 'anvil7-it'}
2024-09-17 19:22:25,474    INFO MainThread test.integration_test: Making GET request to 'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22fastq%22%2C+%22fastq.gz%22%5D%7D%7D&size=1&order=asc&sort=files.file_size'
2024-09-17 19:22:25,475   DEBUG MainThread test.integration_test: … without request body
2024-09-17 19:22:25,838    INFO MainThread test.integration_test: Got 200 response after 0.364s from GET to https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22fastq%22%2C+%22fastq.gz%22%5D%7D%7D&size=1&order=asc&sort=files.file_size
2024-09-17 19:22:25,838   DEBUG MainThread test.integration_test: … with response headers HTTPHeaderDict({'Content-Type': 'application/json', 'Content-Length': '1590', 'Connection': 'keep-alive', 'Date': 'Tue, 17 Sep 2024 19:22:25 GMT', 'x-amzn-RequestId': 'dd5ff7de-223b-4836-911c-b9d1278c3984', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Access-Control-Allow-Headers': 'Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key', 'X-Frame-Options': 'DENY', 'x-amz-apigw-id': 'eQ6FyFy7IAMEVGQ=', 'Cache-Control': 'no-store', 'X-Content-Type-Options': 'nosniff', 'X-Amzn-Trace-Id': 'Root=1-66e9d6f1-4a61536245c8a5ac7021b6d1;Parent=1bfc5a069914da73;Sampled=0;Lineage=1:45061563:0', 'X-Cache': 'Miss from cloudfront', 'Via': '1.1 ec22576e88e707bf58c11e0ee75d019c.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'IAD50-C2', 'X-Amz-Cf-Id': 'FRzKdw7-5Wpr1QItxguhEBPvNp6dM8NGwHdcun4eUu63XODlmorq-g=='})
2024-09-17 19:22:25,839   DEBUG MainThread test.integration_test: … with response body b'{"hits":[],"pagination":{"count":0,"total":0,"size":1,"next":null,"previous":null,"pages":0,"sort":"files.file_size","order":...'

And can reproduce using cURL:

$ curl -X 'GET'   'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22fastq%22%2C%20%22fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10'   -H 'accept: application/json' | jq .pagination
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1589  100  1589    0     0   3195      0 --:--:-- --:--:-- --:--:--  3203
{
  "count": 0,
  "total": 0,
  "size": 10,
  "next": null,
  "previous": null,
  "pages": 0,
  "sort": "files.file_id",
  "order": "asc"
}

But fixing the filter to include the leading `.` does not produce a different outcome:

$ curl -X 'GET'   'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22.fastq%22%2C%20%22.fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10'   -H 'accept: application/json' | jq .pagination
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1615  100  1615    0     0   1909      0 --:--:-- --:--:-- --:--:--  1908
{
  "count": 0,
  "total": 0,
  "size": 10,
  "next": null,
  "previous": null,
  "pages": 0,
  "sort": "files.file_id",
  "order": "asc"
}

We can confirm the problem by looking at the `anvil7` catalog instead of `anvil7-it`. Without the leading `.`:

$ curl -X 'GET'   'https://service.explore.anvilproject.org/index/files?catalog=anvil7&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22fastq%22%2C%20%22fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10'   -H 'accept: application/json' | jq .pagination
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3867  100  3867    0     0   3526      0  0:00:01  0:00:01 --:--:--  3528
{
  "count": 0,
  "total": 0,
  "size": 10,
  "next": null,
  "previous": null,
  "pages": 0,
  "sort": "files.file_id",
  "order": "asc"
}

With the leading `.`:

$ curl -X 'GET'   'https://service.explore.anvilproject.org/index/files?catalog=anvil7&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22.fastq%22%2C%20%22.fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10'   -H 'accept: application/json' | jq .pagination
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20793  100 20793    0     0   9292      0  0:00:02  0:00:02 --:--:--  9295
{
  "count": 10,
  "total": 16312,
  "size": 10,
  "next": "https://service.explore.anvilproject.org/index/files?catalog=anvil7&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22.fastq%22%2C+%22.fastq.gz%22%5D%7D%7D&search_after=%5B%2200221c3c-d6f7-3805-8f37-07ac1059b122%22%2C+%2233431214-b879-475a-96a2-c0a4442b93d6%22%5D&sort=files.file_id&order=asc&size=10",
  "previous": null,
  "pages": 1632,
  "sort": "files.file_id",
  "order": "asc"
}
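
To double-check which file formats a logged request (or a `next` pagination link) actually filtered on, the percent-encoded `filters` parameter can be decoded. This is a small sketch using only the Python standard library; `decode_filters` is a hypothetical helper name, and the URL is the one from the IT log above.

```python
import json
from urllib.parse import parse_qs, urlsplit


def decode_filters(url):
    """Return the `filters` query parameter of a service URL as a dict."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query['filters'][0])


# URL copied from the IT log entry in this comment
logged = (
    'https://service.explore.anvilproject.org/index/files'
    '?catalog=anvil7-it'
    '&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22fastq%22%2C+%22fastq.gz%22%5D%7D%7D'
    '&size=1&order=asc&sort=files.file_size'
)
print(decode_filters(logged)['files.file_format']['is'])
# prints ['fastq', 'fastq.gz'], i.e. no leading dot
```

The same function works on the multi-line, `%0A`-encoded filters from the cURL commands, since JSON ignores the extra whitespace.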

achave11-ucsc commented 1 month ago

The BI wants to zero out all the files in the 1000G snapshot in Terra Dev, https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1726758007201959.

They started doing this, causing IT to fail: https://gitlab.anvil.gi.ucsc.edu/ucsc/azul/-/jobs/48873

@nadove-ucsc: "There are two pieces to this puzzle: First, fixing the broken filter so that when fastq files are indexed during the IT from sources other than ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732 we filter for them properly. Second, during IT on AnVIL Dev we will inspect the content length header and the source name, and if the latter matches the 1000G snapshot assert that the former is zero and do not attempt to read from the download."
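
The second piece could look roughly like the sketch below. It is an assumption about the shape of the check, not the actual IT code: `should_read_download` and `ZEROED_SOURCE` are hypothetical names, the source name is compared by exact equality, and the headers are assumed to be available as a mapping keyed by `Content-Length`.

```python
from typing import Mapping

# Source name quoted in this issue; treating it as the only zeroed-out
# snapshot is an assumption made for this sketch
ZEROED_SOURCE = 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732'


def should_read_download(source_name: str, headers: Mapping[str, str]) -> bool:
    """Decide whether the IT should read from a file download.

    For the zeroed-out 1000G snapshot, assert that the content length is
    zero and skip reading. For every other source, read as usual.
    """
    content_length = int(headers['Content-Length'])
    if source_name == ZEROED_SOURCE:
        assert content_length == 0, (source_name, content_length)
        return False
    return True
```

A caller would fetch the download response headers first, call `should_read_download`, and only then read and validate the FASTQ content when it returns `True`.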

nadove-ucsc commented 1 month ago

For demo, review IT logs on GitLab anvilprod for an example of a FASTQ file being downloaded. FASTQ files are relatively rare in AnVIL catalogs, so it may take some time for us to observe them being indexed during the IT.

The IT passing will suffice as demo for the lower deployments.

@hannes-ucsc: "Consider looking for IT download requests in the service logs. This would allow you to find downloads in a single query instead of looking through every single IT job."