Open hannes-ucsc opened 2 months ago
Spike to confirm and provide evidence from IT logs.
The IT logs were inconclusive since there legitimately were no fastq files indexed during the most recent IT on `anvilprod`. We can observe the request using the wrong filter and getting no hits:
2024-09-17 19:22:25,473 INFO MainThread test.integration_test: Beginning sub-test [single_entity] {'entity_type': 'files', 'catalog': 'anvil7-it'}
2024-09-17 19:22:25,474 INFO MainThread test.integration_test: Making GET request to 'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22fastq%22%2C+%22fastq.gz%22%5D%7D%7D&size=1&order=asc&sort=files.file_size'
2024-09-17 19:22:25,475 DEBUG MainThread test.integration_test: … without request body
2024-09-17 19:22:25,838 INFO MainThread test.integration_test: Got 200 response after 0.364s from GET to https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22fastq%22%2C+%22fastq.gz%22%5D%7D%7D&size=1&order=asc&sort=files.file_size
2024-09-17 19:22:25,838 DEBUG MainThread test.integration_test: … with response headers HTTPHeaderDict({'Content-Type': 'application/json', 'Content-Length': '1590', 'Connection': 'keep-alive', 'Date': 'Tue, 17 Sep 2024 19:22:25 GMT', 'x-amzn-RequestId': 'dd5ff7de-223b-4836-911c-b9d1278c3984', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Access-Control-Allow-Headers': 'Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key', 'X-Frame-Options': 'DENY', 'x-amz-apigw-id': 'eQ6FyFy7IAMEVGQ=', 'Cache-Control': 'no-store', 'X-Content-Type-Options': 'nosniff', 'X-Amzn-Trace-Id': 'Root=1-66e9d6f1-4a61536245c8a5ac7021b6d1;Parent=1bfc5a069914da73;Sampled=0;Lineage=1:45061563:0', 'X-Cache': 'Miss from cloudfront', 'Via': '1.1 ec22576e88e707bf58c11e0ee75d019c.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'IAD50-C2', 'X-Amz-Cf-Id': 'FRzKdw7-5Wpr1QItxguhEBPvNp6dM8NGwHdcun4eUu63XODlmorq-g=='})
2024-09-17 19:22:25,839 DEBUG MainThread test.integration_test: … with response body b'{"hits":[],"pagination":{"count":0,"total":0,"size":1,"next":null,"previous":null,"pages":0,"sort":"files.file_size","order":...'
And can reproduce using cURL:
$ curl -X 'GET' 'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22fastq%22%2C%20%22fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10' -H 'accept: application/json' | jq .pagination
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1589 100 1589 0 0 3195 0 --:--:-- --:--:-- --:--:-- 3203
{
"count": 0,
"total": 0,
"size": 10,
"next": null,
"previous": null,
"pages": 0,
"sort": "files.file_id",
"order": "asc"
}
But fixing the filter to include the leading `.` does not produce a different outcome:
$ curl -X 'GET' 'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22.fastq%22%2C%20%22.fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10' -H 'accept: application/json' | jq .pagination
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1615 100 1615 0 0 1909 0 --:--:-- --:--:-- --:--:-- 1908
{
"count": 0,
"total": 0,
"size": 10,
"next": null,
"previous": null,
"pages": 0,
"sort": "files.file_id",
"order": "asc"
}
We can confirm the problem by looking at the `anvil7` catalog instead of `anvil7-it`. Without `.`:
$ curl -X 'GET' 'https://service.explore.anvilproject.org/index/files?catalog=anvil7&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22fastq%22%2C%20%22fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10' -H 'accept: application/json' | jq .pagination
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3867 100 3867 0 0 3526 0 0:00:01 0:00:01 --:--:-- 3528
{
"count": 0,
"total": 0,
"size": 10,
"next": null,
"previous": null,
"pages": 0,
"sort": "files.file_id",
"order": "asc"
}
With `.`:
$ curl -X 'GET' 'https://service.explore.anvilproject.org/index/files?catalog=anvil7&filters=%7B%0A%20%20%22files.file_format%22%3A%20%7B%0A%20%20%20%20%22is%22%3A%20%5B%0A%20%20%20%20%20%20%0A%20%20%20%20%20%20%20%20%22.fastq%22%2C%20%22.fastq.gz%22%20%20%20%20%20%20%0A%20%20%20%20%5D%0A%20%20%7D%0A%7D&size=10' -H 'accept: application/json' | jq .pagination
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 20793 100 20793 0 0 9292 0 0:00:02 0:00:02 --:--:-- 9295
{
"count": 10,
"total": 16312,
"size": 10,
"next": "https://service.explore.anvilproject.org/index/files?catalog=anvil7&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+%5B%22.fastq%22%2C+%22.fastq.gz%22%5D%7D%7D&search_after=%5B%2200221c3c-d6f7-3805-8f37-07ac1059b122%22%2C+%2233431214-b879-475a-96a2-c0a4442b93d6%22%5D&sort=files.file_id&order=asc&size=10",
"previous": null,
"pages": 1632,
"sort": "files.file_id",
"order": "asc"
}
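The four cURL requests above differ only in the catalog and in whether the file formats carry a leading dot. A minimal Python sketch of the same comparison follows. The helper names `files_url` and `total_hits` are hypothetical and not part of Azul; only the endpoint, the query parameters and the `pagination.total` field are taken from the requests and responses shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Service endpoint as it appears in the requests above
SERVICE = 'https://service.explore.anvilproject.org'


def files_url(catalog: str, file_formats: list[str], size: int = 10) -> str:
    """Build an /index/files URL that filters on files.file_format."""
    filters = {'files.file_format': {'is': file_formats}}
    query = urlencode({
        'catalog': catalog,
        'filters': json.dumps(filters),
        'size': size,
    })
    return f'{SERVICE}/index/files?{query}'


def total_hits(url: str) -> int:
    """Fetch the URL and return pagination.total (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)['pagination']['total']
```

Calling `total_hits(files_url('anvil7', ['fastq', 'fastq.gz']))` and `total_hits(files_url('anvil7', ['.fastq', '.fastq.gz']))` should mirror the last two cURL requests, assuming the service is reachable and the catalog contents have not changed since.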
The BI wants to zero out all the files in the 1000G snapshot in Terra Dev, https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1726758007201959.
They started doing this, causing IT to fail: https://gitlab.anvil.gi.ucsc.edu/ucsc/azul/-/jobs/48873
@nadove-ucsc: "There are two pieces to this puzzle: First, fixing the broken filter so that when fastq files are indexed during the IT from sources other than `ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732` we filter for them properly. Second, during IT on AnVIL Dev we will inspect the content length header and the source name, and if the latter matches the 1000G snapshot assert that the former is zero and do not attempt to read from the download."
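The second piece could be sketched roughly as below. This is an illustration only, not the actual IT code: the function name `should_read_download` is hypothetical, and matching the source by substring is an assumption, since the exact form of the source name available to the IT is not shown in this thread.

```python
# Snapshot name as quoted in the comment above
SNAPSHOT_1000G = 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732'


def should_read_download(source_name: str, content_length: int) -> bool:
    """
    Decide whether the test should read from a file download.

    For files from the 1000G snapshot, assert that the reported content
    length is zero and skip reading. For all other sources, read as usual.
    """
    # Assumption: the snapshot name occurs somewhere in the source name
    if SNAPSHOT_1000G in source_name:
        assert content_length == 0, (source_name, content_length)
        return False
    return True
```

The caller would pass in the value of the `Content-Length` response header, converted to an integer, together with the source name of the file being downloaded.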
For demo, review IT logs on GitLab `anvilprod` for an example of a FASTQ file being downloaded. FASTQ files are relatively rare in AnVIL catalogs, so it may take some time for us to observe them being indexed during the IT.
The IT passing will suffice as demo for the lower deployments.
@hannes-ucsc: "Consider looking for IT download requests in the service logs. This would allow you to find downloads in a single query instead of looking through every single IT job."
It uses `['fastq', 'fastq.gz']` for the filter but the AnVIL file formats have a leading dot, as in `['.fastq', '.fastq.gz']`.
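One way to check which formats a logged request actually filtered on is to decode its `filters` query parameter. The sketch below does that for the URL from the IT log quoted earlier in this thread; the helper names `file_format_filter` and `missing_leading_dot` are made up for illustration.

```python
import json
from urllib.parse import parse_qs, urlparse

# URL copied from the IT log entry of 2024-09-17 19:22:25,474
LOGGED_URL = (
    'https://service.explore.anvilproject.org/index/files?catalog=anvil7-it'
    '&filters=%7B%22files.file_format%22%3A+%7B%22is%22%3A+'
    '%5B%22fastq%22%2C+%22fastq.gz%22%5D%7D%7D'
    '&size=1&order=asc&sort=files.file_size'
)


def file_format_filter(url: str) -> list[str]:
    """Return the values of the files.file_format filter in a request URL."""
    query = parse_qs(urlparse(url).query)
    filters = json.loads(query['filters'][0])
    return filters['files.file_format']['is']


def missing_leading_dot(formats: list[str]) -> list[str]:
    """Return the formats that lack the leading dot used by AnVIL."""
    return [fmt for fmt in formats if not fmt.startswith('.')]


formats = file_format_filter(LOGGED_URL)
print(formats)
print(missing_leading_dot(formats))
```

Both formats in the logged request lack the dot, which is consistent with the empty `hits` in the logged response.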