HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)
4 stars 2 forks source link

TDR's `GET /repository/v1/snapshots` takes too long #46

Closed hannes-ucsc closed 2 years ago

hannes-ucsc commented 2 years ago

Azul IT failure in sandbox, subtest catalog_complete:

https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/30266#L537

TDR takes 20s for the first page, and 10s into waiting for the second page, the Lambda execution times out and so does API Gateway, hence the 504.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-sandbox, /aws/apigateway/azul-service-sandbox
start-time: 2021-10-03T05:22:20.657Z
end-time: 2021-10-03T05:35:24.325Z
query-string:

fields @timestamp, @message
# | filter (@message like 'Received' and @message like 'GET') or status = 504
| filter @requestId = 'ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c'
| sort @timestamp desc
| limit 200

@timestamp @message
2021-10-03 05:23:36.229 END RequestId: ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c
2021-10-03 05:23:36.229 REPORT RequestId: ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Duration: 31032.17 ms Billed Duration: 31000 ms Memory Size: 1024 MB Max Memory Used: 149 MB
2021-10-03 05:23:36.229 2021-10-03T05:23:36.228Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Task timed out after 31.03 seconds
2021-10-03 05:23:25.546 [DEBUG] 2021-10-03T05:23:25.546Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'offset': 100, 'limit': 100}, headers=None, body=None)
2021-10-03 05:23:25.545 [DEBUG] 2021-10-03T05:23:25.545Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c _request(…) -> b'{"total":138,"filteredTotal":138,"items":[{"id":"93dc1e76-8f1c-4949-8f9b-07a087f3ce7b","name":"WorkflowAnalysisSnapshot_bam","description":"Snapshot for public dataset 1000 Genomes.","createdDate":"2020-04-27T19:45:07.628232Z","profileId":"390e7a85-d47f...'
2021-10-03 05:23:05.209 [DEBUG] 2021-10-03T05:23:05.209Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'offset': 0, 'limit': 100}, headers=None, body=None)
2021-10-03 05:23:05.198 [INFO] 2021-10-03T05:23:05.197Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Authenticated request as OAuth2(access_token='REDACTED')
2021-10-03 05:23:05.197 [INFO] 2021-10-03T05:23:05.197Z ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Received GET request for '/index/files', with query {"catalog": "it2", "filters": "{}", "size": "100"} and headers {"accept-encoding": "identity", "authorization": "Bearer REDACTED", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-country": "US", "host": "service.sandbox.dev.singlecell.gi.ucsc.edu", "user-agent": "python-urllib3/1.26.5", "via": "1.1 ca8d1424de70ce439236d37048e65f54.cloudfront.net (CloudFront)", "x-amz-cf-id": "FbxN6vVYXryNxRL9AkNMfj03HVQSxiUVF119vgzBIIva4M3bO_kuLQ==", "x-amzn-trace-id": "Root=1-61593e39-0fb134250b0f9d29436726d7", "x-forwarded-for": "52.20.215.168, 70.132.60.159", "x-forwarded-port": "443", "x-forwarded-proto": "https"}.
2021-10-03 05:23:05.193 START RequestId: ca6c7324-7f5a-482f-bf00-e0b1ad5a2a7c Version: $LATEST

hannes-ucsc commented 2 years ago

Spike to mine CloudWatch and Gitlab logs for dev and sandbox to see how prevalent this is. I want to understand if this is just a temporary condition or if this request has been slow for a while now. The number of snapshots increased significantly as part of @aherbst-broad team's effort to copy snapshots from prod for #17.

achave11-ucsc commented 2 years ago

Spike report:

Failures per day in dev: CloudWatch Logs Insights region: us-east-1 log-group-names: /aws/lambda/azul-service-dev Relative: 12M query-string:

filter @message like "Task timed out after"
| stats count(*) as exceptionCount by bin(24h) as day
| sort day desc
day exceptionCount
2021-10-09 1
2021-10-06 1
2021-10-05 2
2021-10-04 45
2021-10-03 32
2021-10-02 49
2021-09-30 1
2021-09-28 16
2021-09-26 1
2021-09-25 1
2021-09-21 1
2021-09-20 1
2021-09-19 1
2021-09-17 1
2021-09-14 16
2021-09-13 1
2021-09-03 13
2021-09-02 70
2021-08-09 231
2021-08-07 1049
2021-08-06 1
2021-08-04 4
2021-08-03 9
2021-08-02 14
2021-07-25 3
2021-07-23 3
2021-07-22 3
2021-07-20 1
2021-07-14 2
2021-07-13 2
2021-07-10 2
2021-06-30 2
2021-05-01 1
2021-03-17 1
2020-12-17 8
2020-12-16 4
2020-12-12 1
2020-11-17 11
2020-11-16 6
2020-11-13 1
2020-11-10 1
2020-11-04 1
2020-10-19 1

From the above we can deduce that this sort of failures might have been lingering even before the increase in number of snapshots. Next, a subset of failing executions of 2021-08-09 for dev to make a few observations.

CloudWatch Logs Insights region: us-east-1 log-group-names: /aws/lambda/azul-service-dev start-time: 2021-08-07T13:30:24 end-time: 2021-08-09T12:02:15 query-string:

fields @timestamp, @message
# | filter @message like "Task timed out after"
| filter (@requestId like 'eb8a3f23' 
or @requestId like 'a45564d6'
or @requestId like 'fb5d42f7'
or @requestId like '2ebb6ab8'
or @requestId like 'f68dfbe9'
or @requestId like 'b80cf29a'
or @requestId like 'ee044031'
or @requestId like 'e4ffd794'
or @requestId like 'eb39ea99'
or @requestId like '1f06eb67'
or @requestId like '82437c90'
or @requestId like 'dfdeaf51'
or @requestId like '44e15480'
or @requestId like 'f5d56e44')
| sort @timestamp asc
| limit 23
@timestamp @message
2021-08-09 17:35:02.055 [INFO] 2021-08-09T17:35:02.055Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:02.085 [INFO] 2021-08-09T17:35:02.084Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:02.797 [INFO] 2021-08-09T17:35:02.797Z 82437c90-075e-431e-8ac5-ebc82cde8174 Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:02.945 [DEBUG] 2021-08-09T17:35:02.944Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:03.095 [DEBUG] 2021-08-09T17:35:03.095Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:03.754 [DEBUG] 2021-08-09T17:35:03.753Z 82437c90-075e-431e-8ac5-ebc82cde8174 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:33.086 2021-08-09T17:35:33.086Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 Task timed out after 31.03 seconds

2021-08-09 17:35:33.115 2021-08-09T17:35:33.115Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 Task timed out after 31.03 seconds

2021-08-09 17:35:33.824 2021-08-09T17:35:33.823Z 82437c90-075e-431e-8ac5-ebc82cde8174 Task timed out after 31.03 seconds

2021-08-09 17:35:51.482 [INFO] 2021-08-09T17:35:51.482Z 1f06eb67-a089-4bd3-b927-1c825231f7da Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:51.600 [INFO] 2021-08-09T17:35:51.600Z eb39ea99-c001-4ff2-9d16-cc31479c6895 Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:51.622 [INFO] 2021-08-09T17:35:51.622Z e4ffd794-3d5c-4764-8899-b15e6a986247 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:52.463 [DEBUG] 2021-08-09T17:35:52.463Z 1f06eb67-a089-4bd3-b927-1c825231f7da _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:52.578 [DEBUG] 2021-08-09T17:35:52.578Z eb39ea99-c001-4ff2-9d16-cc31479c6895 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:52.620 [DEBUG] 2021-08-09T17:35:52.619Z e4ffd794-3d5c-4764-8899-b15e6a986247 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:55.405 [INFO] 2021-08-09T17:35:55.405Z f5d56e44-27fe-4369-a95a-0b067a02dc97 Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:55.419 [INFO] 2021-08-09T17:35:55.419Z ee044031-f96b-4dec-b6a8-df7426fee01d Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:55.452 [INFO] 2021-08-09T17:35:55.452Z b80cf29a-76d4-4f30-8c41-dd3c0f33fa22 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:35:56.352 [DEBUG] 2021-08-09T17:35:56.351Z f5d56e44-27fe-4369-a95a-0b067a02dc97 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:56.371 [DEBUG] 2021-08-09T17:35:56.370Z ee044031-f96b-4dec-b6a8-df7426fee01d _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:35:56.431 [DEBUG] 2021-08-09T17:35:56.431Z b80cf29a-76d4-4f30-8c41-dd3c0f33fa22 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None)
2021-08-09 17:36:19.070 [INFO] 2021-08-09T17:36:19.070Z f68dfbe9-e99f-4f2e-9964-1d6b8780e680 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}.
2021-08-09 17:36:19.179 [INFO] 2021-08-09T17:36:19.178Z 2ebb6ab8-5ae0-4029-8cfc-e5ff94bbf9bf Received HEAD request for '/index/files', with query {"size": "1"} and headers {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.22.0", "via": "1.1 284f4720cb0b553d351a28630b21584a.cloudfront.net (CloudFront)", "x-amz-cf-id": "ckhRx3xho-1ghhyI4PAiXrfNFyuzBlVUMh9pLhMnlfqngEQIRgzVIw==", "x-amzn-trace-id": "Root=1-61116793-0da7f27d59016054706444f0", "x-forwarded-for": "3.89.136.165, 70.132.59.82", "x-forwarded-port": "443", "x-forwarded-proto": "https"}.

From the above one is able to see the quick succession in which these failing request are executed. Non failing request in the same interval also need to be accounted. There were no IT failures for dev in GitLab for this observed period. Given that most of the errors happen for HEAD requests with parameter of size=1 (health status checks?), it is acceptable for the errors to not be represented by IT failures.

While the provided data doesn't lead to conclusive results or the actual cause of the failure, a few observations can be made.

  1. Failures happen for health requests (HEAD request w/ size=1) beside normal runs and IT test.
  2. All HEAD request with size=1 seem to list all snapshots from TDR, (is it absolutely necessary?)
  3. Failures seem to appear on requests that happen in quick succession (see times in table above). (Or they failed because they simultaneously requested TDR snapshot listing?).
  4. While listing TDR snapshots through service (indirectly) failed occasionally, the IT specific to that functionality seems to have been consistently successful (visible in sandbox runs).

Infrequent failures in IT for sandbox (which disappear after a re-run or two) seem to be caused by this sporadic failure in TDR snapshot list. This infrequent failures aren't similar in shape, meaning that sometimes catalog_complete fails other times is catalog_complete among other variations. Making it difficult to establish a connection with the cause of the failures.

Failures per day in sandbox: CloudWatch Logs Insights region: us-east-1 log-group-names: /aws/lambda/azul-service-sandbox Relative: 12M query-string:

filter @message like "Task timed out after"
| stats count(*) as exceptionCount by bin(24h) as day
| sort day desc
day exceptionCount
2021-10-09 1
2021-10-06 2
2021-10-04 52
2021-10-03 53
2021-10-02 54
2021-10-01 1
2021-09-29 1
2021-09-28 13
2021-09-26 1
2021-09-24 1
2021-09-19 1
2021-09-16 1
2021-09-14 16
2021-09-03 4
2021-09-02 82
2021-08-09 281
2021-08-07 1203
2021-08-06 2
2021-08-05 3
2021-08-04 8
2021-08-03 9
2021-08-02 7
2021-07-31 1
2021-07-25 2
2021-07-23 2
2021-07-22 4
2021-07-21 1
2021-01-30 4
2021-01-12 1
2021-01-09 29
2021-01-08 47
2020-11-02 1
2020-10-22 1

We can observe that sandbox and dev deployments experience higher failures on the same days.

achave11-ucsc commented 2 years ago

@hannes-ucsc: "Based on @achave11's analysis this issue isn't very prevalent at this time. The change from 7bd4d4e also helps alleviate the issue."