Closed hannes-ucsc closed 2 years ago
Spike to mine CloudWatch and Gitlab logs for dev
and sandbox
to see how prevalent this is. I want to understand if this is just a temporary condition or if this request has been slow for a while now. The number of snapshots increased significantly as part of @aherbst-broad team's effort to copy snapshots from prod for #17.
Spike report:
Failures per day in dev
:
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev
Relative: 12M
query-string:
filter @message like "Task timed out after"
| stats count(*) as exceptionCount by bin(24h) as day
| sort day desc
day | exceptionCount |
---|---|
2021-10-09 | 1 |
2021-10-06 | 1 |
2021-10-05 | 2 |
2021-10-04 | 45 |
2021-10-03 | 32 |
2021-10-02 | 49 |
2021-09-30 | 1 |
2021-09-28 | 16 |
2021-09-26 | 1 |
2021-09-25 | 1 |
2021-09-21 | 1 |
2021-09-20 | 1 |
2021-09-19 | 1 |
2021-09-17 | 1 |
2021-09-14 | 16 |
2021-09-13 | 1 |
2021-09-03 | 13 |
2021-09-02 | 70 |
2021-08-09 | 231 |
2021-08-07 | 1049 |
2021-08-06 | 1 |
2021-08-04 | 4 |
2021-08-03 | 9 |
2021-08-02 | 14 |
2021-07-25 | 3 |
2021-07-23 | 3 |
2021-07-22 | 3 |
2021-07-20 | 1 |
2021-07-14 | 2 |
2021-07-13 | 2 |
2021-07-10 | 2 |
2021-06-30 | 2 |
2021-05-01 | 1 |
2021-03-17 | 1 |
2020-12-17 | 8 |
2020-12-16 | 4 |
2020-12-12 | 1 |
2020-11-17 | 11 |
2020-11-16 | 6 |
2020-11-13 | 1 |
2020-11-10 | 1 |
2020-11-04 | 1 |
2020-10-19 | 1 |
From the above we can deduce that this sort of failures might have been lingering even before the increase in number of snapshots. Next, a subset of failing executions of 2021-08-09 for dev
to make a few observations.
CloudWatch Logs Insights region: us-east-1 log-group-names: /aws/lambda/azul-service-dev start-time: 2021-08-07T13:30:24 end-time: 2021-08-09T12:02:15 query-string:
fields @timestamp, @message
# | filter @message like "Task timed out after"
| filter (@requestId like 'eb8a3f23'
or @requestId like 'a45564d6'
or @requestId like 'fb5d42f7'
or @requestId like '2ebb6ab8'
or @requestId like 'f68dfbe9'
or @requestId like 'b80cf29a'
or @requestId like 'ee044031'
or @requestId like 'e4ffd794'
or @requestId like 'eb39ea99'
or @requestId like '1f06eb67'
or @requestId like '82437c90'
or @requestId like 'dfdeaf51'
or @requestId like '44e15480'
or @requestId like 'f5d56e44')
| sort @timestamp asc
| limit 23
@timestamp | @message |
---|---|
2021-08-09 17:35:02.055 | [INFO] 2021-08-09T17:35:02.055Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:02.085 | [INFO] 2021-08-09T17:35:02.084Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:02.797 | [INFO] 2021-08-09T17:35:02.797Z 82437c90-075e-431e-8ac5-ebc82cde8174 Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:02.945 | [DEBUG] 2021-08-09T17:35:02.944Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:03.095 | [DEBUG] 2021-08-09T17:35:03.095Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:03.754 | [DEBUG] 2021-08-09T17:35:03.753Z 82437c90-075e-431e-8ac5-ebc82cde8174 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:33.086 | 2021-08-09T17:35:33.086Z 44e15480-dc1d-4173-ad4f-0454f74de8b9 Task timed out after 31.03 seconds |
2021-08-09 17:35:33.115 | 2021-08-09T17:35:33.115Z dfdeaf51-44dc-40fe-9819-6528ee9f9698 Task timed out after 31.03 seconds |
2021-08-09 17:35:33.824 | 2021-08-09T17:35:33.823Z 82437c90-075e-431e-8ac5-ebc82cde8174 Task timed out after 31.03 seconds |
2021-08-09 17:35:51.482 | [INFO] 2021-08-09T17:35:51.482Z 1f06eb67-a089-4bd3-b927-1c825231f7da Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:51.600 | [INFO] 2021-08-09T17:35:51.600Z eb39ea99-c001-4ff2-9d16-cc31479c6895 Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:51.622 | [INFO] 2021-08-09T17:35:51.622Z e4ffd794-3d5c-4764-8899-b15e6a986247 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:52.463 | [DEBUG] 2021-08-09T17:35:52.463Z 1f06eb67-a089-4bd3-b927-1c825231f7da _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:52.578 | [DEBUG] 2021-08-09T17:35:52.578Z eb39ea99-c001-4ff2-9d16-cc31479c6895 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:52.620 | [DEBUG] 2021-08-09T17:35:52.619Z e4ffd794-3d5c-4764-8899-b15e6a986247 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:55.405 | [INFO] 2021-08-09T17:35:55.405Z f5d56e44-27fe-4369-a95a-0b067a02dc97 Received HEAD request for '/index/samples', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:55.419 | [INFO] 2021-08-09T17:35:55.419Z ee044031-f96b-4dec-b6a8-df7426fee01d Received HEAD request for '/index/files', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:55.452 | [INFO] 2021-08-09T17:35:55.452Z b80cf29a-76d4-4f30-8c41-dd3c0f33fa22 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}. |
2021-08-09 17:35:56.352 | [DEBUG] 2021-08-09T17:35:56.351Z f5d56e44-27fe-4369-a95a-0b067a02dc97 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:56.371 | [DEBUG] 2021-08-09T17:35:56.370Z ee044031-f96b-4dec-b6a8-df7426fee01d _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:35:56.431 | [DEBUG] 2021-08-09T17:35:56.431Z b80cf29a-76d4-4f30-8c41-dd3c0f33fa22 _request('GET', 'https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots', fields={'limit': '100'}, headers=None, body=None) |
2021-08-09 17:36:19.070 | [INFO] 2021-08-09T17:36:19.070Z f68dfbe9-e99f-4f2e-9964-1d6b8780e680 Received HEAD request for '/index/bundles', with query {"size": "1"} and headers {…}. |
2021-08-09 17:36:19.179 | [INFO] 2021-08-09T17:36:19.178Z 2ebb6ab8-5ae0-4029-8cfc-e5ff94bbf9bf Received HEAD request for '/index/files', with query {"size": "1"} and headers {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.22.0", "via": "1.1 284f4720cb0b553d351a28630b21584a.cloudfront.net (CloudFront)", "x-amz-cf-id": "ckhRx3xho-1ghhyI4PAiXrfNFyuzBlVUMh9pLhMnlfqngEQIRgzVIw==", "x-amzn-trace-id": "Root=1-61116793-0da7f27d59016054706444f0", "x-forwarded-for": "3.89.136.165, 70.132.59.82", "x-forwarded-port": "443", "x-forwarded-proto": "https"}. |
From the above one is able to see the quick succession in which these failing request are executed. Non failing request in the same interval also need to be accounted. There were no IT failures for dev
in GitLab for this observed period. Given that most of the errors happen for HEAD requests with parameter of size=1 (health status checks?), it is acceptable for the errors to not be represented by IT failures.
While the provided data doesn't lead to conclusive results or the actual cause of the failure, a few observations can be made.
sandbox
runs).Infrequent failures in IT for sandbox
(which disappear after a re-run or two) seem to be caused by this sporadic failure in TDR snapshot list. This infrequent failures aren't similar in shape, meaning that sometimes catalog_complete
fails other times is catalog_complete
among other variations. Making it difficult to establish a connection with the cause of the failures.
Failures per day in sandbox
:
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-sandbox
Relative: 12M
query-string:
filter @message like "Task timed out after"
| stats count(*) as exceptionCount by bin(24h) as day
| sort day desc
day | exceptionCount |
---|---|
2021-10-09 | 1 |
2021-10-06 | 2 |
2021-10-04 | 52 |
2021-10-03 | 53 |
2021-10-02 | 54 |
2021-10-01 | 1 |
2021-09-29 | 1 |
2021-09-28 | 13 |
2021-09-26 | 1 |
2021-09-24 | 1 |
2021-09-19 | 1 |
2021-09-16 | 1 |
2021-09-14 | 16 |
2021-09-03 | 4 |
2021-09-02 | 82 |
2021-08-09 | 281 |
2021-08-07 | 1203 |
2021-08-06 | 2 |
2021-08-05 | 3 |
2021-08-04 | 8 |
2021-08-03 | 9 |
2021-08-02 | 7 |
2021-07-31 | 1 |
2021-07-25 | 2 |
2021-07-23 | 2 |
2021-07-22 | 4 |
2021-07-21 | 1 |
2021-01-30 | 4 |
2021-01-12 | 1 |
2021-01-09 | 29 |
2021-01-08 | 47 |
2020-11-02 | 1 |
2020-10-22 | 1 |
We can observe that sandbox
and dev
deployments experience higher failures on the same days.
@hannes-ucsc: "Based on @achave11's analysis this issue isn't very prevalent at this time. The change from 7bd4d4e also helps alleviate the issue."
Azul IT failure in sandbox, subtest
catalog_complete
:https://gitlab.dev.singlecell.gi.ucsc.edu/ucsc/azul/-/jobs/30266#L537
TDR takes 20s for the first page, and 10s into waiting for the second page, the Lambda execution times out and so does API Gateway, hence the 504.
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-sandbox, /aws/apigateway/azul-service-sandbox
start-time: 2021-10-03T05:22:20.657Z
end-time: 2021-10-03T05:35:24.325Z
query-string: