Closed achave11-ucsc closed 9 months ago
The following logs, demonstrate that when the servicecachehealth
lambda makes request to the service
lambda to ensure the API endpoints are reachable, the service
lambda may not complete the request in 10 seconds. This time limit is enforced in the servicecachehealth
lambda, so a failure may arise if the service
lambda takes longer than 10s (30s time limit) or if times out.
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev-servicecachehealth, /aws/lambda/azul-service-dev, /aws/apigateway/azul-service-dev
start-time: 2023-08-16T12:32:45.000Z
end-time: 2023-08-16T12:33:22.000Z
query-string:
fields @timestamp, @message
| filter @message like /START|HEAD|REPORT/
| sort @timestamp asc
| limit 50
@timestamp | @message |
---|---|
2023-08-16 12:32:49.721 | START RequestId: 9f09e178-8cd3-4701-948a-5f68665d126a Version: $LATEST |
2023-08-16 12:32:49.752 | [INFO] 2023-08-16T12:32:49.752Z 9f09e178-8cd3-4701-948a-5f68665d126a elasticsearch HEAD https://vpc-azul-index-dev-atlcgwfjbk6dvrtmiaskbs2swa.us-east-1.es.amazonaws.com:443/ [status:200 request:0.030s] |
2023-08-16 12:32:50.211 | START RequestId: 4b637057-4100-476a-ae3e-b05ecad5ac49 Version: $LATEST |
2023-08-16 12:32:50.212 | [INFO] 2023-08-16T12:32:50.212Z 4b637057-4100-476a-ae3e-b05ecad5ac49 azul.chalice Received HEAD request for '/index/files', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 9d2dee9b44718f249b789987d2cbe62c.cloudfront.net (CloudFront)", "x-amz-cf-id": "hZVDXgYtxIlqmSp_eFHW1FZgHqnLTudrTbvNFql2h-0scfewAD7EiA==", "x-amzn-trace-id": "Root=1-64dcc1f2-19d982e36c4b33d967c93189", "x-forwarded-for": "3.214.122.182, 15.158.50.164", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}. |
2023-08-16 12:32:50.273 | START RequestId: 48d9883f-ad9b-4798-af7b-5003fedf6849 Version: $LATEST |
2023-08-16 12:32:50.274 | [INFO] 2023-08-16T12:32:50.273Z 48d9883f-ad9b-4798-af7b-5003fedf6849 azul.chalice Received HEAD request for '/index/projects', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 05133180bbd1649d4b8f97441bf305e8.cloudfront.net (CloudFront)", "x-amz-cf-id": "DF61XHV428nTqus8KPlGOgA8TauvVq0d6EbCX5r3gjwmvui4kzO_Ng==", "x-amzn-trace-id": "Root=1-64dcc1f2-7fb627d96467fa11235d4ba7", "x-forwarded-for": "3.214.122.182, 15.158.50.164", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}. |
2023-08-16 12:32:50.438 | START RequestId: 74ac0b4c-9972-4b45-8825-47f41084628f Version: $LATEST |
2023-08-16 12:32:50.438 | [INFO] 2023-08-16T12:32:50.438Z 74ac0b4c-9972-4b45-8825-47f41084628f azul.chalice Received HEAD request for '/index/bundles', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 16d910967d343c8da7828222a653755e.cloudfront.net (CloudFront)", "x-amz-cf-id": "iJoU3qzJgG7ah-JfEkiOCPyR7ptPqwt2g7xY4dvyAChcQBJFx3huZg==", "x-amzn-trace-id": "Root=1-64dcc1f2-77668ef10ef8051827540fe8", "x-forwarded-for": "3.214.122.182, 15.158.50.141", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}. |
2023-08-16 12:32:50.519 | START RequestId: 82b42f17-0990-495a-85df-5ef0ee7b3fb8 Version: $LATEST |
2023-08-16 12:32:50.519 | [INFO] 2023-08-16T12:32:50.519Z 82b42f17-0990-495a-85df-5ef0ee7b3fb8 azul.chalice Received HEAD request for '/index/samples', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 f57a09c5455a80253c61001d750462e6.cloudfront.net (CloudFront)", "x-amz-cf-id": "YsGHvYDnzFzj_1cqFTNWvizxUOMo5FPh3CcazfSRAJk-ffxFvc6yBg==", "x-amzn-trace-id": "Root=1-64dcc1f2-659cc21e1897e7485e9ed349", "x-forwarded-for": "3.214.122.182, 15.158.50.151", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}. |
2023-08-16 12:32:50.594 | REPORT RequestId: 48d9883f-ad9b-4798-af7b-5003fedf6849 Duration: 320.56 ms Billed Duration: 321 ms Memory Size: 2048 MB Max Memory Used: 169 MB |
2023-08-16 12:32:51.212 | REPORT RequestId: 74ac0b4c-9972-4b45-8825-47f41084628f Duration: 774.24 ms Billed Duration: 775 ms Memory Size: 2048 MB Max Memory Used: 162 MB |
2023-08-16 12:32:51.420 | REPORT RequestId: 4b637057-4100-476a-ae3e-b05ecad5ac49 Duration: 1208.25 ms Billed Duration: 1209 ms Memory Size: 2048 MB Max Memory Used: 163 MB |
2023-08-16 12:32:59.736 | REPORT RequestId: 9f09e178-8cd3-4701-948a-5f68665d126a Duration: 10015.39 ms Billed Duration: 10000 ms Memory Size: 128 MB Max Memory Used: 121 MB |
2023-08-16 12:32:59.889 | INIT_START Runtime Version: python:3.9.v26 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:130681a0855afedf31b2b3fbcc2fbf1ca62875e0500edb56fd16cad65045b05b |
2023-08-16 12:33:21.557 | REPORT RequestId: 82b42f17-0990-495a-85df-5ef0ee7b3fb8 Duration: 31037.10 ms Billed Duration: 31000 ms Memory Size: 2048 MB Max Memory Used: 171 MB |
2023-08-16 12:33:21.713 | INIT_START Runtime Version: python:3.9.v26 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:130681a0855afedf31b2b3fbcc2fbf1ca62875e0500edb56fd16cad65045b05b |
Spike to determine cause of 10+ second response times from the /index/{entity_type}
endpoints.
The execution that caused this error is lacking informative logs. However, after a comparative analysis (reference the logs) one can observe that what should be the same periodic cache generation execution does not consistently emit azul.terra
logs. But, because the elasticsearch
log is always generated, and is emitted after the azul.terra
log, one may infer that this timeout might have been due to Terra, but there is no conclusive logs to demonstrate that.
This ticket should be about determining why the azul.terra
emitted log appears to be transient and if that's an issue. Could it be due to caching?
@nadove-ucsc: "Add log statements to the source service, analogous to the logs we get when we make request TDR requests."
@hannes-ucsc: "~Assignee to determine if all of the timeouts occur in response to requests from the servicecachehealth lambda.~ Assignee to also generalize the ticket description and title to reflect that multiple deployments are affected.
Ttile was updated, assignee to go ahead with adding log statements in SourceService
.
Priority has been increased to match that of the blocking ticket.
For demo of PR #5479, show azul.boto3
logs in service's CW log group in any main deployment. Note that PR is partial, so don't close this issue.
@hannes-ucsc: "PR #5479 added logging of DynamoDB requests and responses. Hopefully this will illuminate what the Lambda is doing when it times out."
Spike to determine prevalence in production for the last four weeks. Specifically HEAD requests timing out.
The service lambda in prod
, does have some HEAD request timing out, but seems like it's due to unrelated reasons.
The cause to this seems to be the issue ~'Elevated rate of Terra DRS timeouts in prod #5870.'~ (edit: Different rate limits depending on HTTP method · Issue #5555 · DataBiosphere/azul )
The following is an exhaustive count of executions that timed out during the past four weeks, the observed burst are all happening less than 6 minutes apart from each other. The HEAD request that timed out during this period were likely due to the ES instance being overwhelmed with manifest requests.
CloudWatch Logs Insights)
region: us-east-1
log-group-names: /aws/lambda/azul-service-prod
start-time: -2419200s
end-time: 0s
query-string:
fields @timestamp, @message, @logStream, @log
| filter @requestId like /6e9e5735|af6298cf|b3e29164|c0e9ffc4|8dbdf814|f4d73c3a|2ef006dc|620bbfad|6363214e|552265ad|7ad0dfe4|589df13c|0931845d|a384aae5|57180d3f|3e9067bc|c913fe3b|002c5d1c|9022a4be|d4263dd0|28fb7c5c|7147aeec|bb83145a|e8b50409|df94e347|c66648eb|ab9aef15|72b79cc7|7158aac1|c23f826f|f754ceef|a3b3cb04|162dbc4f|0b439e28|ae625dc7|31084dfe|b99c45de|38c40734|ae43fca3|59aeb62e|2bbbe543|ac471505|1ee86c43|2f5cf791|778bf8b6|bb40f904|704753c1|2fff3182|3efee156|1b691fd1|70e8e580|69341fd8|d9eff869|fedbedc9|0bc53a0e|63bdc1f4|b1870a68|3e4fcdf7|9f0960e3|cf0055ad|9bfc6edb|de01c6e6|832353e0|ea1e9172|85a36375|99f7b22f|a2188d49|644494cc|f8d422db|c68f074f|3192a652|8b6760be|7f3c2a83|cb2dfdf0|7384402c|33635c2b|e23f6329|93c2856b|594a42c1|808dbd4c|80d167ef|12c6017c|71294e7e|e9cc3a3d|8ab12a90|573a9212|144ccdcf|e869cbdb|6f6fead8|e681a0c3|9406e645|4a2d97d3|f45955c9|dfbb4fde|bae9243b|298c37a0|03704e95|af7bbc7f|aa0cf04d|4ceaed70|6a44d518|2c6a74db|b8b1f47b|752c4d32|96859c95|d4659c96|8a2b32cd|e7f0cde2|3b9e3c09|00375f55|fcc16958|75862a42|154f3b93|f98e81af|3d359633|eabd1168|b69c9616|b529908b|b47ded7f|d57833a7|c1b080e2|b276efe7|d2c6ae50|89ae25c5|6f00948f|aa815900|e692206a|85851601|5ceabf90|ec0ebcad|19d7051f|6282cbfc|4f25f581|bca730bd|503d4e63|ab3d6304|d3f5cec6|c7300537|e326cc47|4bd9320f|2fc86d20|1eef960a|9275c8d7|8a7fb595|8b39c3d1|c30dc815|f2f06b68|e76e9d09|65163d36|3fde56a8|faf7654a|ae4f44ae|4d423037|13c82c3a|4dfaad4b|a5f58c47|bb353a66|c9349aa0|389f2788|fde7550b|64e5322e|682930eb|eb8cf14b|0549ce06|90af5293|7eaa9074|1da5b7ae|c160bff4|7f593319|43dce97a|46a4e492|7a5e4e09|fbcce0a5|5212355c|7a35bf9a|d1d1f720|96ebad91|4e07f9a8|00b7af9a|ea959e7a|dab74885|598fbeb7|1af467fe|6c143012|e2b974a5|5e32580a|e1d3bde4|67a3b383|cbb71520|7b1dea98|127eba29|7b56ce8f|fdafee16|a9d4b183|2bbb4880|7a1eddfc|2f135b58|972f819c|100a3574|a5b5a6a6|0b12c4b8|bd3376f0|4c1621f1|af73af43|e41bce77|7b0b7249|d799c9c6|377b78cb|eb6864ad|3fb7716c|66784134|0d5fe49e|934c230f|ca78da58|642a09eb|973dc0c5|f83ca648|cb769322|8a3f65af|4b768eb0|718923fd|a6910f4e|9b7b9e9f|17c36bf6|263f3ff1|46493824|c58eb105|f7e138fb|e2eccfa9|5effc0b3|dd3afb7b|1d30d69a|d061f1a4|03dde8fd|5150051d|b5f951dc|e43a7bbd|042f9be1|c5857a66|142a223f|ba5710da|44ec97fc|4ed4545c|61670ca6|dbfb37e0|0fee2e00|a451499d|e31c1cdb|ead48514|f6abe0d1|5b6295ed|3fc87383|9d9e1a46|e338d4fe|84beac76|a8d4e335|b35a5dff|b8099fc0|758b9b6b|f74fb81f|0fd69c75|29820d23|ac203f52|80e55890|b2e2c115|6a647a8d|4098c744|f28f5f16|76dc1d4d|683e1761|a7c628a1|7f5a25ad/
| filter @message like 'Received'
| parse @message 'Received * request for *, with' as method, path
| parse `headers.x-forwarded-for` '*,' as ip
| stats count(*) as count by method, path, bin(5min) as time
| sort time, count desc
method | path | time | count |
---|---|---|---|
PUT | '/fetch/manifest/files' | 2024-01-11 20:55:00.000 | 61 |
HEAD | '/index/projects' | 2024-01-11 20:55:00.000 | 1 |
PUT | '/fetch/manifest/files' | 2024-01-11 21:00:00.000 | 61 |
GET | '/index/projects' | 2024-01-11 21:00:00.000 | 6 |
GET | '/index/projects/3089d311-f9ed-44dd-bb10-397059bad4dc' | 2024-01-11 21:00:00.000 | 3 |
GET | '/index/files' | 2024-01-11 21:00:00.000 | 1 |
HEAD | '/index/files' | 2024-01-11 21:00:00.000 | 1 |
GET | '/index/summary' | 2024-01-11 21:00:00.000 | 1 |
GET | '/index/samples' | 2024-01-11 21:00:00.000 | 1 |
PUT | '/fetch/manifest/files' | 2024-01-15 13:50:00.000 | 130 |
GET | '/index/projects/f6133d2a-9f3d-4ef9-9c19-c23d6c7e6cc0' | 2024-01-15 13:50:00.000 | 2 |
HEAD | '/index/samples' | 2024-01-15 13:50:00.000 | 2 |
HEAD | '/index/bundles' | 2024-01-15 13:50:00.000 | 1 |
GET | '/index/projects' | 2024-01-15 13:50:00.000 | 1 |
EDIT: (@achave11-ucsc) Preceding this edit, the reference to the culprit for these false positives was incorrect, it's now fixed.
FWIW, there are no logs matching task timeouts in dev
(account mentioned in issue description).
@hannes-ucsc: "As written, the above comments don't make sense. The HEAD request does not invoke TDR DRS endpoints. Assignee to clarify."
Reference to the culprit of the false positive HEAD task timeout executions in prod
has been updated, now pointing at the correct one (https://github.com/DataBiosphere/azul/issues/5555#issuecomment-1889983058).
@hannes-ucsc: "According to the spike results the prevalence is much reduced. Addressing #5555 will hopefully eliminate the remaining occurrences. Because of that and despite PR #5479 being partial, there is nothing else we need to do for this."
… for a fairly lightweight request. AWS CW Log Insights)