DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Service Lambda times out, causing alarms #5472

Closed achave11-ucsc closed 9 months ago

achave11-ucsc commented 1 year ago

… for a fairly lightweight request. AWS CW Log Insights)

Received HEAD request for '/index/samples', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 f57a09c5455a80253c61001d750462e6.cloudfront.net (CloudFront)", "x-amz-cf-id": "YsGHvYDnzFzj_1cqFTNWvizxUOMo5FPh3CcazfSRAJk-ffxFvc6yBg==", "x-amzn-trace-id": "Root=1-64dcc1f2-659cc21e1897e7485e9ed349", "x-forwarded-for": "3.214.122.182, 15.158.50.151", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
achave11-ucsc commented 1 year ago

The following logs, demonstrate that when the servicecachehealth lambda makes request to the service lambda to ensure the API endpoints are reachable, the service lambda may not complete the request in 10 seconds. This time limit is enforced in the servicecachehealth lambda, so a failure may arise if the service lambda takes longer than 10s (30s time limit) or if times out.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev-servicecachehealth, /aws/lambda/azul-service-dev, /aws/apigateway/azul-service-dev
start-time: 2023-08-16T12:32:45.000Z
end-time: 2023-08-16T12:33:22.000Z
query-string:

fields @timestamp, @message
| filter @message like /START|HEAD|REPORT/
| sort @timestamp asc
| limit 50
@timestamp @message
2023-08-16 12:32:49.721 START RequestId: 9f09e178-8cd3-4701-948a-5f68665d126a Version: $LATEST
2023-08-16 12:32:49.752 [INFO] 2023-08-16T12:32:49.752Z 9f09e178-8cd3-4701-948a-5f68665d126a elasticsearch HEAD https://vpc-azul-index-dev-atlcgwfjbk6dvrtmiaskbs2swa.us-east-1.es.amazonaws.com:443/ [status:200 request:0.030s]
2023-08-16 12:32:50.211 START RequestId: 4b637057-4100-476a-ae3e-b05ecad5ac49 Version: $LATEST
2023-08-16 12:32:50.212 [INFO] 2023-08-16T12:32:50.212Z 4b637057-4100-476a-ae3e-b05ecad5ac49 azul.chalice Received HEAD request for '/index/files', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 9d2dee9b44718f249b789987d2cbe62c.cloudfront.net (CloudFront)", "x-amz-cf-id": "hZVDXgYtxIlqmSp_eFHW1FZgHqnLTudrTbvNFql2h-0scfewAD7EiA==", "x-amzn-trace-id": "Root=1-64dcc1f2-19d982e36c4b33d967c93189", "x-forwarded-for": "3.214.122.182, 15.158.50.164", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-16 12:32:50.273 START RequestId: 48d9883f-ad9b-4798-af7b-5003fedf6849 Version: $LATEST
2023-08-16 12:32:50.274 [INFO] 2023-08-16T12:32:50.273Z 48d9883f-ad9b-4798-af7b-5003fedf6849 azul.chalice Received HEAD request for '/index/projects', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 05133180bbd1649d4b8f97441bf305e8.cloudfront.net (CloudFront)", "x-amz-cf-id": "DF61XHV428nTqus8KPlGOgA8TauvVq0d6EbCX5r3gjwmvui4kzO_Ng==", "x-amzn-trace-id": "Root=1-64dcc1f2-7fb627d96467fa11235d4ba7", "x-forwarded-for": "3.214.122.182, 15.158.50.164", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-16 12:32:50.438 START RequestId: 74ac0b4c-9972-4b45-8825-47f41084628f Version: $LATEST
2023-08-16 12:32:50.438 [INFO] 2023-08-16T12:32:50.438Z 74ac0b4c-9972-4b45-8825-47f41084628f azul.chalice Received HEAD request for '/index/bundles', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 16d910967d343c8da7828222a653755e.cloudfront.net (CloudFront)", "x-amz-cf-id": "iJoU3qzJgG7ah-JfEkiOCPyR7ptPqwt2g7xY4dvyAChcQBJFx3huZg==", "x-amzn-trace-id": "Root=1-64dcc1f2-77668ef10ef8051827540fe8", "x-forwarded-for": "3.214.122.182, 15.158.50.141", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-16 12:32:50.519 START RequestId: 82b42f17-0990-495a-85df-5ef0ee7b3fb8 Version: $LATEST
2023-08-16 12:32:50.519 [INFO] 2023-08-16T12:32:50.519Z 82b42f17-0990-495a-85df-5ef0ee7b3fb8 azul.chalice Received HEAD request for '/index/samples', with {"query": {"size": "1"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.dev.singlecell.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 f57a09c5455a80253c61001d750462e6.cloudfront.net (CloudFront)", "x-amz-cf-id": "YsGHvYDnzFzj_1cqFTNWvizxUOMo5FPh3CcazfSRAJk-ffxFvc6yBg==", "x-amzn-trace-id": "Root=1-64dcc1f2-659cc21e1897e7485e9ed349", "x-forwarded-for": "3.214.122.182, 15.158.50.151", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-16 12:32:50.594 REPORT RequestId: 48d9883f-ad9b-4798-af7b-5003fedf6849 Duration: 320.56 ms Billed Duration: 321 ms Memory Size: 2048 MB Max Memory Used: 169 MB
2023-08-16 12:32:51.212 REPORT RequestId: 74ac0b4c-9972-4b45-8825-47f41084628f Duration: 774.24 ms Billed Duration: 775 ms Memory Size: 2048 MB Max Memory Used: 162 MB
2023-08-16 12:32:51.420 REPORT RequestId: 4b637057-4100-476a-ae3e-b05ecad5ac49 Duration: 1208.25 ms Billed Duration: 1209 ms Memory Size: 2048 MB Max Memory Used: 163 MB
2023-08-16 12:32:59.736 REPORT RequestId: 9f09e178-8cd3-4701-948a-5f68665d126a Duration: 10015.39 ms Billed Duration: 10000 ms Memory Size: 128 MB Max Memory Used: 121 MB
2023-08-16 12:32:59.889 INIT_START Runtime Version: python:3.9.v26 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:130681a0855afedf31b2b3fbcc2fbf1ca62875e0500edb56fd16cad65045b05b
2023-08-16 12:33:21.557 REPORT RequestId: 82b42f17-0990-495a-85df-5ef0ee7b3fb8 Duration: 31037.10 ms Billed Duration: 31000 ms Memory Size: 2048 MB Max Memory Used: 171 MB
2023-08-16 12:33:21.713 INIT_START Runtime Version: python:3.9.v26 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:130681a0855afedf31b2b3fbcc2fbf1ca62875e0500edb56fd16cad65045b05b
achave11-ucsc commented 1 year ago

Spike to determine cause of 10+ second response times from the /index/{entity_type} endpoints.

achave11-ucsc commented 1 year ago

AWS Log Insights)

The execution that caused this error is lacking informative logs. However, after a comparative analysis (reference the logs) one can observe that what should be the same periodic cache generation execution does not consistently emit azul.terra logs. But, because the elasticsearch log is always generated, and is emitted after the azul.terra log, one may infer that this timeout might have been due to Terra, but there is no conclusive logs to demonstrate that.

This ticket should be about determining why the azul.terra emitted log appears to be transient and if that's an issue. Could it be due to caching?

achave11-ucsc commented 1 year ago

@nadove-ucsc: "Add log statements to the source service, analogous to the logs we get when we make request TDR requests."

dsotirho-ucsc commented 1 year ago

@hannes-ucsc: "~Assignee to determine if all of the timeouts occur in response to requests from the servicecachehealth lambda.~ Assignee to also generalize the ticket description and title to reflect that multiple deployments are affected.

dsotirho-ucsc commented 1 year ago

Ttile was updated, assignee to go ahead with adding log statements in SourceService.

achave11-ucsc commented 1 year ago

Priority has been increased to match that of the blocking ticket.

hannes-ucsc commented 11 months ago

For demo of PR #5479, show azul.boto3 logs in service's CW log group in any main deployment. Note that PR is partial, so don't close this issue.

dsotirho-ucsc commented 11 months ago

@hannes-ucsc: "PR #5479 added logging of DynamoDB requests and responses. Hopefully this will illuminate what the Lambda is doing when it times out."

achave11-ucsc commented 10 months ago

Spike to determine prevalence in production for the last four weeks. Specifically HEAD requests timing out.

achave11-ucsc commented 10 months ago

The service lambda in prod, does have some HEAD request timing out, but seems like it's due to unrelated reasons.

The cause to this seems to be the issue ~'Elevated rate of Terra DRS timeouts in prod #5870.'~ (edit: Different rate limits depending on HTTP method · Issue #5555 · DataBiosphere/azul )

The following is an exhaustive count of executions that timed out during the past four weeks, the observed burst are all happening less than 6 minutes apart from each other. The HEAD request that timed out during this period were likely due to the ES instance being overwhelmed with manifest requests.

CloudWatch Logs Insights)
region: us-east-1
log-group-names: /aws/lambda/azul-service-prod
start-time: -2419200s
end-time: 0s
query-string:

  fields @timestamp, @message, @logStream, @log
| filter @requestId like /6e9e5735|af6298cf|b3e29164|c0e9ffc4|8dbdf814|f4d73c3a|2ef006dc|620bbfad|6363214e|552265ad|7ad0dfe4|589df13c|0931845d|a384aae5|57180d3f|3e9067bc|c913fe3b|002c5d1c|9022a4be|d4263dd0|28fb7c5c|7147aeec|bb83145a|e8b50409|df94e347|c66648eb|ab9aef15|72b79cc7|7158aac1|c23f826f|f754ceef|a3b3cb04|162dbc4f|0b439e28|ae625dc7|31084dfe|b99c45de|38c40734|ae43fca3|59aeb62e|2bbbe543|ac471505|1ee86c43|2f5cf791|778bf8b6|bb40f904|704753c1|2fff3182|3efee156|1b691fd1|70e8e580|69341fd8|d9eff869|fedbedc9|0bc53a0e|63bdc1f4|b1870a68|3e4fcdf7|9f0960e3|cf0055ad|9bfc6edb|de01c6e6|832353e0|ea1e9172|85a36375|99f7b22f|a2188d49|644494cc|f8d422db|c68f074f|3192a652|8b6760be|7f3c2a83|cb2dfdf0|7384402c|33635c2b|e23f6329|93c2856b|594a42c1|808dbd4c|80d167ef|12c6017c|71294e7e|e9cc3a3d|8ab12a90|573a9212|144ccdcf|e869cbdb|6f6fead8|e681a0c3|9406e645|4a2d97d3|f45955c9|dfbb4fde|bae9243b|298c37a0|03704e95|af7bbc7f|aa0cf04d|4ceaed70|6a44d518|2c6a74db|b8b1f47b|752c4d32|96859c95|d4659c96|8a2b32cd|e7f0cde2|3b9e3c09|00375f55|fcc16958|75862a42|154f3b93|f98e81af|3d359633|eabd1168|b69c9616|b529908b|b47ded7f|d57833a7|c1b080e2|b276efe7|d2c6ae50|89ae25c5|6f00948f|aa815900|e692206a|85851601|5ceabf90|ec0ebcad|19d7051f|6282cbfc|4f25f581|bca730bd|503d4e63|ab3d6304|d3f5cec6|c7300537|e326cc47|4bd9320f|2fc86d20|1eef960a|9275c8d7|8a7fb595|8b39c3d1|c30dc815|f2f06b68|e76e9d09|65163d36|3fde56a8|faf7654a|ae4f44ae|4d423037|13c82c3a|4dfaad4b|a5f58c47|bb353a66|c9349aa0|389f2788|fde7550b|64e5322e|682930eb|eb8cf14b|0549ce06|90af5293|7eaa9074|1da5b7ae|c160bff4|7f593319|43dce97a|46a4e492|7a5e4e09|fbcce0a5|5212355c|7a35bf9a|d1d1f720|96ebad91|4e07f9a8|00b7af9a|ea959e7a|dab74885|598fbeb7|1af467fe|6c143012|e2b974a5|5e32580a|e1d3bde4|67a3b383|cbb71520|7b1dea98|127eba29|7b56ce8f|fdafee16|a9d4b183|2bbb4880|7a1eddfc|2f135b58|972f819c|100a3574|a5b5a6a6|0b12c4b8|bd3376f0|4c1621f1|af73af43|e41bce77|7b0b7249|d799c9c6|377b78cb|eb6864ad|3fb7716c|66784134|0d5fe49e|934c230f|ca78da58|642a09eb|973dc0c5|f83ca648|cb769322|8a3f65af|4b768eb0|718923fd|a6910f4e|9b7b9e9f|17c36bf6|263f3ff1|46493824|c58eb105|f7e138fb|e2eccfa9|5effc0b3|dd3afb7b|1d30d69a|d061f1a4|03dde8fd|5150051d|b5f951dc|e43a7bbd|042f9be1|c5857a66|142a223f|ba5710da|44ec97fc|4ed4545c|61670ca6|dbfb37e0|0fee2e00|a451499d|e31c1cdb|ead48514|f6abe0d1|5b6295ed|3fc87383|9d9e1a46|e338d4fe|84beac76|a8d4e335|b35a5dff|b8099fc0|758b9b6b|f74fb81f|0fd69c75|29820d23|ac203f52|80e55890|b2e2c115|6a647a8d|4098c744|f28f5f16|76dc1d4d|683e1761|a7c628a1|7f5a25ad/
| filter @message like 'Received'
| parse @message 'Received * request for *, with' as method, path
| parse `headers.x-forwarded-for` '*,' as ip
| stats count(*) as count by method, path, bin(5min) as time
| sort time, count desc

method path time count
PUT '/fetch/manifest/files' 2024-01-11 20:55:00.000 61
HEAD '/index/projects' 2024-01-11 20:55:00.000 1
PUT '/fetch/manifest/files' 2024-01-11 21:00:00.000 61
GET '/index/projects' 2024-01-11 21:00:00.000 6
GET '/index/projects/3089d311-f9ed-44dd-bb10-397059bad4dc' 2024-01-11 21:00:00.000 3
GET '/index/files' 2024-01-11 21:00:00.000 1
HEAD '/index/files' 2024-01-11 21:00:00.000 1
GET '/index/summary' 2024-01-11 21:00:00.000 1
GET '/index/samples' 2024-01-11 21:00:00.000 1
PUT '/fetch/manifest/files' 2024-01-15 13:50:00.000 130
GET '/index/projects/f6133d2a-9f3d-4ef9-9c19-c23d6c7e6cc0' 2024-01-15 13:50:00.000 2
HEAD '/index/samples' 2024-01-15 13:50:00.000 2
HEAD '/index/bundles' 2024-01-15 13:50:00.000 1
GET '/index/projects' 2024-01-15 13:50:00.000 1

EDIT: (@achave11-ucsc) Preceding this edit, the reference to the culprit for these false positives was incorrect, it's now fixed.

achave11-ucsc commented 10 months ago

FWIW, there are no logs matching task timeouts in dev (account mentioned in issue description).

dsotirho-ucsc commented 9 months ago

@hannes-ucsc: "As written, the above comments don't make sense. The HEAD request does not invoke TDR DRS endpoints. Assignee to clarify."

achave11-ucsc commented 9 months ago

Reference to the culprit of the false positive HEAD task timeout executions in prod has been updated, now pointing at the correct one (https://github.com/DataBiosphere/azul/issues/5555#issuecomment-1889983058).

dsotirho-ucsc commented 9 months ago

@hannes-ucsc: "According to the spike results the prevalence is much reduced. Addressing #5555 will hopefully eliminate the remaining occurrences. Because of that and despite PR #5479 being partial, there is nothing else we need to do for this."