DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Service Lambda timeouts cause user-facing 5xx responses #6284

Open achave11-ucsc opened 3 months ago

achave11-ucsc commented 3 months ago

… during dynamodb.GetItem operation. CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-prod


@timestamp @message
2024-05-21 19:48:44.731 START RequestId: ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 Version: $LATEST
2024-05-21 19:48:44.732 [INFO] 2024-05-21T19:48:44.731Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.chalice Received POST request for '/index/projects', with {"query": {"size": "100"}, "headers": {"accept": "application/json", "accept-encoding": "gzip;q=1.0,deflate;q=0.6,identity;q=0.3", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "396982", "cloudfront-viewer-country": "US", "content-type": "application/json", "host": "service.azul.data.humancellatlas.org", "user-agent": "rest-client/2.1.0 (linux x86_64) ruby/3.1.3p185", "via": "1.1 80d44f43db721c4fa16ae1b35851da00.cloudfront.net (CloudFront)", "x-amz-cf-id": "qDWMtf_kCmO3r7oE6YZmbu7ZWsNdZs8Lj3VXb4BMo71-l4iFJDzOxg==", "x-amzn-trace-id": "Root=1-664cfa9c-41e92e1a1c3836133c007a40", "x-app-id": "single-cell-portal", "x-domain-id": "singlecell.broadinstitute.org", "x-forwarded-for": "34.67.221.52, 15.158.2.77", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2024-05-21 19:48:44.732 [INFO] 2024-05-21T19:48:44.732Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.chalice Did not authenticate request.
2024-05-21 19:48:44.733 [INFO] 2024-05-21T19:48:44.733Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.boto3 dynamodb.GetItem: Making POST request to https://dynamodb.us-east-1.amazonaws.com/
2024-05-21 19:48:44.733 [INFO] 2024-05-21T19:48:44.733Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.boto3 dynamodb.GetItem: … with request body b'{"TableName": "azul-sources_cache_by_auth-prod", "Key": {"identity": {"S": "dcp37:"}}, "ProjectionExpression": "sources,expir...'
2024-05-21 19:49:15.771 2024-05-21T19:49:15.771Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 Task timed out after 31.04 seconds
2024-05-21 19:49:15.771 END RequestId: ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0
2024-05-21 19:49:15.771 REPORT RequestId: ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 Duration: 31040.48 ms Billed Duration: 31000 ms Memory Size: 2048 MB Max Memory Used: 234 MB

This leads to error alarms. This type of error had decreased significantly after https://github.com/DataBiosphere/azul/issues/5467#issuecomment-1912572347, but now appears to be making a comeback.

Note that two similar issues reporting execution errors that lead to alarms already exist as #5622 and #6097; however, their fixes would be ineffective for this type of error. The former makes the forwarder Lambda execution more resilient by increasing the number of error retries from zero to two, and also bumps that Lambda's alarm error threshold to one per day. The latter increases the alarm error threshold of the cache health Lambdas to one per day.

achave11-ucsc commented 2 months ago

Assignee to consider next steps.

hannes-ucsc commented 2 months ago

The response should really be a 503 or a 502. Spike to determine what we actually returned in this case, using the API Gateway execution or access logs.

Based on prior support responses, Amazon's position on this is that we need to expect a certain percentage of Lambda invocations to randomly fail or time out. The SLA says:

AWS will use commercially reasonable efforts to make Lambda available with a Monthly Uptime Percentage for each AWS region, during any monthly billing cycle, of at least 99.95% (the “Service Commitment”). In the event Lambda does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below.

Those 99.95% translate to (1 − 0.9995) × 3600 × 24 × 30 = 1296 seconds of downtime each month. This "uptime" guarantee isn't exactly translatable into an error rate for Lambda invocations: a 1 ms outage could kill an entire 15 min Lambda invocation, and counting that scenario as only 1 ms of downtime would clearly give a lot of leeway to Amazon. In a more balanced interpretation, the average duration of an invocation of our service Lambda function in prod is 1 s. The average concurrency is 1 on most days, though there are days where it is around 5. Assuming a concurrency of 1, we should expect 1296 failed invocations per month, or 43 per day. We're not seeing nearly that many: fewer than one per day.
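For reference, the arithmetic behind those figures:

```python
# Back-of-the-envelope check of the numbers quoted above.
seconds_per_month = 3600 * 24 * 30                    # 2,592,000 s
allowed_downtime = (1 - 0.9995) * seconds_per_month   # ≈ 1296 s of downtime per month
avg_invocation_duration = 1                           # s, service Lambda average in prod
avg_concurrency = 1
failed_per_month = allowed_downtime / avg_invocation_duration * avg_concurrency
print(round(allowed_downtime), round(failed_per_month), round(failed_per_month / 30))
# 1296 1296 43
```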

We could of course increase the alarm threshold, but this issue isn't prevalent enough for that. If we set the threshold to one error per day, we would suppress a potentially legitimate alarm on every day that this type of error doesn't occur. Desensitizing the alarm would therefore only make sense under a much higher average load, one that produces a steadier incident rate for this type of outage.

At the conclusion of the spike, we'll likely leave this open but put it in the Icebox column.

achave11-ucsc commented 2 months ago

The returned status code for this request was a 504, according to the matching API Gateway execution logs.

CloudWatch Logs Insights
region: us-east-1
log-group-names: API-Gateway-Execution-Logs_ccp122p730/prod, /aws/lambda/azul-service-prod, /aws/apigateway/azul-service-prod
start-time: 2024-05-21T19:48:43.760Z
end-time: 2024-05-21T19:49:16.732Z
query-string:

  fields @timestamp, @message, @log
| filter @message like /3a5aa225|ea08f3a5/
| sort @timestamp asc
| limit 1000

@timestamp @message @log
2024-05-21 19:48:44.709 {"accountId": "542754589326", "apiId": "ccp122p730", "authorizer_claims_property": "-", "authorizer_error": "-", "authorizer_principalId": "-", "authorizer_property": "-", "awsEndpointRequestId": "-", "awsEndpointRequestId2": "-", "customDomain_basePathMatched": "(none)", "dataProcessed": "-", "domainName": "service.azul.data.humancellatlas.org", "domainPrefix": "service", "error_message": "Endpoint request timed out", "error_messageString": " "Endpoint request timed out"", "error_responseType": "INTEGRATION_TIMEOUT", "extendedRequestId": "YIwYiFstIAMESqA=", "httpMethod": "POST", "identity_accountId": "-", "identity_caller": "-", "identity_cognitoAuthenticationProvider": "-", "identity_cognitoAuthenticationType": "-", "identity_cognitoIdentityId": "-", "identity_cognitoIdentityPoolId": "-", "identity_principalOrgId": "-", "identity_clientCert_clientCertPem": "-", "identity_clientCert_subjectDN": "-", "identity_clientCert_issuerDN": "-", "identity_clientCert_serialNumber": "-", "identity_clientCert_validity_notBefore": "-", "identity_clientCert_validity_notAfter": "-", "identity_sourceIp": "34.67.221.52", "identity_user": "-", "identity_userAgent": "rest-client/2.1.0 (linux x86_64) ruby/3.1.3p185", "identity_userArn": "-", "integration_error": "Execution failed due to a timeout error", "integration_integrationStatus": "-", "integration_latency": "29000", "integration_requestId": "-", "integration_status": "-", "integrationErrorMessage": "-", "integrationLatency": "29000", "integrationStatus": "-", "path": "/index/projects", "protocol": "HTTP/1.1", "requestId": "3a5aa225-07bb-441a-8526-3b4e85b84132", "requestTime": "21/May/2024:19:48:44 +0000", "requestTimeEpoch": "1716320924709", "responseLatency": "29013", "responseLength": "41", "routeKey": "-", "stage": "prod", "status": "504"} 542754589326:/aws/apigateway/azul-service-prod
2024-05-21 19:48:44.709 (3a5aa225-07bb-441a-8526-3b4e85b84132) Extended Request Id: YIwYiFstIAMESqA= 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:48:44.719 (3a5aa225-07bb-441a-8526-3b4e85b84132) Verifying Usage Plan for request: 3a5aa225-07bb-441a-8526-3b4e85b84132. API Key: API Stage: ccp122p730/prod 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:48:44.721 (3a5aa225-07bb-441a-8526-3b4e85b84132) API Key authorized because method 'POST /index/{entity_type}' does not require API Key. Request will not contribute to throttle or quota limits 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:48:44.721 (3a5aa225-07bb-441a-8526-3b4e85b84132) Usage Plan check succeeded for API Key and API Stage ccp122p730/prod 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:48:44.722 (3a5aa225-07bb-441a-8526-3b4e85b84132) Starting execution for request: 3a5aa225-07bb-441a-8526-3b4e85b84132 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:48:44.722 (3a5aa225-07bb-441a-8526-3b4e85b84132) HTTP Method: POST, Resource Path: /index/projects 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:48:44.731 START RequestId: ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 Version: $LATEST 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:48:44.732 [INFO] 2024-05-21T19:48:44.731Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.chalice Received POST request for '/index/projects', with {"query": {"size": "100"}, "headers": {"accept": "application/json", "accept-encoding": "gzip;q=1.0,deflate;q=0.6,identity;q=0.3", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "396982", "cloudfront-viewer-country": "US", "content-type": "application/json", "host": "service.azul.data.humancellatlas.org", "user-agent": "rest-client/2.1.0 (linux x86_64) ruby/3.1.3p185", "via": "1.1 80d44f43db721c4fa16ae1b35851da00.cloudfront.net (CloudFront)", "x-amz-cf-id": "qDWMtf_kCmO3r7oE6YZmbu7ZWsNdZs8Lj3VXb4BMo71-l4iFJDzOxg==", "x-amzn-trace-id": "Root=1-664cfa9c-41e92e1a1c3836133c007a40", "x-app-id": "single-cell-portal", "x-domain-id": "singlecell.broadinstitute.org", "x-forwarded-for": "34.67.221.52, 15.158.2.77", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}. 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:48:44.732 [INFO] 2024-05-21T19:48:44.732Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.chalice Did not authenticate request. 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:48:44.733 [INFO] 2024-05-21T19:48:44.733Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.boto3 dynamodb.GetItem: Making POST request to https://dynamodb.us-east-1.amazonaws.com/ 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:48:44.733 [INFO] 2024-05-21T19:48:44.733Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 azul.boto3 dynamodb.GetItem: … with request body b'{"TableName": "azul-sources_cache_by_auth-prod", "Key": {"identity": {"S": "dcp37:"}}, "ProjectionExpression": "sources,expir...' 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:49:13.722 (3a5aa225-07bb-441a-8526-3b4e85b84132) Execution failed due to a timeout error 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:49:13.722 (3a5aa225-07bb-441a-8526-3b4e85b84132) Method completed with status: 504 542754589326:API-Gateway-Execution-Logs_ccp122p730/prod
2024-05-21 19:49:15.771 2024-05-21T19:49:15.771Z ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 Task timed out after 31.04 seconds 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:49:15.771 END RequestId: ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 542754589326:/aws/lambda/azul-service-prod
2024-05-21 19:49:15.771 REPORT RequestId: ea08f3a5-c8cc-4d54-a39b-f1d19fd691b0 Duration: 31040.48 ms Billed Duration: 31000 ms Memory Size: 2048 MB Max Memory Used: 234 MB 542754589326:/aws/lambda/azul-service-prod

achave11-ucsc commented 2 months ago

Assignee to consider next steps.

hannes-ucsc commented 1 month ago

We'll document in the OpenAPI spec that requests yielding a 504 status code should be retried. Similarly, we'll add a body to the canned API Gateway response for 504 stating that the event should be retried after obeying the delay specified in the Retry-After header. Lastly, we'll add the Retry-After: 10 header to that response.
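As a rough illustration of the intended client behavior (not part of the spec itself; the endpoint path, attempt limit, and fallback delay below are illustrative assumptions), a client could retry 504 responses like this:

```python
import time

import requests

SERVICE = 'https://service.azul.data.humancellatlas.org'

def post_with_retry(path, max_attempts=3, **kwargs):
    """POST to the service, retrying 504 responses after the delay advertised
    in the response's Retry-After header, falling back to 10 seconds.
    """
    for attempt in range(max_attempts):
        response = requests.post(SERVICE + path, **kwargs)
        if response.status_code != 504:
            return response
        time.sleep(int(response.headers.get('Retry-After', '10')))
    return response

response = post_with_retry('/index/projects', params={'size': 100})
```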

dsotirho-ucsc commented 3 weeks ago

@hannes-ucsc: "We are still getting occasional alarms about timeouts on the service and indexer REST API Lambdas. So to fully address this we need to deal with those as well. Assignee to also inline the default period and threshold for metric alarms so that the call sites can't mix an explicit value with a default one (this means removing the default from the attrs property declaration). Assignee to then increase the period for the service and indexer REST API Lambdas to one per day."
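A minimal sketch of what inlining those defaults could look like, assuming an attrs-based alarm definition (the class, field names, and values here are hypothetical and only illustrate the intent of removing the defaults):

```python
import attrs

@attrs.frozen
class MetricAlarm:
    # No defaults: every call site must spell out both values, so an explicit
    # threshold can no longer be combined with a defaulted period.
    threshold: int
    period: int  # seconds

# Hypothetical call site for the service and indexer REST API Lambda alarms:
rest_api_alarm = MetricAlarm(threshold=1, period=24 * 60 * 60)
```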