Related to devex 1236 (/maintenance-mode). Validation of the fix, after the DB upgrade, will come when 1236-devex-pr-to-staging is merged. Since the previous deployment, there have been a number of 504 errors, but they are all related to a newly created bug. The intermittent ones on various endpoints appear to have been solved by the upgrade of DynamoDB to v3 and tuning of its config.
Describe the Bug
We are observing a small number of 504s on various endpoints. This causes the application to fail, because it expects a healthy 200 status code along with the information it requested. Work has been done previously to improve reliability, but a problem still exists and warrants further investigation.
We observed this while running smoke tests, and we have also seen the 504s in the Kibana logs.
The 504s are occurring because the Lambda is using the full 29 seconds, and API Gateway gives up and returns a 504.
It's unclear why the Lambdas are using the full 29 seconds. When we instantiate the DynamoDB client, we specify config to make requests time out after 5 seconds of inactivity, and after that timeout it's configured to retry a maximum of 3 times. That should get the job done well within the allotted 29 seconds unless multiple consecutive failures occur, which is unlikely for a service with 99%+ uptime.
Config in ApplicationContext
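The config itself isn't reproduced in this issue, but the intent described above would look roughly like this with the AWS SDK for JavaScript v3. This is a sketch, not the actual ApplicationContext code; the region, package choice, and exact values are assumptions:

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";

// Sketch of the intended settings: fail a request after 5 s of inactivity,
// and let the SDK retry. Region and values are illustrative, not ours.
const dynamoClient = new DynamoDBClient({
  region: "us-east-1",
  // In SDK v3, maxAttempts counts the initial attempt plus retries.
  maxAttempts: 3,
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 5000, // ms allowed to establish the TCP connection
    socketTimeout: 5000,     // ms of socket inactivity before the request fails
  }),
});
```

With settings like these, the worst case is roughly the attempt count times 5 seconds plus backoff, comfortably inside the 29-second API Gateway window, which is what makes the observed full-29-second hangs so suspicious.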
We observe this more often on the URLs that are called more frequently, which suggests that every API call carries some small, constant chance of hitting the issue, so high-traffic endpoints simply accumulate more failures. (Illustratively, if each call independently failed 0.1% of the time, an endpoint hit 1,000 times a day would average one 504 per day.)
Here is a summary of 504s from Production over the last 30 days:
Here's a handy resource:
https://seed.run/blog/how-to-fix-dynamodb-timeouts-in-serverless-application.html
Business Impact/Reason for Severity
In which environment did you see this bug?
Staging / Production
Who were you logged in as?
Smoketests were running / Reviewing Kibana logs
What were you doing when you discovered this bug? (Using the application, demoing, smoke tests, testing other functionality, etc.)
Running smoke tests. It failed when an API endpoint (/api/notifications) timed out. A subsequent retry passed.
To Reproduce
Unfortunately, you cannot reliably replicate this bug, since it depends on intermittent behavior on the AWS side. Your best bet is to continuously query /maintenance-mode or /notifications until one of the requests eventually returns a 504; a sketch of such a polling loop is below.
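For anyone who wants to try, a minimal polling loop along these lines will eventually surface one. This assumes Node 18+ for the global fetch, and BASE_URL is a placeholder, not the real deployment URL:

```ts
// Hypothetical polling sketch: hammer one of the affected endpoints until a
// 504 is observed. Expect to wait a while; the failure rate is low.
const BASE_URL = "https://app.example.com"; // placeholder deployment URL

async function pollUntil504(path: string): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    const started = Date.now();
    const res = await fetch(`${BASE_URL}${path}`);
    console.log(`attempt ${attempt}: ${res.status} in ${Date.now() - started} ms`);
    if (res.status === 504) {
      console.log(`reproduced the 504 after ${attempt} attempts`);
      return;
    }
  }
}

pollUntil504("/maintenance-mode").catch(console.error);
```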
Expected Behavior
These requests should not 504; they should return quickly. Under the hood, the DynamoDB HTTP requests should fail fast and be retried.
Actual Behavior
Lambdas that usually respond in milliseconds are taking the full 29 seconds. We suspect that, under the hood, the DynamoDB HTTP requests are not timing out.
Screenshots
Cause of Bug, If Known
It's suspected that the DynamoDB client is falling back to its default configuration, which has a timeout of 2 minutes. If that's the case, a hung request would never time out, and therefore never be retried, within the 29 seconds API Gateway gives the Lambda.
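One plausible way this could happen (an assumption, not confirmed from the code): the custom httpOptions never reach the client instance serving these endpoints, so it runs with the v2 SDK defaults. A sketch of the contrast, using the v2 JavaScript SDK:

```ts
import DynamoDB from "aws-sdk/clients/dynamodb";

// In the v2 SDK, httpOptions.timeout defaults to 120000 ms (2 minutes).
// Any instantiation the custom config never reaches silently inherits it:
const withDefaults = new DynamoDB(); // socket timeout: 120 s

// What the intended 5 s / 3-retry tuning looks like when applied in v2:
const tuned = new DynamoDB({
  maxRetries: 3,
  httpOptions: {
    connectTimeout: 5000, // ms to establish the connection
    timeout: 5000,        // ms of socket inactivity before the request fails
  },
});
```

If that is the failure mode, a hung connection never reaches its 120 s socket timeout, because API Gateway abandons the request at 29 s first, so the SDK never gets a chance to retry. It would also be consistent with the closing comment above, where the v3 upgrade plus config tuning made the intermittent 504s stop.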
Process for Logging a Bug:
Severity Definition:
Critical Defect: Blocks the entire system's or module's functionality. No workarounds are available. Testing cannot proceed further without the bug being fixed.
High-severity Defect: Affects key functionality of an application. A workaround exists, but it is not obvious or easy. The app behaves in a way that is strongly different from what is stated in the requirements.
Medium-severity Defect: A minor function does not behave as stated in the requirements. A workaround is available and easy.
Low-severity Defect: Mostly related to an application's UI. Does not need a workaround, because it does not impact functionality.
Definition of Ready for Bugs (Created 10-4-21)
Definition used: A failure or flaw in the system which produces an incorrect or undesired result that deviates from the expected result or behavior. (Note: Expected results are use cases that have been documented in past user stories as acceptance criteria and test cases, and do not include strange behavior unrelated to use cases.)
The following criteria must be met in order for the development team to begin work on the bug.
The bug must:
Process: If the unexpected results are new use cases that have been identified, but not yet built, new acceptance criteria and test cases should be captured in a new user story and prioritized by the product owner.
If the Court is not able to reproduce the bug, add the "Unable to reproduce" tag. This provides visibility into the kind of support the Court may need; the Court will then work with Flexion to communicate what troubleshooting help is required.
Definition of Done (Updated 4-14-21)
Product Owner
Engineering
Deployed to the test environment if prod-like data is required. Otherwise, deployed to any experimental environment for review.