Related to devex 1236 (/maintenance-mode). Validation of the fix, after the DB upgrade, will come when 1236-devex-pr-to-staging is merged. Since the previous deployment, there have been a number of 504 errors, but they are all related to a newly created bug. The intermittent ones on various endpoints appear to have been solved by the upgrade of DynamoDB to v3 and tuning of its config.
Describe the Bug
We are observing a small number of 504s on various endpoints. This causes the application to fail, because it expects a healthy 200 status code along with the information it requested. Work has been done previously to improve reliability, but a problem still exists and warrants further investigation.
We observed this while running smoke tests, and we have also seen the 504s in the Kibana logs.
The 504s are occurring because the Lambda is using the full 29 seconds, and API Gateway gives up and returns a 504.
It's unclear why the Lambdas are using the full 29 seconds. When we instantiate the DynamoDB client, we specify config to make requests time out after 5 seconds of inactivity, and after that timeout it's configured to retry a maximum of 3 times. That should get the job done well within the allotted 29 seconds unless multiple consecutive failures occur, which is unlikely for a service with 99%+ uptime.
Config in ApplicationContext
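The config itself isn't reproduced in this issue, but the intent described above would look roughly like this with the AWS SDK for JavaScript v3. This is a sketch, not the actual ApplicationContext code; the region, package choice, and exact values are assumptions:

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";

// Sketch of the intended settings: fail a request after 5 s of inactivity,
// and let the SDK retry. Region and values are illustrative, not ours.
const dynamoClient = new DynamoDBClient({
  region: "us-east-1",
  // In SDK v3, maxAttempts counts the initial attempt plus retries.
  maxAttempts: 3,
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 5000, // ms allowed to establish the TCP connection
    socketTimeout: 5000,     // ms of socket inactivity before the request fails
  }),
});
```

With settings like these, the worst case is roughly the attempt count times 5 seconds plus backoff, comfortably inside the 29-second API Gateway window, which is what makes the observed full-29-second hangs so suspicious.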
We observe this more often on the URLs that are called more frequently, which suggests that every API call carries some small, constant chance of hitting the issue, so high-traffic endpoints simply accumulate more failures. (Illustratively, if each call independently failed 0.1% of the time, an endpoint hit 1,000 times a day would average one 504 per day.)
Here is a summary of 504s from Production over the last 30 days:
Here's a handy resource:
https://seed.run/blog/how-to-fix-dynamodb-timeouts-in-serverless-application.html
Business Impact/Reason for Severity
In which environment did you see this bug?
Staging / Production
Who were you logged in as?
Smoketests were running / Reviewing Kibana logs
What were you doing when you discovered this bug? (Using the application, demoing, smoke tests, testing other functionality, etc.)
Running smoke tests. It failed when an API endpoint (/api/notifications) timed out. A subsequent retry passed.
To Reproduce
Unfortunately, you cannot reliably replicate this bug, since it depends on intermittent behavior on the AWS side. Your best bet is to continuously query /maintenance-mode or /notifications until one of the requests eventually returns a 504; a sketch of such a polling loop is below.
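For anyone who wants to try, a minimal polling loop along these lines will eventually surface one. This assumes Node 18+ for the global fetch, and BASE_URL is a placeholder, not the real deployment URL:

```ts
// Hypothetical polling sketch: hammer one of the affected endpoints until a
// 504 is observed. Expect to wait a while; the failure rate is low.
const BASE_URL = "https://app.example.com"; // placeholder deployment URL

async function pollUntil504(path: string): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    const started = Date.now();
    const res = await fetch(`${BASE_URL}${path}`);
    console.log(`attempt ${attempt}: ${res.status} in ${Date.now() - started} ms`);
    if (res.status === 504) {
      console.log(`reproduced the 504 after ${attempt} attempts`);
      return;
    }
  }
}

pollUntil504("/maintenance-mode").catch(console.error);
```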
Expected Behavior
These requests should not 504; they should return quickly. Under the hood, the DynamoDB HTTP requests should fail fast and be retried.
Actual Behavior
Lambdas that usually respond in milliseconds are taking the full 29 seconds. We suspect that, under the hood, the DynamoDB HTTP requests are not timing out.
Screenshots
Cause of Bug, If Known
It's suspected that the DynamoDB client is falling back to its default configuration, which has a timeout of 2 minutes. If that's the case, a hung request would never time out, and therefore never be retried, within the 29 seconds API Gateway gives the Lambda.
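One plausible way this could happen (an assumption, not confirmed from the code): the custom httpOptions never reach the client instance serving these endpoints, so it runs with the v2 SDK defaults. A sketch of the contrast, using the v2 JavaScript SDK:

```ts
import DynamoDB from "aws-sdk/clients/dynamodb";

// In the v2 SDK, httpOptions.timeout defaults to 120000 ms (2 minutes).
// Any instantiation the custom config never reaches silently inherits it:
const withDefaults = new DynamoDB(); // socket timeout: 120 s

// What the intended 5 s / 3-retry tuning looks like when applied in v2:
const tuned = new DynamoDB({
  maxRetries: 3,
  httpOptions: {
    connectTimeout: 5000, // ms to establish the connection
    timeout: 5000,        // ms of socket inactivity before the request fails
  },
});
```

If that is the failure mode, a hung connection never reaches its 120 s socket timeout, because API Gateway abandons the request at 29 s first, so the SDK never gets a chance to retry. It would also be consistent with the closing comment above, where the v3 upgrade plus config tuning made the intermittent 504s stop.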
Process for Logging a Bug:
Severity Definition:
Critical Defect: Blocks the entire system's or module's functionality. No workarounds are available. Testing cannot proceed further without the bug being fixed.
High-severity Defect: Affects key functionality of an application. A workaround exists, but it is not obvious or easy. The app behaves in a way that is strongly different from what is stated in the requirements.
Medium-severity Defect: A minor function does not behave as stated in the requirements. A workaround is available and easy.
Low-severity Defect: Mostly related to an application's UI. Does not need a workaround, because it does not impact functionality.
Definition of Ready for Bugs (Created 10-4-21)
Definition used: A failure or flaw in the system which produces an incorrect or undesired result that deviates from the expected result or behavior. (Note: Expected results are use cases that have been documented in past user stories as acceptance criteria and test cases, and do not include strange behavior unrelated to use cases.)
The following criteria must be met in order for the development team to begin work on the bug.
The bug must:
Process: If the unexpected results are new use cases that have been identified, but not yet built, new acceptance criteria and test cases should be captured in a new user story and prioritized by the product owner.
If the Court is not able to reproduce the bug, add the "Unable to reproduce" tag. This provides visibility into the kind of support the Court may need; the Court will then work with Flexion to communicate what troubleshooting help is required.
Definition of Done (Updated 4-14-21)
Product Owner
Engineering
Deployed to the test environment if prod-like data is required. Otherwise, deployed to any experimental environment for review.