flexion / ef-cms

An Electronic Filing / Case Management System.

BUG: Intermittent 504s occurring on various endpoints #10198

Closed: mmarcotte closed this issue 8 months ago

mmarcotte commented 9 months ago

Describe the Bug

We are observing a small number of 504s on various endpoints. This causes the application to fail because it expects a healthy 200 status code along with the data it requested. Work has been done previously to improve reliability, but a problem still exists and warrants further investigation.

We first observed this while running smoketests, and we have also seen these errors in the Kibana logs.

The 504s are occurring because the Lambda is using the full 29 seconds, and API Gateway gives up and returns a 504.

It's unclear why the Lambdas are using the full 29 seconds. When we instantiate the DynamoDB client, we specify config that makes requests time out after 5 seconds of inactivity, and after a timeout the client is configured to retry a maximum of 3 times. This should get the job done within the allotted 29 seconds unless multiple failures occur, which is unlikely for a service with 99%+ uptime.

Config in ApplicationContext

dynamoCache[type] = new DynamoDB({
  endpoint: useMasterRegion
    ? environment.masterDynamoDbEndpoint
    : environment.dynamoDbEndpoint,
  httpOptions: {
    connectTimeout: 3000,
    timeout: 5000,
  },
  maxRetries: 3,
  region: useMasterRegion ? environment.masterRegion : environment.region,
});

We observe this more frequently on URLs that are called more often, which suggests that every API call has some small chance of hitting the timeout; endpoints we call more often simply accumulate more failures.

Here is a summary of 504s from Production over the last 30 days:

[Screenshot: 504 summary, 2023-11-22 at 10:25 AM]

Here's a handy resource:

https://seed.run/blog/how-to-fix-dynamodb-timeouts-in-serverless-application.html

Business Impact/Reason for Severity

In which environment did you see this bug?

Staging / Production

Who were you logged in as?

Smoketests were running / Reviewing Kibana logs

What were you doing when you discovered this bug? (Using the application, demoing, smoke tests, testing other functionality, etc.)

Running smoketests. They failed when an API endpoint (/api/notifications) timed out; a subsequent retry passed.

To Reproduce

Unfortunately, you cannot reliably replicate this bug; it depends on intermittent AWS availability. Your best bet is to continuously query /maintenance-mode or /notifications until one of the requests eventually returns a 504 (see the sketch below).
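A minimal polling sketch for that approach (not code from the repo; the base URL is a placeholder for whichever environment you are targeting, and any required auth headers are omitted):

const BASE_URL =
  process.env.EFCMS_API_URL ?? 'https://example.execute-api.us-east-1.amazonaws.com';

// Hit the path repeatedly, logging status and latency for each attempt so the
// slow (~29 second) requests stand out, and stop once a 504 is observed.
async function pollUntil504(path) {
  for (let attempt = 1; ; attempt++) {
    const started = Date.now();
    const response = await fetch(`${BASE_URL}${path}`);
    const elapsedMs = Date.now() - started;
    console.log(`#${attempt} ${path} -> ${response.status} in ${elapsedMs}ms`);
    if (response.status === 504) {
      console.log('Reproduced the intermittent 504.');
      return;
    }
  }
}

pollUntil504('/maintenance-mode');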

Expected Behavior

These requests should not 504. They should return quickly. Under the hood, the DynamoDB HTTP Requests should fail fast and be retried.

Actual Behavior

Lambdas that usually take milliseconds to respond are taking 29 seconds. It is suspected that under the hood the DynamoDB HTTP Requests are not timing out.
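One way to confirm that suspicion (a diagnostic sketch, not existing code; the timed helper and the persistence call in the usage comment are hypothetical) is to time each DynamoDB call from inside the Lambda and log the duration, so CloudWatch shows whether individual calls honor the 5-second timeout or hang until API Gateway's 29-second limit:

// Wrap a promise-returning call and log how long it actually took.
async function timed(label, call) {
  const started = Date.now();
  try {
    return await call();
  } finally {
    console.log(`${label} took ${Date.now() - started}ms`);
  }
}

// Example usage around an existing persistence call (names are hypothetical):
// await timed('getCaseByDocketNumber', () =>
//   getCaseByDocketNumber({ applicationContext, docketNumber }));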

Screenshots

[Screenshot: 2023-11-22 at 10:30 AM]

Cause of Bug, If Known

It's suspected that the DynamoDB client is falling back to the default configuration, which has a timeout of 2 minutes. If that is the case, requests would never be retried within the 29-second window of the Lambda.
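If that suspicion is correct, the fix direction described in later comments (upgrading to the DynamoDB v3 client and tuning its config) would look roughly like the sketch below. This is an assumption based on the AWS SDK v3 client and @smithy/node-http-handler, mirroring the timeout values from the v2 config above rather than anything confirmed in this issue:

import { DynamoDB } from '@aws-sdk/client-dynamodb';
import { NodeHttpHandler } from '@smithy/node-http-handler';

// Explicit per-attempt timeouts so a stalled request fails fast and the SDK's
// retries fit inside API Gateway's 29-second window.
const dynamo = new DynamoDB({
  endpoint: useMasterRegion
    ? environment.masterDynamoDbEndpoint
    : environment.dynamoDbEndpoint,
  region: useMasterRegion ? environment.masterRegion : environment.region,
  maxAttempts: 3, // note: v3 counts total attempts, where v2's maxRetries counted retries only
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 3000, // ms allowed to establish the connection
    requestTimeout: 5000, // ms of inactivity before the attempt is aborted
  }),
});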

Process for Logging a Bug:

Severity Definition:

Definition of Ready for Bugs (Created 10-4-21)

Definition used: A failure or flaw in the system which produces an incorrect or undesired result that deviates from the expected result or behavior. (Note: Expected results are use cases that have been documented in past user stories as acceptance criteria and test cases, and do not include strange behavior unrelated to use cases.)

The following criteria must be met in order for the development team to begin work on the bug.

The bug must:

Process: If the unexpected results are new use cases that have been identified, but not yet built, new acceptance criteria and test cases should be captured in a new user story and prioritized by the product owner.

If the Court is not able to reproduce the bug, add the “Unable to reproduce” tag. This will provide visibility into the type of support that may be needed by the Court. In the event that the Court cannot reproduce the bug, the Court will work with Flexion to communicate what type of troubleshooting help may be needed.

Definition of Done (Updated 4-14-21)

Product Owner

Engineering

zachrog commented 9 months ago

Related to devex 1236

Absolutestunna commented 9 months ago
  1. Investigation and data gathering for the intermittent 504 failures on endpoints (Lambdas) querying DynamoDB (v3) are currently underway as part of the 1236-devex work in the Test environment.
  2. The experiment of load testing endpoints (or observing regular usage in the court's upper environments) querying DynamoDB v2 vs. v3 may not be worthwhile at the moment, given how low the failure rate currently is (for v2, /maintenance-mode). Validation of the fix, after the DB upgrade, will come once 1236-devex-pr-to-staging is merged.
mmarcotte commented 8 months ago
[Screenshot: 2023-12-29 at 10:41 AM]

Since the previous deployment, there have been a number of 504 errors, but they are all related to a newly created bug. The intermittent 504s on various endpoints appear to have been resolved by the upgrade of DynamoDB to v3 and the tuning of its config.