flexion / ef-cms

An Electronic Filing / Case Management System.

Health Check Reliability #10116

Closed TomElliottFlexion closed 1 year ago

TomElliottFlexion commented 1 year ago

As a maintainer of DAWSON, I need the health check endpoint to return quickly enough that it does not cause timeouts on the status_health_check_west and status_health_check_east Route53 health checks. As a maintainer of DAWSON, I also need to be able to re-route traffic between the east and west regions when API Gateway or Lambda has gone offline in a region.
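
One possible way to keep the endpoint's response time bounded is to run every dependency check in parallel and give each one a deadline, so a single slow dependency reports as unhealthy instead of stalling the whole response. The sketch below is only an illustration under that assumption; `withTimeout`, `runHealthChecks`, and the shape of the check list are hypothetical names, not taken from the DAWSON codebase.

```typescript
// Hypothetical types and names, for illustration only.
type HealthCheck = { name: string; check: () => Promise<boolean> };
type CheckResult = { name: string; healthy: boolean };

// Resolve with `fallback` if `promise` rejects or has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>(resolve => {
    const timer = setTimeout(() => resolve(fallback), ms);
    promise.then(
      value => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(fallback);
      },
    );
  });
}

// Run all checks concurrently; a check that is slow or throws counts as unhealthy.
async function runHealthChecks(
  checks: HealthCheck[],
  timeoutMs: number,
): Promise<{ healthy: boolean; results: CheckResult[] }> {
  const results = await Promise.all(
    checks.map(async ({ name, check }) => ({
      name,
      healthy: await withTimeout(check(), timeoutMs, false),
    })),
  );
  return { healthy: results.every(result => result.healthy), results };
}
```

With this shape, the total response time is roughly capped at `timeoutMs` regardless of how many checks are registered, so the timeout can be chosen to sit below whatever limit the Route53 health checks are configured with.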

Pre-Conditions

https://app.zenhub.com/workspaces/flexionef-cms-5bbe4bed4b5806bc2bec65d3/issues/gh/flexion/ef-cms/10069

Acceptance Criteria

Notes

Potential solutions

Tasks

Test Cases

Story Definition of Ready (updated on 12/23/22)

The following criteria must be met in order for the user story to be picked up by the Flexion development team. The user story must:

Definition of Done (Updated 5-19-22)

Product Owner

UX

Engineering

zachrog commented 1 year ago

After a discussion between Mike, Jim, Tom, Zach, and Chris, this story has been modified in service of a larger epic: making the DAWSON system more reliable and redundant. In broad strokes, the steps toward that epic are:

  1. Create a DNS failover system that works for basic outages like Lambda and API Gateway.
  2. Create a SPIKE to discover which infrastructure pieces DAWSON has in place, whether they are duplicated across regions, what it would take to duplicate them and make them capable of failing over to another region, what duplicating those resources would cost, and the historical uptime of each system, so that we can assess its risk to the project.
  3. Create redundant systems capable of accepting traffic from a failover, and create a monitoring system that can redirect that traffic to the healthy system.
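
To make the failover idea in the steps above concrete, here is a rough sketch of the routing decision a monitor could make: treat a region as unhealthy only after several consecutive failed checks, prefer the primary region, and fall back to the other region when the primary is down. This is an illustration of the logic only, not Route53 configuration; `RegionHealth`, `chooseRegion`, and the threshold value are hypothetical.

```typescript
// Hypothetical region names and types, for illustration only.
type Region = "east" | "west";

// Tracks consecutive failed health checks for one region.
class RegionHealth {
  private consecutiveFailures = 0;

  constructor(private readonly failureThreshold: number) {}

  // Record one health check result; any success resets the failure streak.
  record(ok: boolean): void {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
  }

  // A region is unhealthy once the failure streak reaches the threshold.
  isHealthy(): boolean {
    return this.consecutiveFailures < this.failureThreshold;
  }
}

// Prefer the primary region; fail over only when it is unhealthy and the other region is not.
function chooseRegion(primary: Region, health: Record<Region, RegionHealth>): Region {
  const secondary: Region = primary === "east" ? "west" : "east";
  if (health[primary].isHealthy()) return primary;
  if (health[secondary].isHealthy()) return secondary;
  return primary; // both unhealthy: stay on the primary rather than flapping
}
```

Requiring several consecutive failures before failing over is a deliberate choice in this sketch: it avoids re-routing traffic because of a single slow or dropped health check.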