department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Discovery: Investigate 4xx Errors that cause the prod.cms.va.gov EC2 Instance to Refresh After Hours #18865

Open 7hunderbird opened 2 months ago

7hunderbird commented 2 months ago

User Story or Problem Statement

Our team needs to investigate why the load balancer that sits above the prod.cms.va.gov EC2 instance has alerted.

Moreover, our initial investigation suggests that the prod.cms.va.gov EC2 instance automatically replaced itself (through an Auto Scaling Group lifecycle event) at the same time the load balancer alerts fired.

Description or Additional Context

On Aug 7th @ph-One made our team aware of a monitoring alert that happened after 8PM ET on Aug 6th.

The "Platform - AWS Application Load Balancer % of HTTP 4xx Responses" alert fires when more than 25% of the responses passing through the load balancer are 4xx. In other words, a large share of requests are being rejected with client-error status codes, either by the load balancer itself or by the resources behind it.
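To make the threshold concrete, here is a minimal sketch of the check the alert describes. The real monitor presumably evaluates load balancer metrics rather than raw status codes, so the function names and the sample data below are hypothetical; only the 25% threshold comes from the alert description.

```python
from typing import Iterable

# Threshold taken from the alert description; everything else is illustrative.
THRESHOLD_PERCENT = 25.0


def percent_4xx(status_codes: Iterable[int]) -> float:
    """Return the percentage of responses with a status in the 400-499 range."""
    total = 0
    client_errors = 0
    for code in status_codes:
        total += 1
        if 400 <= code <= 499:
            client_errors += 1
    if total == 0:
        # No traffic in the window: treat as 0% rather than dividing by zero.
        return 0.0
    return 100.0 * client_errors / total


def should_alert(status_codes: Iterable[int],
                 threshold: float = THRESHOLD_PERCENT) -> bool:
    """True when the 4xx share is strictly higher than the threshold."""
    return percent_4xx(status_codes) > threshold


# Hypothetical sample window: 3 of 8 responses are 4xx.
sample = [200, 200, 404, 403, 200, 301, 400, 200]
print(percent_4xx(sample), should_alert(sample))
```

One consequence of a percentage-based check is that it is sensitive to low traffic: after hours, a handful of 4xx responses can exceed 25% of a very small total, which may be worth ruling out during the investigation.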

400-series error codes can indicate a variety of issues, including:

In any case, if a load balancer can't find what it expects behind it, that's important to know about.

Steps for Implementation

Acceptance Criteria

7hunderbird commented 2 months ago

(Screenshot: CleanShot 2024-08-07 at 09 53 35)

gracekretschmer-metrostar commented 2 months ago

Next steps: create a BRD epic for CMS infrastructure research and small improvement tasks. Pull Tyler's tasks into that epic.