Near No-Downtime Upgrades: Investigate ElastiCache/Redis Auto-Upgrade Downtime

LindseySaari commented 8 months ago

Description

We encountered 6 minutes of unexpected downtime during an auto-upgrade of ElastiCache/Redis. Auto-upgrades are typically triggered in response to CVEs so that security-related upgrades are applied promptly. This instance, however, lasted longer than expected and disrupted operations.

Tasks

[ ] Research Downtime Cause: Can we incorporate an alternative feature set to obtain minimal write downtime during auto-upgrades?
[ ] AWS Documentation Analysis: Assess whether our current setup aligns with AWS documentation
[ ] Upgrade/Mitigation Plan: Develop strategies to minimize downtime during future auto-upgrades
[ ] Update Documentation
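
As a possible starting point for the AWS documentation analysis task, here is a minimal sketch that flags replication groups lacking the failover settings that typically shorten maintenance-related downtime. It assumes boto3 is available with credentials configured. The function names `failover_readiness` and `check_all` and the region default are hypothetical and are not taken from our infrastructure code. The fields it reads (`AutomaticFailover`, `MultiAZ`, `MemberClusters`) come from the ElastiCache `describe_replication_groups` response.

```python
def failover_readiness(group: dict) -> list:
    """Return readable gaps for one replication group description.

    `group` is one entry from the "ReplicationGroups" list returned by
    the ElastiCache describe_replication_groups API.
    """
    gaps = []
    if group.get("AutomaticFailover") != "enabled":
        gaps.append("automatic failover is not enabled")
    if group.get("MultiAZ") != "enabled":
        gaps.append("Multi-AZ is not enabled")
    if len(group.get("MemberClusters", [])) < 2:
        gaps.append("fewer than two member clusters, so no replica to promote")
    return gaps


def check_all(region_name: str = "us-east-1") -> dict:
    """Map each replication group ID to its list of gaps.

    The region default is a placeholder assumption; pass the real one.
    """
    import boto3  # imported here so failover_readiness works without boto3

    client = boto3.client("elasticache", region_name=region_name)
    paginator = client.get_paginator("describe_replication_groups")
    report = {}
    for page in paginator.paginate():
        for group in page["ReplicationGroups"]:
            report[group["ReplicationGroupId"]] = failover_readiness(group)
    return report
```

Calling `check_all()` against the account would give a per-group list of gaps to compare with what the AWS documentation recommends. An empty list only means these three settings are in place, not that the upgrade path is downtime-free.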

Success Metrics

[ ] ADD SUCCESS METRICS HERE

Acceptability Criteria

[ ] ADD A/C HERE

jennb33 commented 2 months ago

9/5/2024 update: we are moving this ticket to the Sharded/Non-Downtime objective.

flooose commented 2 months ago

I removed the reference to the POAM from this ticket to avoid confusion in the future.

department-of-veterans-affairs / va.gov-team