department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Near No-Downtime Upgrades: Discovery and Proposal for Upgrade Strategy, Redis #80969

Open jennb33 opened 5 months ago

jennb33 commented 5 months ago

Issue Description

AWS offers near-zero-downtime Redis upgrades. For reasons specific to our platform, we cannot currently take advantage of this, and an upgrade on our platform requires about two hours of downtime. This mainly stems from the fact that we don't own all of the code in vets-api, so we're unclear about the nature of the data in our clusters. Can the data be reproduced if it is lost? Are writes to the cluster robust if the cluster is unavailable for any amount of time?
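To make the two questions above concrete, here is a minimal sketch of what "reproducible data" and "robust writes" could look like from a caller's point of view. It is not taken from vets-api: `FakeStore`, `StoreUnavailable`, `fetch`, and `load_from_source` are hypothetical names, and the in-memory store only stands in for a Redis client so the example runs on its own.

```python
class StoreUnavailable(Exception):
    """Raised by the stand-in store while it is 'offline' (hypothetical)."""


class FakeStore:
    """In-memory stand-in for a Redis client; not a real client API."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise StoreUnavailable(key)
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise StoreUnavailable(key)
        self.data[key] = value


def fetch(store, key, load_from_source):
    """Cache-aside read that falls back to the source of truth on failure."""
    try:
        cached = store.get(key)
        if cached is not None:
            return cached
    except StoreUnavailable:
        # Store is down: serve from the source and skip the cache write.
        return load_from_source(key)
    value = load_from_source(key)
    try:
        store.set(key, value)
    except StoreUnavailable:
        pass  # A lost cache write is acceptable only for reproducible data.
    return value
```

Data accessed this way would survive a brief cluster outage or a cache flush, because every value can be rebuilt from `load_from_source`. Keys that have no such source of truth are the ones an upgrade could lose, so those are the ones worth identifying first.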

This has led us to upgrade paths that take the entire application stack offline, as documented for our most recent upgrade.

The goal of this ticket is to review what is needed for the next Redis upgrade and to provide the OCTO POs with a strategy for implementing that upgrade and meeting OCTO's Redis needs (not wants). Some research has already been done on ElastiCache clusters. We need to determine whether the clusters work is a need or a want, and whether it is a priority to include in the next upgrade. Because the goal is zero downtime, the clusters work might be a necessary inclusion.

Tasks

Success Metrics

Describe what success looks like for this work. Define specific, measurable outcomes that indicate success.

Acceptance Criteria

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

jennb33 commented 1 month ago

7/25 update: @Kshitiz-devops verified that no downtime is required for Postgres, which uses a blue-green upgrade strategy.