Near No-Downtime Upgrades: Discovery and Proposal for Upgrade Strategy, Redis

Issue Description

AWS offers near-zero-downtime Redis upgrades. For reasons specific to our platform, we cannot currently take advantage of this, and an upgrade on our platform requires about two hours of downtime. This is mainly because we don't own all of the code in vets-api, so we don't know the nature of the data in our clusters. Can the data be reproduced if it is lost? Are writes to the cluster robust if the cluster is unavailable for any amount of time?
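To make the "robust writes" question concrete, here is a minimal Python sketch of what a write path that tolerates a short cluster outage could look like. It is only an illustration, not how vets-api currently behaves: `FlakyStore` is a hypothetical stand-in for a Redis client, and `write_with_retry`, the key names, and the retry parameters are all assumptions made for the example.

```python
import time


class FlakyStore:
    """Hypothetical stand-in for a Redis client; fails the first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.data = {}

    def set(self, key, value):
        # Simulate the cluster being unavailable for a few calls.
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("cluster unavailable")
        self.data[key] = value


def write_with_retry(store, key, value, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Try a write up to `attempts` times, doubling the delay after each failure.

    Returns True if the write landed, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            store.set(key, value)
            return True
        except ConnectionError:
            # Back off before the next attempt; skip the wait after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False


# A write that survives a brief outage (two failed attempts, then success).
store = FlakyStore(failures=2)
ok = write_with_retry(store, "session:123", "payload", sleep=lambda s: None)

# A write that gives up because the outage outlasts the retry budget.
gave_up = write_with_retry(FlakyStore(failures=10), "session:456", "payload", sleep=lambda s: None)
```

A caller that gets `False` back still has to decide what to do with the data (drop it, queue it, or surface an error), which is exactly the per-consumer question this ticket needs answered before downtime can be reduced.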

This has led us to upgrade paths that take the entire application stack offline, as documented for our most recent upgrade.

The goal of this ticket is to review what is needed for the next Redis upgrade and to give the OCTO POs a strategy for implementing that upgrade in a way that meets OCTO's Redis needs (not wants). Some research has already been done on ElastiCache clusters; we need to determine whether that work is a need or a want, and therefore whether it is a priority for the next upgrade. Because the goal is zero downtime, the clusters work might be a necessary inclusion.

Tasks

[ ] Determine what the functional needs are that Redis is not currently meeting.
[ ] Determine what the technical needs are that Redis is not currently meeting.
[ ] Determine what the Level of Effort will be to meet these needs.
[ ] Create a proposal that includes an implementation strategy for internal team review.
[ ] Finalize the proposal for presentation to the OCTO POs.

Success Metrics

Describe what success looks like for this work. Define specific, measurable outcomes that indicate success.

Acceptance Criteria

[ ] Proposal and implementation strategy for Redis Upgrade

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.
