Near No-Downtime Upgrades / Sharded Upgrades - Discovery

As the owners of the Platform, we need to understand the options for near no-downtime/sharded upgrades of our tools, so that upgrades cause near-zero disruption (in the form of downtime) for Platform users.

Issue Description

We need to understand the options for near no-downtime upgrades for the Platform Product team and the tools currently in operation. This is the discovery epic to thoroughly investigate what is possible with our current tools, or to review other options on the market, for implementing near zero-downtime upgrades in the future.

AWS offers near-zero downtime Redis upgrades, but for reasons specific to our platform we are currently unable to take advantage of them. As a result, an upgrade on our platform currently requires about two hours of downtime. The reasons mainly stem from the fact that the Platform Product team doesn't own all of the code in vets-api and therefore lacks a clear understanding of the data in our clusters. Can the data be reproduced if it gets lost? Are writes to the cluster robust if the cluster is unavailable for any amount of time?

This has led us to upgrade paths that take the entire application stack offline, as documented for our most recent upgrade.

Documentation
- Risk Assessment
- Update Outline and Steps
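
To make the questions about lost data and cluster unavailability concrete, here is a minimal sketch of what a "robust" read path could look like: the caller falls back to the source of truth when the cache is unreachable. Everything here is an assumption for illustration. `InMemoryCache`, `CacheUnavailable`, and `fetch_with_fallback` are hypothetical names, not vets-api code or a real Redis client API.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the stand-in cache while it is 'offline' (hypothetical)."""


class InMemoryCache:
    """Hypothetical stand-in for a Redis client; not a real client API."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self.online = True  # flip to False to simulate an upgrade window

    def get(self, key: str) -> Optional[str]:
        if not self.online:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.online:
            raise CacheUnavailable(key)
        self._data[key] = value


def fetch_with_fallback(
    cache: InMemoryCache,
    key: str,
    load_from_source: Callable[[str], str],
) -> str:
    """Read through the cache, but keep working if the cache is down."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is down: skip it and serve from the source of truth.
        return load_from_source(key)

    # Cache miss: load the value and try to repopulate the cache.
    value = load_from_source(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # A failed cache write must not fail the request.
    return value
```

Code paths that already behave like this only lose performance during an upgrade. Code paths that treat the cluster as the only copy of the data are the ones the risk assessment would need to flag.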

The goal of this epic is to scrutinize our Redis clusters and change them, to the extent possible, so that we can take advantage of AWS upgrade paths, or minimize downtime for clusters that can't follow these paths (Sidekiq might be an example).
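
One possible starting point for that scrutiny is a per-namespace inventory of keys and their expiry. The sketch below assumes keys follow a `namespace:rest` naming pattern and that a sample of `(key, ttl)` pairs has already been collected by some other means; the key names in the usage example are made up. The only Redis behavior relied on is that the `TTL` command returns `-1` for a key that exists but has no expiry.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

NO_EXPIRY = -1  # Redis TTL returns -1 for a key that exists but has no expiry


def summarize_keys(samples: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, int]]:
    """Group sampled keys by namespace and count those that never expire.

    `samples` is an iterable of (key, ttl_seconds) pairs. The namespace is
    assumed to be the text before the first ':' in the key.
    """
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"total": 0, "no_expiry": 0}
    )
    for key, ttl in samples:
        namespace = key.split(":", 1)[0]
        summary[namespace]["total"] += 1
        if ttl == NO_EXPIRY:
            summary[namespace]["no_expiry"] += 1
    return dict(summary)


# Usage with made-up sample data:
sample = [("session:1", 3600), ("session:2", 1800), ("queue:default", -1)]
report = summarize_keys(sample)
```

Namespaces where most keys have no expiry are more likely to hold data that is not a disposable cache, so they are reasonable candidates to review first with the owning teams.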

Tasks

[ ] Review current tools (Redis?) for near zero-downtime upgrade options
[ ] Review other tools on the market that would provide zero-downtime upgrades
[ ] Risk assessment comparing all tools
[ ] Cost analysis comparing all tools
[ ] Feasibility study comparing all tools
[ ] Other tasks as necessary

Success Metrics

[ ] Risk assessment complete
[ ] Cost analysis complete
[ ] Feasibility study complete

Acceptance Criteria

[ ] Final cost analysis ready for review by OCTO PO/TL
[ ] Final risk assessment ready for review by OCTO PO/TL
[ ] Final feasibility study ready for review by OCTO PO/TL

OKR

O1. Our digital experiences are the best way to access VA health care and benefits

OKR 3: All new products have a faster transaction time than those they replaced

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

### Tasks
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/80969
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/85606
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/85607
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/89355
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/89356
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/78036
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/78048
