department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Near No-Downtime Upgrades / Sharded Upgrades - Discovery #84213

Open jennb33 opened 5 months ago

jennb33 commented 5 months ago

As the owners of the Platform, we need to understand the options for near no-downtime/sharded upgrades of our tools, so that upgrades cause near-zero disruption (in the form of downtime) for Platform users.

Issue Description

We need to understand the options for near no-downtime upgrades for the Platform Product team and the tools currently in operation. This is the discovery epic to thoroughly investigate what is possible with our current tools, or to review other options on the market, for implementing near zero-downtime upgrades in the future.

AWS offers near-zero downtime Redis upgrades, but for reasons specific to our platform we are currently unable to take advantage of them. As a result, an upgrade on our platform currently requires about two hours of downtime. The reasons mainly stem from the fact that the Platform Product team doesn't own all of the code in vets-api and therefore lacks a clear understanding of the data in our clusters. Can the data be reproduced if it gets lost? Are writes to the cluster robust if the cluster is unavailable for any amount of time?

This has led us to upgrade paths that take the entire application stack offline, as documented for our most recent upgrade.

The goal of the stories in this epic is to scrutinize our Redis clusters and change them, to the extent possible, so that we can take advantage of AWS upgrade paths, or minimize downtime for clusters that can't follow these paths (Sidekiq might be an example).
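
To make the two questions above concrete (can the data be reproduced, and are writes robust while the cluster is unavailable), here is a minimal sketch of a read-through cache that fails open. It is illustrative only and is not taken from vets-api: `InMemoryCache`, `CacheUnavailable`, and `fetch` are hypothetical names, and the in-memory class merely stands in for a Redis cluster that can be switched offline.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the stand-in cache while the 'cluster' is offline."""


class InMemoryCache:
    """Hypothetical stand-in for a Redis cluster; not a real Redis client."""

    def __init__(self) -> None:
        self.available = True
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise CacheUnavailable(key)
        self._data[key] = value


def fetch(cache: InMemoryCache, key: str, compute: Callable[[], str]) -> str:
    """Read-through lookup that falls back to recomputing when the cache is down."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is offline: rebuild the value from its source and skip the write.
        return compute()

    value = compute()
    try:
        cache.set(key, value)
    except CacheUnavailable:
        # Dropping this write is safe only because the value can be recomputed.
        pass
    return value


cache = InMemoryCache()
print(fetch(cache, "profile:1", lambda: "from-source"))  # miss, then cached
cache.available = False                                  # simulate an upgrade window
print(fetch(cache, "profile:1", lambda: "from-source"))  # served without the cache
```

A cluster whose callers all behave this way could plausibly tolerate a short outage during an upgrade. Data that cannot be recomputed from another source would not fit this pattern and would need a different approach.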

Tasks

Success Metrics

Acceptance Criteria

OKR

O1. Our digital experiences are the best way to access VA health care and benefits
OKR 3: All new products have a faster transaction time than those they replaced

Other Stories Worth Considering

Aside from the stories in this epic, is it worth exploring the following:

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

### Tasks
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/80969
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/85606
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/85607
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/89355
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/89356
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/78036
- [ ] https://github.com/department-of-veterans-affairs/va.gov-team/issues/78048
flooose commented 5 months ago

We'll need stories similar to "Discovery: dsva-vagov-rails-cache" for the other two Redis instances.