As the Owners of the Platform,
We need to understand the options for near-zero-downtime/sharded upgrades of our tools,
So that users experience near-zero disruption (in the form of downtime) when we implement upgrades to the Platform.
Issue Description
We need to understand the options for performing near-zero-downtime upgrades of the tools the Platform Product team currently operates. This is the discovery Epic to thoroughly investigate what is possible with our current tools, or to review other options on the market, for implementing a near-zero-downtime path for future upgrades.
AWS offers near-zero-downtime Redis upgrades, but for reasons specific to our platform we are currently unable to take advantage of them. As a result, an upgrade on our platform currently requires about two hours of downtime. This is mainly because the Platform Product team doesn't own all of the code in vets-api and therefore lacks a clear understanding of the data in our clusters. Is it known whether the data can be reproduced if it is lost? Are writes to the cluster robust if the cluster is unavailable for any amount of time?
This has led us to upgrade paths that take the entire application stack offline, as documented for our most recent upgrade.
The goal of the stories in this epic is to scrutinize our Redis clusters and change them, to the extent possible, so that we can take advantage of AWS upgrade paths, or minimize downtime for clusters that can't follow these paths (Sidekiq might be an example).
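To make the question about robust writes concrete, here is a minimal sketch of what "robust" could mean for a single write: retry with exponential backoff while the cluster is briefly unavailable. It is written in Python for illustration only and is not taken from vets-api. `ClusterUnavailable`, `write_with_retry`, and `FlakyStore` are hypothetical names, and `FlakyStore` is a stand-in for a real Redis client.

```python
import time


class ClusterUnavailable(Exception):
    """Hypothetical error raised while a cluster node is being replaced."""


def write_with_retry(write, key, value, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call write(key, value), retrying with exponential backoff.

    Returns the number of attempts used. Re-raises ClusterUnavailable
    if the final attempt also fails, so the caller can decide what to do.
    """
    for attempt in range(1, attempts + 1):
        try:
            write(key, value)
            return attempt
        except ClusterUnavailable:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyStore:
    """Stand-in for a Redis client that rejects the first few writes."""

    def __init__(self, failures):
        self.failures = failures
        self.data = {}

    def set(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise ClusterUnavailable("failover in progress")
        self.data[key] = value
```

A caller might use `write_with_retry(store.set, "session:1", "abc")`. Code paths that already behave this way could likely tolerate a short failover; code paths that fail on the first error are the ones this epic would need to identify.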
Tasks
[ ] Review whether our current tools (Redis?) offer a near-zero-downtime upgrade option
[ ] Review other tools on the market that would provide near-zero-downtime upgrades
[ ] Risk assessment comparing all tools
[ ] Cost analysis comparing all tools
[ ] Feasibility study comparing all tools
[ ] Other tasks as necessary
Success Metrics
[ ] Risk assessment complete
[ ] Cost analysis complete
[ ] Feasibility study complete
Acceptance Criteria
[ ] Final cost analysis ready for review by OCTO PO/TL
[ ] Final risk assessment ready for review by OCTO PO/TL
[ ] Final feasibility study ready for review by OCTO PO/TL
OKR
O1. Our digital experiences are the best way to access VA health care and benefits
OKR 3: All new products have a faster transaction time than those they replaced
Other Stories Worth Considering
Aside from the stories in this epic, it is worth exploring the following:
What does it take to continue to serve static content (CMS) during an upgrade?
Can we get lighthouse out of revproxy?
Need to understand what the upgrade path for our desired feature set looks like for doing a partitioned upgrade.
Validation
Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.