As a system owner of Notify,
I need to be able to recover from disasters as quickly as possible
and have the necessary steps automated and documented for several scenarios,
so that I can support our business continuity plan.
The business continuity plan is important because it ensures Notify can recover from high-risk situations, such as the environment being destroyed by an external attack or an AWS region going down for any reason.
As the SRE member in charge of BCP for Notify, I would like the infrastructure as code for Notify to be able to deploy to a brand new AWS account without any errors.
WHY are we building?
GC Notify is a critical service that supports other departments, which could themselves be affected by disasters. It is important that we restore Notify as quickly as possible so that our clients can communicate with their own clients.
WHAT are we building?
Identify BCP scenarios
Create a BCP document with scenario remediation steps
Pathfind the BCP readiness process for GC Notify and possibly other CDS products such as Forms
Improve the terragrunt code for ease of remediation
VALUE created by our solution
BCP Readiness
Acceptance Criteria
[ ] Terragrunt run-all apply works correctly (low effort)
[ ] Terragrunt deployment to an empty AWS account works correctly (low effort)
[ ] We have determined the time it takes to build GC Notify from scratch in a reliable manner. This will be useful when we share our BCP report and have our users assess the risk of how long GC Notify could be down in an emergency.
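One way to capture the build time named in the last criterion is to wrap the deployment command in a small timer. The sketch below is illustrative only and is not part of the Notify repositories: the helper names, the TERRAGRUNT_APPLY constant, and the choice of the non-interactive flag are assumptions, and the exact Terragrunt command and flag spelling depend on the Terragrunt version in use.

```python
import subprocess
import time

# Assumed command line; adjust to the Terragrunt version and flags actually in use.
TERRAGRUNT_APPLY = ["terragrunt", "run-all", "apply", "--terragrunt-non-interactive"]


def time_command(cmd, cwd=None):
    """Run cmd to completion and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(cmd, cwd=cwd)
    elapsed = time.monotonic() - start
    return result.returncode, elapsed


def format_duration(seconds):
    """Render a duration in seconds as 'Hh MMm SSs' for the BCP report."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"
```

Calling time_command(TERRAGRUNT_APPLY, cwd=...) from the environment directory and passing the elapsed value to format_duration gives one data point; repeating the run against a fresh account or environment several times would show how stable the figure is.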
QA Steps
[ ] Deployed against a new AWS account
[ ] Deployed against a new environment in existing AWS account
BCP Scenarios
Set up a new environment in the same region.
Set up a new environment in a new region.
Re-import the database following a hypothetical incident that corrupts the current database.
Manually release GC Notify without the GitHub automation in place.
Spoke with Pat re: Satellite S3 buckets. These are managed by another TF repository. I'm going to look into refactoring the code to accommodate this rather than importing these resources.
Upgraded AWS provider to 4.0 in staging and production