ben851 commented 1 year ago

Description

As a system owner of Notify, I need to be able to recover from disasters as quickly as possible and have the necessary steps automated and documented for several scenarios, so that I can support our business continuity plan.

The business continuity plan (BCP) is important because it ensures Notify can recover from high-risk situations, such as the environment being destroyed by an external attack or an AWS region going down for any reason.

As the SRE member in charge of BCP for Notify, I would like the infrastructure as code for Notify to be able to deploy to a brand new AWS account without any errors.

WHY are we building?

GC Notify is a critical service that supports other departments that could also be affected by disasters. It is important that we restore Notify as quickly as possible so that our clients can communicate with their own clients.

WHAT are we building?

Identify BCP scenarios
Create a BCP document with scenario remediation steps
Pathfind the BCP readiness process for GC Notify and possibly other CDS products such as Forms
Improve the terragrunt code for ease of remediation

VALUE created by our solution

BCP Readiness

Acceptance Criteria

[ ] Terragrunt run-all apply works correctly (low effort)
[ ] Terragrunt deployment to an empty AWS account works correctly (low effort)
[ ] We have reliably determined how long it takes to build GC Notify from scratch. This will be useful when we share our BCP report, so that our users can assess the risk of how long GC Notify could be down in an emergency.
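
One way to approach the timing criterion above is to wrap the apply in a small script that records wall-clock time. This is a minimal sketch, not the team's actual tooling: the `timed_apply` helper and the `env/scratch` directory are hypothetical, and the `--terragrunt-non-interactive` flag is assumed to match the installed terragrunt version (newer releases rename it).

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

# Assumed invocation; the flag stops terragrunt from prompting for confirmation.
APPLY_CMD = ["terragrunt", "run-all", "apply", "--terragrunt-non-interactive"]


def run_command(cmd: Sequence[str], cwd: str) -> int:
    """Run cmd inside cwd and return its exit code."""
    return subprocess.run(list(cmd), cwd=cwd).returncode


def timed_apply(
    env_dir: str,
    runner: Callable[[Sequence[str], str], int] = run_command,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, float]:
    """Apply every module under env_dir; return (exit_code, elapsed_seconds)."""
    start = clock()
    exit_code = runner(APPLY_CMD, env_dir)
    return exit_code, clock() - start
```

Calling `timed_apply("env/scratch")` against an empty account, a few times over, would give a rough range for the rebuild time to quote in the BCP report. The runner and clock are injectable so the helper can be exercised without touching AWS.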

QA Steps

[ ] Deployed against a new AWS account
[ ] Deployed against a new environment in existing AWS account

BCP Scenarios

Set up a new environment in the same region.
Set up a new environment in a new region.
Re-import the database following a hypothetical incident that corrupts the current database.
Manually release GC Notify without the GitHub automation in place.
GitHub is unavailable.

jimleroyer commented 1 year ago

Ben is stuck on a SQLAlchemy error. The team will help in the channel.
We will set a stop line async on the task to know when to stop.

ben851 commented 1 year ago

started deleting my scratch account
ran into issues deleting due to dangling aws system resources - had to open support case w/ AWS
running into dependency issues on deletion

ben851 commented 1 year ago

We do not have permissions to disable EBS encryption (which is probably good) - these are set in the landing zone. I have opened an issue with the landing zone repo to explicitly enforce EBS encryption so we can remove the references from our environment.

ben851 commented 1 year ago

Deleted as much as possible in my scratch account with a combination of terragrunt and aws-nuke.
Created aws-nuke config
Created baseline scratch terragrunt config based off of staging

ben851 commented 1 year ago

Imported notify internal sqs queue into staging environment. Need to do production before merging to main.

ben851 commented 1 year ago

Upping the estimate due to increased complexity in ensuring we do not bring down production or staging, and additional external dependencies on the SRE team.

ben851 commented 1 year ago

Imported notify internal sqs to prod
Removed EBS encryption from the tfstates in staging and prod; created a PR for release to prod tomorrow AM.
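
For reference, the two state operations mentioned above (importing an existing queue, and dropping a resource from state without destroying it) can be sketched as command builders that are printed for review before anything is run. Every resource address, the queue URL, and the helper names here are hypothetical placeholders, not the values from the Notify repository; terragrunt is assumed to pass `import` and `state rm` through to terraform.

```python
import shlex
from typing import List

# Hypothetical addresses and queue URL, for illustration only.
SQS_ADDRESS = "aws_sqs_queue.notify_internal"
SQS_QUEUE_URL = "https://sqs.ca-central-1.amazonaws.com/123456789012/notify-internal"
EBS_ADDRESS = "aws_ebs_encryption_by_default.default"


def import_cmd(address: str, resource_id: str) -> List[str]:
    """Adopt an existing resource into state (an SQS queue is imported by its URL)."""
    return ["terragrunt", "import", address, resource_id]


def state_rm_cmd(address: str) -> List[str]:
    """Forget a resource in state; the real resource is left untouched."""
    return ["terragrunt", "state", "rm", address]


def render(commands: List[List[str]]) -> str:
    """Render commands as shell-quoted lines for review before running them."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


print(render([import_cmd(SQS_ADDRESS, SQS_QUEUE_URL), state_rm_cmd(EBS_ADDRESS)]))
```

Rendering the commands first, rather than executing them directly, fits the concern about not bringing down staging or production: the list can be reviewed per environment before it is run from the matching terragrunt directory.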

jimleroyer commented 1 year ago

Currently need to experiment in the staging environment. Need an ack from the other developers to freeze the TF repository for approximately 4h.

ben851 commented 1 year ago

Spoke with Pat re: satellite S3 buckets. These are managed by another TF repository, so I'm going to look into refactoring the code to accommodate this rather than importing these resources.
Upgraded AWS provider to 4.0 in staging and production

cds-snc / notification-planning-core

BCP: Document Scenarios and Remediation Steps for BCP #27
