cds-snc / notification-planning-core

Project planning for GC Notify Core Team

BCP: Step By Step Document For Building New Environment in same region #58

Open ben851 opened 1 year ago

ben851 commented 1 year ago

Description

As a developer of Notify, I would like to have a document available that walks through the step by step process for deploying a new environment so that I can easily rebuild an environment with little knowledge of the underlying system.

This can be the start of the BCP recovery scenario document.

WHY are we building?

Building new environments from scratch is not only helpful from a BCP perspective, but also when doing major updates to infrastructure.

This is also a good exercise for new team members of Notify Core to learn the infrastructure.

It is important to have instructions on how to recover the environment in the event that an AWS region goes down.

WHAT are we building?

A parent document describing several scenarios, linking to step by step procedures for recovery.

VALUE created by our solution

Acceptance Criteria

QA Steps

P0NDER0SA commented 7 months ago

starting to put eyes on this card and absorb the details...

sastels commented 7 months ago

Ben will find / show Pond the docs we have so far.

P0NDER0SA commented 7 months ago

Ben and Pond will definitely try to get the ball rolling on this ASAP

ben851 commented 7 months ago

Ben's old doc: https://docs.google.com/document/d/1BpE5ghvIVPKt43UojDtzd7zrzw8EacF6iosNI_EZ7qU/edit

Ben's new doc: https://docs.google.com/document/d/1hEKUl7uH0Ksr_1Qk3tv8O43gNAe14QQVREevrQLodS4/edit

P0NDER0SA commented 7 months ago

Chewed through the first 3 or 4 steps and fixed some issues. Opened a big PR (because files were moved/recreated). Will need to attempt a merge in the morning to continue.

ben851 commented 7 months ago

Ran into an issue with api gateway logging, this looks to be the solution https://stackoverflow.com/questions/52156285/terraform-how-to-enable-api-gateway-execution-logging
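For context on the fix the Stack Overflow thread describes: API Gateway execution logging needs an account-level CloudWatch Logs role, set once per account/region, before per-stage logging settings will apply. A minimal hedged sketch (resource and role names here are illustrative, not Notify's actual Terraform):

```hcl
# Sketch: account-level role that API Gateway assumes to push execution
# logs to CloudWatch. Names are hypothetical.
resource "aws_iam_role" "apigw_cloudwatch" {
  name = "api-gateway-cloudwatch-logs"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "apigateway.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "apigw_cloudwatch" {
  role       = aws_iam_role.apigw_cloudwatch.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs"
}

# Without this account-level setting, enabling stage logging fails with
# an error about the CloudWatch Logs role ARN not being set.
resource "aws_api_gateway_account" "this" {
  cloudwatch_role_arn = aws_iam_role.apigw_cloudwatch.arn
}
```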

P0NDER0SA commented 7 months ago

Made lots of progress, but hit an issue with the API Gateway logging, which requires unusual IAM configuration. Investigating...

jimleroyer commented 7 months ago

We got past the IAM issues, now we are facing an issue with QuickSight, related to the AWS account relationship with it.

jimleroyer commented 7 months ago

Pond is setting up the environment in his own scratch account. The current BCP goal is to deploy into a new AWS account in the same region. Everything 💥 at about 90% of the way through terraform apply. The scratch environment was reset and needed to be rebuilt. Updated our own nuke configuration; about to open a new PR for it. The AWS nuke script does not work as expected. We need to provide our nuke config to the SRE team, or they could at least merge some of our changes.

ben851 commented 7 months ago

@P0NDER0SA is on the last step for terraform (system status static website).

Also running into issues with Quicksight, that we skipped, and will circle back with @sastels.

ben851 commented 7 months ago

Skipping QuickSight for now because there are additional issues (authenticating with the database).

Ran into an issue installing CRDs with nginx

We are now working on the helm deployments for the manifests repo, and then will hit Kustomize.

ben851 commented 7 months ago

Continue to work on helm as part of the helm working sessions.

sastels commented 6 months ago

Will discuss next steps when Pond is back.

sastels commented 6 months ago

Debugging nginx right now.

sastels commented 6 months ago

Progress being made! nginx fixed and deploying (dev). Had to rewrite the priority classes and add some imports.

P0NDER0SA commented 6 months ago

This afternoon we have all of the helmfile stuff applied, but we do have to circle back and investigate an issue with the ingress. We also deployed all of the kustomize stuff, but the app doesn't build because of the environment and because the DB is empty. Going back to work on the DB, then we will resume kustomize.

sastels commented 6 months ago

Attempted to migrate the DB in a couple of ways (AWS snapshot failed due to encryption; pg_dump also failed, timed out) :/

jimleroyer commented 6 months ago

Need to fix the database migration timeout. The pg_dump specifically is timing out.

P0NDER0SA commented 6 months ago

Database migration ended up working, but with caveats. The import wasn't quite right, so we have to do that part again. Had to debug and update some variables to get all of the pods working.

All pods were running and the application was working, but some errors in the DB kept it from being 100%. We will finish the DB migration correctly today, and then we are done with our first pass. After that, back to step 1: nuke the environment and run it again, making sure any code and documentation updates are made along the way.

jimleroyer commented 6 months ago

We will also need to do extra passes to recreate the environment to make sure we covered all the steps!

jimleroyer commented 6 months ago

The new issue is with the DNS setup: the scripts that handled it were outdated and not working. We will test this part again today to see what breaks next.

P0NDER0SA commented 6 months ago

We are almost done with the first pass-through, but having issues with SPICE in QuickSight. Looking for some help from Steve, and then we can tear everything down and start again.

P0NDER0SA commented 6 months ago

Nuke scripts have been updated and are ready to try.

P0NDER0SA commented 6 months ago

We have run into issues with AWS deleting certain resources, and there are some QuickSight issues that can't be resolved in code, so we will have to look at fixing those during our second pass. Attempting an environment nuke today.

P0NDER0SA commented 6 months ago

Starting pass 2 on the new sandbox account; we will be opening multiple PRs to merge any code changes that are necessary.

ben851 commented 6 months ago

We had to request a quota increase on Elastic IPs on Thursday.

ben851 commented 6 months ago

Requesting quota increases significantly increases time to recover (TTR). We should request a second AWS account, "Prod Standby", that has the out-of-band processes completed but no infrastructure deployed to it.
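One way to shave TTR here would be codifying the quota request itself, so a standby account can be pre-warmed in code. A hedged sketch, assuming the Terraform AWS provider's Service Quotas resource; the quota code below is the standard "EC2-VPC Elastic IPs" quota, and the value is purely illustrative:

```hcl
# Sketch: request the Elastic IP quota increase as code. Applying this
# opens a Service Quotas increase request, which AWS may take time
# (or a support case) to approve. The value 20 is illustrative.
resource "aws_servicequotas_service_quota" "elastic_ips" {
  service_code = "ec2"
  quota_code   = "L-0263D0A3" # EC2-VPC Elastic IPs
  value        = 20
}
```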

P0NDER0SA commented 6 months ago

https://github.com/cds-snc/notification-terraform/pull/1300
https://github.com/cds-snc/notification-terraform/pull/1301
https://github.com/cds-snc/notification-terraform/pull/1302
https://github.com/cds-snc/notification-terraform/pull/1303

P0NDER0SA commented 6 months ago

Multiple PRs opened for the round 2 TF portion. DB migration is next.

P0NDER0SA commented 6 months ago

PR for the Helmfile work opened and merged successfully.

P0NDER0SA commented 6 months ago

PR for Kustomize Work https://github.com/cds-snc/notification-manifests/pull/2594

ben851 commented 6 months ago

The Terraform/Helmfile work was completed and went very smoothly, with a few minor tweaks. Kustomize was implemented as well.

We need to debug Karpenter in sandbox, because the scalable pods were sitting in Pending. That will be the last of it, and then we delete it and start again.

ben851 commented 6 months ago

Karpenter is working.

Debugged the .env diff on the PRs and set up the makefile for sandbox

Kustomize all merged in and working.

~Forgotten password email didn't work, but we could log in to Notify.~ Nope it was good!

sastels commented 5 months ago

Finished second pass on sandbox :tada: Next will nuke dev and run it there.

P0NDER0SA commented 5 months ago

Destroyed sandbox with TF! Developed the GHA to destroy dev with TF; debugging and updating it. Going to create the card for Environment Recovery (and Destroy) Automation. Did another pg_dump of the staging DB to import into dev. (Future intention: run and document the scenario of recovering the database from a snapshot as well.)

sastels commented 5 months ago

Things are going well. Working on dev create / delete automation.

jimleroyer commented 5 months ago

Re-running the scenario using the database snapshot rather than using the pgdump. This would cover a scenario of restoring the database using a snapshot.
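In Terraform terms, the snapshot-based restore usually comes down to pointing the cluster at a snapshot identifier at creation time. A minimal sketch, assuming an Aurora PostgreSQL cluster; the resource name, identifier, and variable are hypothetical, not Notify's actual configuration:

```hcl
# Sketch: create the cluster from an RDS snapshot instead of importing
# a pg_dump. snapshot_identifier is only honoured when the cluster is
# first created. All names here are illustrative.
variable "restore_snapshot_id" {
  description = "RDS cluster snapshot to restore from; null for a fresh cluster"
  type        = string
  default     = null
}

resource "aws_rds_cluster" "notify" {
  cluster_identifier  = "notification-database"
  engine              = "aurora-postgresql"
  snapshot_identifier = var.restore_snapshot_id
  # ... remaining cluster configuration unchanged ...
}
```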

P0NDER0SA commented 5 months ago

All Kustomize and Helm deployment steps are completed (including configuration changes required by Kustomize). https://github.com/cds-snc/notification-manifests/pull/2629

Looking at finalizing the AWS steps and doing a sanity check to ensure things are working. Then I want to do a snapshot backup and restoration.

jimleroyer commented 5 months ago

App is deployed in dev but there are extra manual steps to be done.

P0NDER0SA commented 5 months ago

Dev is working! Moving on to DB Snapshots

P0NDER0SA commented 5 months ago

still hoping to get to the snapshots

sastels commented 5 months ago
ben851 commented 5 months ago

Ben will look at destruction scripts

ben851 commented 5 months ago

Did not get to this yesterday, will try and get to it today.

ben851 commented 5 months ago

Stuck working on K8s 1.30 chore and ADR for CICD - did not get to this today.

ben851 commented 5 months ago

Before starting on create/destroy scripts, I would like to get the last PR merged so that I can work off of main:

github.com/cds-snc/notification-terraform/pull/1334

ben851 commented 5 months ago

Had some issues with the migration of system status static site. I've created a new PR to merge just that part

https://github.com/cds-snc/notification-terraform/pull/1361

But I will have to think about the best way to do this, since it would cause the prod resources to be deleted as well.

sastels commented 5 months ago

Mostly merged, except for the system status static site, which needs a script.

ben851 commented 4 months ago

I've verified that there will be approximately 10-15 minutes of downtime for the system status site during the recreation. We are notifying clients and planning to deploy this next Wednesday.

Between now and then, I will be doing one more test run in staging and writing up documentation in the notification-attic repo

ben851 commented 4 months ago

Will try and do this one more time in staging today.