ben851 opened this issue 1 year ago
starting to put eyes on this card and absorb the details...
Ben will find / show Pond the docs we have so far.
Ben and Pond will definitely try to get the ball rolling on this ASAP
Chewed through the first 3 or 4 steps and fixed some issues. Opened a big PR (because files were moved/recreated). Will need to attempt a merge in the morning to continue.
Ran into an issue with API Gateway logging; this looks to be the solution: https://stackoverflow.com/questions/52156285/terraform-how-to-enable-api-gateway-execution-logging
made lots of progress but hit an issue with API Gateway logging, which requires an unusual IAM setup. Investigating...
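For reference, the fix from that Stack Overflow thread boils down to giving API Gateway an account-level CloudWatch Logs role. We're doing it in Terraform, but the equivalent AWS CLI steps look roughly like this (role name and account ID are placeholders):

```bash
# Sketch of the account-level CloudWatch role API Gateway needs for execution logging.
# Role name and account ID are placeholders; the Terraform does the same thing declaratively.
aws iam create-role \
  --role-name apigw-cloudwatch-logs \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "apigateway.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'

aws iam attach-role-policy \
  --role-name apigw-cloudwatch-logs \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs

# The "weird" part: the role is set on the API Gateway account settings, not per-API.
aws apigateway update-account \
  --patch-operations op=replace,path=/cloudwatchRoleArn,value=arn:aws:iam::123456789012:role/apigw-cloudwatch-logs
```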
We got past the IAM issues; now we are facing an issue with QuickSight, related to its relationship with the AWS account.
Pond is setting up the environment in his own scratch account. The current BCP goal is to deploy into a new AWS account within the same organization. Everything 💥 at about 90% of the way through the TF apply. The scratch environment was reset and needed to be rebuilt. Updated our own nuke config; about to open a new PR for it. The AWS nuke script does not work as expected. We need to provide our nuke config to the SRE team, or at least have them merge some of our changes.
@P0NDER0SA is on the last step for terraform (system status static website).
Also running into issues with QuickSight, which we skipped for now; we will circle back with @sastels.
Skipping QuickSight for now because there are additional issues (authenticating with the database).
Ran into an issue installing CRDs with nginx
We are now working on the helm deployments for the manifests repo, and then will hit Kustomize.
Continue to work on helm as part of the helm working sessions.
Will discuss next steps when Pond is back.
Debugging nginx right now.
Progress being made! nginx fixed and deploying (dev). Had to rewrite the priority classes and add some imports.
This afternoon we have all of the helmfile stuff applied, but we do have to circle back and investigate an issue with the ingress. We also deployed all of the kustomize stuff, but the app doesn't build because of the environment and because the DB is empty. Going back to work on the DB, then we will resume kustomize.
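For the record, the apply steps were roughly the following; environment name, overlay path, and namespace are approximate rather than the exact layout of notification-manifests:

```bash
# Approximate deploy steps (env name, overlay path and namespace are placeholders).
helmfile --environment dev apply              # helm releases: ingress, priority classes, etc.
kubectl apply -k overlays/dev                 # kustomize manifests for the app itself
kubectl get pods -A | grep -v Running         # quick check for anything not coming up
```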
Attempted to migrate the db in a couple of ways (AWS snapshot failed due to encryption, pg_dump also failed, timed out) :/
Need to fix the database migration timeout; the pg_dump itself is timing out.
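To make the retry easier to reproduce, here is a sketch of the dump/restore we are attempting; hostnames, user, and database name are placeholders, and this only helps if the timeout is server-side (a network or idle timeout would need a different fix):

```bash
# Placeholder hosts/credentials. PGOPTIONS disables the server-side statement timeout
# so long-running statements don't get cut off mid-dump/restore.
export PGOPTIONS="-c statement_timeout=0"

pg_dump --format=custom --no-owner \
  --host=staging-db.internal --username=notify --dbname=notification_db \
  --file=notify-staging.dump

pg_restore --clean --if-exists --no-owner \
  --host=dev-db.internal --username=notify --dbname=notification_db \
  notify-staging.dump
```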
Database migration ended up working -- but with caveats. The import wasn't quite right -- have to do that part again. Had to debug and update some variables to get all of the pods working.
All pods were running, application was working, but some errors in the DB kept it from being 100%. We will finish the DB migration correctly today and then we are done our first pass. After that, back to step 1 -- nuke the environment and run it again while making sure any code updates and documentation updates are made while doing it.
We will also need to do extra passes to recreate the environment to make sure we covered all the steps!
The new issue is with the DNS setup: the scripts that handled it were outdated and not working. We will test this part again today to see what breaks next.
We are almost done the first pass-through but having issues with SPICE on QuickSight. Looking for some help from Steve, and then we can tear everything down and start again.
nuke scripts have been updated and are ready to try.
We have run into issues with AWS deleting certain resources, and there are some QuickSight issues that can't be resolved using code, so we will have to look at fixing those during our second pass. Attempting an environment nuke today.
starting pass 2 on the new sandbox account and we will be opening multiple PRs to merge any code changes that are necessary.
We had to request a quota increase on elastic IPs on Thursday
Requesting quota increases significantly increases time to recovery (TTR) - we should request a second AWS account, "Prod Standby", that has the out-of-band processes completed but no infrastructure deployed to it.
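The quota request itself can at least be scripted so it fires as step zero of a rebuild instead of blocking partway through the apply. A sketch with the AWS CLI; I believe L-0263D0A3 is the "EC2-VPC Elastic IPs" quota code, and the desired value is just an example:

```bash
# Request the Elastic IP quota increase up front (quota code and desired value are examples).
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-0263D0A3 \
  --desired-value 10

# Check on pending requests
aws service-quotas list-requested-service-quota-change-history --service-code ec2
```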
multiple PRs opened for the round 2 TF portion. DB Migration is next
PR for Helmfile work opened and merged successfully.
PR for Kustomize Work https://github.com/cds-snc/notification-manifests/pull/2594
Terraform/Helmfile work was completed and went very smoothly with a few minor tweaks. Kustomize was implemented as well.
We need to debug Karpenter in sandbox, because the scalable pods were stuck in Pending. That will be the last of it, and then we delete it and start again.
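Rough debugging path for the pending pods; the namespace and deployment names assume a standard Karpenter install, so adjust for however our helmfile deploys it:

```bash
# Find the stuck pods and read the scheduler events explaining why they can't be placed.
kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <namespace>     # check the Events section

# See what Karpenter itself is doing (names assume the default karpenter namespace/deployment).
kubectl logs -n karpenter deployment/karpenter | grep -iE 'error|fail'
kubectl get nodepools,nodeclaims    # v1 CRDs; older Karpenter releases use provisioners/machines
```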
Karpenter is working.
Debugged the .env diff on the PRs and set up the makefile for sandbox
Kustomize all merged in and working.
~Forgotten password email didn't work, but we could log in to Notify.~ Nope, it was good!
Finished second pass on sandbox :tada: Next will nuke dev and run it there.
Destroyed Sandbox with TF! Developed the GHA to destroy Dev with TF -- debugging and updating it. Going to create the card for Environment Recovery (and Destroy) Automation. Did another pg_dump of the staging DB to import into dev (future intention: run and document the scenario of recovering the database from a snapshot as well).
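The GHA is essentially wrapping the manual destroy, which per module looks roughly like this; the module path and var file below are illustrative, not the actual layout of notification-terraform:

```bash
# Illustrative per-module destroy; the real module list and ordering live in notification-terraform.
cd aws/eks                                   # example module path
terraform init
terraform plan -destroy -var-file=dev.tfvars -out=destroy.tfplan
terraform apply destroy.tfplan
```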
Things are going well. Working on dev create / delete automation.
Re-running the scenario using a database snapshot rather than pg_dump. This covers the scenario of restoring the database from a snapshot.
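A sketch of the snapshot path, assuming an Aurora PostgreSQL cluster; identifiers, region, and the KMS key are placeholders, and if the snapshot is encrypted with another account's key it has to be shared and copied with a key the target account can use (which is what tripped us up earlier):

```bash
# Placeholder identifiers. Copy the shared snapshot under a key the target account owns,
# restore a new cluster from it, then add an instance to the restored cluster.
aws rds copy-db-cluster-snapshot \
  --source-db-cluster-snapshot-identifier arn:aws:rds:ca-central-1:111111111111:cluster-snapshot:staging-snap \
  --target-db-cluster-snapshot-identifier staging-snap-copy \
  --kms-key-id alias/dev-rds-key

aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier notify-dev-restored \
  --snapshot-identifier staging-snap-copy \
  --engine aurora-postgresql

aws rds create-db-instance \
  --db-instance-identifier notify-dev-restored-1 \
  --db-cluster-identifier notify-dev-restored \
  --db-instance-class db.r6g.large \
  --engine aurora-postgresql
```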
All kustomize and helm deployment steps are completed (including configuration changes required by kustomize). https://github.com/cds-snc/notification-manifests/pull/2629
Looking at finalizing AWS steps and doing a sanity check to ensure things are working. Then I want to do a snapshot backup and restoration.
App is deployed in dev but there are extra manual steps to be done.
Dev is working! Moving on to DB Snapshots
still hoping to get to the snapshots
Ben will look at destruction scripts
Did not get to this yesterday, will try and get to it today.
Stuck working on the K8s 1.30 chore and the ADR for CI/CD - did not get to this today.
Before starting on create/destroy scripts, I would like to get the last PR merged so that I can work off of main:
Had some issues with the migration of the system status static site. I've created a new PR to merge just that part:
https://github.com/cds-snc/notification-terraform/pull/1361
But I will have to do some thinking on the best way to do this, since it will cause the prod resources to be deleted as well.
mostly merged except for the system status static site, which needs a script.
I've verified that there will be approximately 10-15 minutes of downtime for the system status site during the recreation. We are notifying clients and planning to deploy this next Wednesday.
Between now and then, I will be doing one more test run in staging and writing up documentation in the notification-attic repo
Will try and do this one more time in staging today.
Description
As a developer of Notify, I would like to have a document available that walks through the step by step process for deploying a new environment so that I can easily rebuild an environment with little knowledge of the underlying system.
This can be the start of the BCP recovery scenario document.
WHY are we building?
Building new environments from scratch is not only helpful from a BCP perspective, but also when doing major updates to infrastructure.
This is also a good exercise for new team members of Notify Core to learn the infrastructure.
It is important to have instructions on how to recover the environment in the event that an AWS region goes down.
WHAT are we building?
A parent document describing several scenarios, linking to step by step procedures for recovery.
VALUE created by our solution
Acceptance Criteria
- `dev` environment (in progress)

QA Steps