ben851 opened 11 months ago
Before applying the migration in staging, Ben wants Steve to be able to run through the whole procedure in the dev environment to make sure everything is scripted and documented.
Steve and Ben tested the blue/green deployment in the dev environment yesterday. They will do it again today with more fixes and automation, this time on Steve's setup to make sure it all works and not only on Ben's machine.
Performed the upgrade in the dev environment and it worked. Will run through dev again to make sure the latest changes to the scripts and documentation work.
Need another e2e dev run-through with soak on, then we will be able to proceed to staging.
After rollercoaster testing and bug bashing in staging:
Dev run-through went well. A couple of 502s that we weren't sure were related. Scripts worked fine. Will rerun today.
Will do another full run-through on dev today.
Added rows to the dev database to make it the same size as prod (126M rows), then ran through the migration. Results are in the table below; hedged command sketches follow it.
| Step | Time | Downtime | Lost notifications | Other issues |
|---|---|---|---|---|
| 1. Remove RDS proxy | 21 min | none | none | none |
| 2. Create blue/green | 38 min | none | none | one API gateway timeout |
| 3. Switch from blue to green | 2 min | 14 sec | none | 15 notifications stuck in "created" |
| 4. Remove blue/green | 12 min | none | none | none |
| 5. Restore RDS proxy | 25 min | none | none | none |
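As referenced above, a minimal sketch of how dev could be padded toward prod size, assuming a hypothetical `notifications` table (the real schema has more required columns), inserting in 1M-row batches:

```sh
# Hypothetical sketch only: pad a dev table toward the ~126M-row prod size in batches.
# Table/column names are illustrative; gen_random_uuid() needs pgcrypto on PostgreSQL 11.
for i in $(seq 1 126); do
  psql "$DEV_DATABASE_URL" -c "
    INSERT INTO notifications (id, notification_status, created_at)
    SELECT gen_random_uuid(), 'delivered', now()
    FROM generate_series(1, 1000000);"
done
```

The blue/green steps in the table roughly map to the RDS Blue/Green Deployment CLI; a hedged sketch with placeholder identifiers and ARNs (the real runbook uses our own scripts):

```sh
# Sketch only: approximate CLI equivalents of steps 2-4 in the table above.

# 2. Create the blue/green deployment targeting PostgreSQL 15
aws rds create-blue-green-deployment \
  --blue-green-deployment-name notify-pg15-upgrade \
  --source arn:aws:rds:ca-central-1:123456789012:db:notify-dev \
  --target-engine-version 15.5

# 3. Switch over from blue to green (the ~14 s of downtime happens here)
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier <bgd-id> \
  --switchover-timeout 300

# 4. Delete the blue/green deployment once the switchover is verified
aws rds delete-blue-green-deployment \
  --blue-green-deployment-identifier <bgd-id>
```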
Item 3 in the table above will need some discussion: some notifications created during the 14 seconds of downtime will only be sent 4 hours 15 minutes later if we do not take manual steps to unblock them.
To deal with the stuck "created" notifications we will stop the beat worker during the switchover. We've changed the switchover script accordingly and will (hopefully) have the dev database restored to dev-sized 11.21 by tomorrow so we can test this approach.
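For reference, a hedged way to check whether any notifications are still stuck in "created" after a test switchover; the table and column names are assumptions about the schema rather than the real one:

```sh
# Assumed table/column names; counts notifications still in "created" well after the switch
psql "$DEV_DATABASE_URL" -c "
  SELECT count(*)
  FROM notifications
  WHERE notification_status = 'created'
    AND created_at < now() - interval '15 minutes';"
```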
Last test in dev today, aiming to start staging testing tomorrow. In staging, we will test for ~1 week.
The scale-down of the beat worker pod took longer than the switchover, so it didn't actually stop before the switch and we again had stuck "created" notifications. Resetting the database to 11.21 to try a different approach tomorrow.
Will also migrate staging tomorrow to allow us to start testing PostgreSQL 15.
Ben thinks something like this might bring down the beat worker faster:
```sh
# Scale the beat worker deployment to zero so no replacement pod is scheduled
kubectl scale --replicas 0 deployment/celery-beat -n notification-canada-ca
# Force-delete the running beat pod immediately instead of waiting for graceful termination
kubectl delete pod $(kubectl get pods -n notification-canada-ca | grep celery-beat | awk '{print $1}') --force --grace-period 0 -n notification-canada-ca
```
We shall test this morning.
Ben's suggestion insta-killed the beat worker, so we should be able to do this in the switchover without adding more downtime (i.e. even if we add a 2-minute sleep).
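A rough sketch of how the kill could be sequenced in the switchover script; the ordering, sleep duration, and identifiers are assumptions, not the final script:

```sh
# Assumed ordering for the switchover script (sketch, not the final version)

# 1. Stop beat so no new notifications are scheduled during the switch
kubectl scale --replicas 0 deployment/celery-beat -n notification-canada-ca
kubectl delete pod $(kubectl get pods -n notification-canada-ca | grep celery-beat | awk '{print $1}') \
  --force --grace-period 0 -n notification-canada-ca

# 2. Optional safety margin before switching; does not add user-facing downtime
sleep 120

# 3. Perform the blue/green switchover (placeholder identifier)
aws rds switchover-blue-green-deployment --blue-green-deployment-identifier <bgd-id>

# 4. Bring beat back once the green environment is serving traffic
kubectl scale --replicas 1 deployment/celery-beat -n notification-canada-ca
```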
Started on staging with Step 1 (removing the RDS proxy).
We took a snapshot of staging before starting. We could revert to this snapshot and check that we can now run through the first step without issues. Will discuss tomorrow...
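For the record, a hedged sketch of the snapshot and restore commands, assuming a non-Aurora RDS instance (for an Aurora cluster the db-cluster equivalents apply); identifiers are placeholders:

```sh
# Take a manual snapshot of staging before the risky step (placeholder identifiers)
aws rds create-db-snapshot \
  --db-instance-identifier notify-staging \
  --db-snapshot-identifier notify-staging-pre-proxy-removal

# Wait until the snapshot is available
aws rds wait db-snapshot-available \
  --db-snapshot-identifier notify-staging-pre-proxy-removal

# If needed, restore into a new instance from that snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier notify-staging-restored \
  --db-snapshot-identifier notify-staging-pre-proxy-removal
```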
Overall I think it's fair to say that staging survived the :roller_coaster: test without the proxy in place :tada:
(Screenshots: System and Database dashboards during the test.)
Ben and Steve to re-run this today in staging!
Upgraded staging to 15.5. Had a few issues along the way, mainly:
A few other changes to the scripts and branches were needed; we made corresponding changes for prod. As expected, we also needed a Terraform PR to set the parameter group for staging to 15 rather than 11, and we'll need a corresponding PR after the prod release. As noted yesterday, we'll also have to make sure that the prod tfvars in 1Password are correct.
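A quick hedged check that staging actually picked up the new engine version and parameter group after the Terraform change (the instance identifier is a placeholder):

```sh
# Placeholder identifier; prints the engine version and the attached parameter group
aws rds describe-db-instances \
  --db-instance-identifier notify-staging \
  --query 'DBInstances[0].[EngineVersion, DBParameterGroups[0].DBParameterGroupName]'
```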
We'll let the changes soak in staging to see how it goes. We will organize a bug bash this week to test the features all around.
Before the prod upgrade we need to:
Had to add 4 vars to the prod tfvars file (added to 1Password as well). With these added, terragrunt plan (on main) reports "no changes" for rds, lambda-api, and database-tools.
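A hedged sketch of repeating that "no changes" check per module; the module paths are assumptions about the repo layout:

```sh
# Assumed module paths; -detailed-exitcode makes terraform exit 0 only when there is no diff
for module in rds lambda-api database-tools; do
  (cd "env/production/$module" && terragrunt plan -detailed-exitcode) \
    || echo "Changes (or an error) detected in $module"
done
```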
Ran a 4-hour load test; results look normal.
@ben851 to update the ADR today
New ADR created specific to this DB Migration https://github.com/cds-snc/notification-adr/blob/1bbb9681c0386b66c3c9e21565e8c9a4b4c26a34/records/2023-12-19.upgrade-database-major-version.md
Jimmy to review
LGTM
ADRs were merged on Monday. 🎉
Description
As an ops lead, I need an up-to-date database, so that I can keep AWS support and maintain the viability of our database in our production application.
WHY are we building?
Our version of PostgreSQL is aging out of support, so we need to upgrade.
WHAT are we building?
An upgrade of the GCNotify database from PostgreSQL 11.21 to 15.5, rolled out through dev, staging, and prod using an RDS blue/green deployment with scripted and documented steps.
VALUE created by our solution
GCNotify is up to date, faster, more secure, and its database remains supported by AWS.
Acceptance Criteria
QA Steps
Additional Information