cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Develop migration plan for PostgreSQL 11 to 15 #238

Open ben851 opened 11 months ago

ben851 commented 11 months ago

Description

As an ops lead, I need an up-to-date database so that I can keep AWS support and maintain the viability of our database in our production application.

WHY are we building?

PostgreSQL 11 is aging out of support; we need to upgrade.

WHAT are we building?

VALUE created by our solution

GC Notify is up to date, faster, more secure, and its database remains fully supported.

Acceptance Criteria

QA Steps

Additional Information

jimleroyer commented 10 months ago

Before applying the migration in staging, Ben wants Steve to be able to run through the whole process in the dev environment to make sure everything is scripted and documented.

jimleroyer commented 10 months ago

Steve and Ben tested the blue/green deployment in the dev environment yesterday. They will do it again today with more fixes and automation, this time on Steve's setup to make sure it all works on more than just Ben's machine.

jimleroyer commented 10 months ago

Performed the upgrade in the dev environment and it worked. Will run through dev again to make sure that the latest changes to the scripts and documentation work.

sastels commented 9 months ago

Need another e2e dev run-through with soak on, then will be able to proceed with

After rollercoaster testing and bug bashing in staging:

sastels commented 9 months ago

The dev run-through went well. There were a couple of 502s that we weren't sure were related. The scripts worked fine. Will rerun today.

sastels commented 9 months ago

Will do another full run-through on dev today.

sastels commented 9 months ago

Added rows to the dev database to make it the same size as prod (126M rows), then ran through the migration:

| Step | Time | Downtime | Lost Notifications | Other Issues |
| --- | --- | --- | --- | --- |
| 1. Remove RDS proxy | 21 min | none | none | none |
| 2. Create blue/green | 38 min | none | none | one API gateway timeout |
| 3. Switch from blue to green | 2 min | 14 sec | none | 15 notifications stuck in "created" |
| 4. Remove blue/green | 12 min | none | none | none |
| 5. Restore RDS proxy | 25 min | none | none | none |

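
For reference, a rough sketch of how steps 2–4 can be driven from the AWS CLI using RDS blue/green deployments. This is not necessarily how our scripts do it; the deployment name, source ARN, engine version, and parameter group below are placeholders.

```sh
# Placeholder identifiers throughout -- adjust to the real instance/cluster.

# 2. Create the blue/green deployment (green gets the new engine version).
aws rds create-blue-green-deployment \
  --blue-green-deployment-name notify-pg15-upgrade \
  --source arn:aws:rds:ca-central-1:123456789012:db:notification-database \
  --target-engine-version 15.5 \
  --target-db-parameter-group-name notification-postgres15

# Poll until the green environment is available before switching over.
aws rds describe-blue-green-deployments \
  --filters Name=blue-green-deployment-name,Values=notify-pg15-upgrade

# 3. Switch over from blue to green (this is where the brief downtime happens).
aws rds switchover-blue-green-deployment \
  --blue-green-deployment-identifier <deployment-id> \
  --switchover-timeout 300

# 4. Remove the blue/green deployment once the switchover has completed
#    (the old blue instance is cleaned up separately).
aws rds delete-blue-green-deployment \
  --blue-green-deployment-identifier <deployment-id>
```
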
jimleroyer commented 9 months ago

Item #3 from the last comment will need some discussion, as notifications created during the 14-second downtime window will only be sent after 4 hours 15 minutes unless we take manual steps to unblock them.
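
As context for that discussion, a hedged sketch of how the stuck notifications could be spotted after a switchover. The table and column names (notifications, notification_status, created_at) are assumptions about the schema, not something confirmed in this thread.

```sh
# Assumed schema: a "notifications" table with notification_status / created_at.
psql "$DATABASE_URL" -c "
  SELECT id, created_at
  FROM notifications
  WHERE notification_status = 'created'
    AND created_at < now() - interval '15 minutes'
  ORDER BY created_at;
"
```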

sastels commented 9 months ago

To deal with the stuck "created" notifications, we will stop the beat worker during the switchover. We've changed the switchover script accordingly and will (hopefully) have the dev database restored to dev-sized 11.21 by tomorrow so we can test this approach.

ben851 commented 9 months ago

Last test in dev today, aiming to start staging testing tomorrow. In staging:

Will test in staging for ~1 week.

sastels commented 9 months ago

The scale-down of the beat worker pod took longer than the switchover, so it didn't actually stop before the switch and we again had stuck "created" notifications. Resetting the database to 11.21 to try a different approach tomorrow.

Will also migrate staging tomorrow so we can start testing PostgreSQL 15.

sastels commented 9 months ago

Ben thinks something like this might bring down the beat worker faster:

```sh
# Scale the celery-beat deployment to zero replicas...
kubectl scale --replicas 0 deployment/celery-beat -n notification-canada-ca
# ...then force-delete the remaining beat pod immediately instead of waiting
# for graceful termination.
kubectl delete pod $(kubectl get pods -n notification-canada-ca | grep celery-beat | awk '{print $1}') --force --grace-period 0 -n notification-canada-ca
```

We shall test this morning.

sastels commented 9 months ago

Ben's suggestion insta-killed the beat worker, so we should be able to do this during the switchover without adding more downtime (i.e. even if we added a 2-minute sleep).
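
Putting the pieces together, a hedged sketch of how the beat-worker stop could be sequenced in the switchover script; the wait and scale-up steps (and the app=celery-beat label) are assumptions added here, not the team's confirmed script.

```sh
# 1. Stop the beat worker so no periodic tasks fire during the switch
#    (same commands as above).
kubectl scale --replicas 0 deployment/celery-beat -n notification-canada-ca
kubectl delete pod \
  $(kubectl get pods -n notification-canada-ca | grep celery-beat | awk '{print $1}') \
  --force --grace-period 0 -n notification-canada-ca

# 2. Confirm the pod is actually gone before switching over
#    (label selector is an assumption).
kubectl wait --for=delete pod -l app=celery-beat \
  -n notification-canada-ca --timeout=120s

# 3. Run the blue/green switchover here.

# 4. Bring the beat worker back once the switchover has completed.
kubectl scale --replicas 1 deployment/celery-beat -n notification-canada-ca
```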

Started on staging with Step 1 (removing the RDS proxy).

We took a snapshot of staging before starting. We could revert to this snapshot and check that we can now run through the first step without issues. Will discuss tomorrow...
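
For reference, a hedged sketch of taking (and, if needed, restoring) such a manual snapshot with the AWS CLI. Identifiers are placeholders, and the instance-level commands assume a non-Aurora RDS instance; an Aurora cluster would use the db-cluster snapshot equivalents.

```sh
# Placeholder identifiers; instance-level commands shown.
aws rds create-db-snapshot \
  --db-instance-identifier notification-database-staging \
  --db-snapshot-identifier staging-pre-pg15-upgrade

# To revert: restore into a new instance and repoint the application at it.
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier notification-database-staging-restored \
  --db-snapshot-identifier staging-pre-pg15-upgrade
```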

sastels commented 9 months ago

Overall I think it's fair to say that staging survived the :roller_coaster: test without the proxy in place :tada:

System

*(screenshot of system metrics)*

Database

*(screenshot of database metrics)*

ben851 commented 9 months ago

Ben and Steve to re-run this today in staging!

sastels commented 9 months ago

Upgraded staging to 15.5. Had a few issues along the way, mainly:

A few other changes to the scripts and branches were needed; we made corresponding changes for prod. As expected, we also needed a Terraform PR to set the parameter group for staging to 15 rather than 11, and we'll need a corresponding PR after the prod release. Finally, as noted yesterday, we'll have to make sure that the prod tfvars in 1Password are correct.

jimleroyer commented 9 months ago

We'll let the changes soak in staging to see how it goes. We will organize a bug bash this week to test the features all around.

sastels commented 9 months ago

Before the prod upgrade we need to:

sastels commented 9 months ago

Had to add 4 vars to the prod tfvars file (added to 1Password as well). With these added, terragrunt plan (on main) reports "no changes" for rds, lambda-api, and database-tools.
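
A small sketch of that verification; the module paths below are hypothetical, the point is just that each affected module should plan clean on main once the tfvars are updated.

```sh
# Hypothetical module layout -- each module should report "No changes".
for module in rds lambda-api database-tools; do
  (cd "aws/$module" && terragrunt plan)
done
```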

sastels commented 9 months ago

Ran a 4-hour load test; looks normal.

*(screenshot of load test metrics)*

ben851 commented 9 months ago

@ben851 to update the ADR today

ben851 commented 9 months ago

New ADR created specific to this DB migration: https://github.com/cds-snc/notification-adr/blob/1bbb9681c0386b66c3c9e21565e8c9a4b4c26a34/records/2023-12-19.upgrade-database-major-version.md

ben851 commented 9 months ago

Jimmy to review

sastels commented 8 months ago

LGTM

jimleroyer commented 8 months ago

ADRs were merged on Monday. 🎉