ben851 commented 11 months ago

Description

As a an ops lead, I need an up to date database, So that I can keep AWS support on these And maintain viability of our database in our production application.

WHY are we building?

PostgreSQL is aging out of support, we need to upgrade.

keeping support with AWS
security policies (with regard to patching)
performance improvements with more recent versions

WHAT are we building?

A migration strategy to go from version 11 to version 15 with our AWS RDS Aurora v1 database which will serve us for subsequent migrations.
Add the migration strategy to the ADR
migrate staging to PostgreSQL 15
production upgrade will be handled by another card

VALUE created by our solution

GCNotify is up to date, faster, more secure, and with support still going for its database.

Acceptance Criteria

[x] Migration plan
[x] Migration steps are automated/scripted as much as possible
[x] Detailed timeline of the migration steps, executed in lower environments (dev first, staging second)
[x] staging works the same as prior to the upgrade
[x] ADR Should be updated with the outcome of the solution

QA Steps

[x] Tested against Dev and Staging with detailed timeframes for both.
[x] Automation scripts are executed
[x] Load Testing is done against staging and is within the same results of previous tests
[x] Smoke Test passes
[x] Soak Tests pass
[x] rollercoaster tests are normal

Additional Information

GDS has sent a draft blog post on how they did their database upgrade with only 11 seconds of downtime. We should review this for lessons learned (and possibly stealing they're whole strategy)
Look into Blue/Green deployment

jimleroyer commented 10 months ago

Before applying the migration in staging, Ben wants Steve to be able to execute the whole execution in the dev environment to make sure everything is scripts and documented.

jimleroyer commented 10 months ago

Steve and Ben tested yesterday the blue/green deployment in the dev environment. They will do it again today with more fixes and automation, all happening on Steve' setup to make sure it all works not only with Ben's machine.

jimleroyer commented 10 months ago

Performed the upgrade in dev environment and it worked. Will run through dev again to make sure that latest changes to scripts and documentation worked.

sastels commented 9 months ago

Need another e2e dev run through with soak on, then will be able to proceed with

staging
dev made prod-size and then run through again to get timings

After rollercoater testing and bugbashing staging:

prod!

sastels commented 9 months ago

dev runthough went well. a couple 502s that we weren't sure were related. Scripts worked fine. Will rerun today.

sastels commented 9 months ago

will do another full runthrough on dev today.

sastels commented 9 months ago

added rows to dev database to make the same size as prod (126M rows), then ran through migration

Step	Time	Downtime	Lost Notifications	Other Issues
1. Remove RDS proxy	21 min	none	none	none
2. Create blue/green	38 min	none	none	one API gateway timeout
3. Switch from blue to green	2 min	14 sec	none	15 notifications stuck "created"
4 remove blue/green	12 min	none	none	none
5 restore RDS proxy	25 min	none	none	none

jimleroyer commented 9 months ago

Item #3 from the last comment will need some discussion as some notifications sent during the 14 seconds downtime period will be sent after 4h15 minutes if we do not take manual steps to unblock these.

sastels commented 9 months ago

To deal with the stuck "created" notifications we will stop the beat worker during the switch over. We've changed the switchover script accordingly and will (hopefully) have the dev database restored to dev-sized 11.21 by tomorrow so we can test this approach.

ben851 commented 9 months ago

Last test in dev today, aiming to start staging testing tomorrow. In staging:

Perf tests
Bug Bash

Will test in staging for ~1 week.

sastels commented 9 months ago

the scale down of the beat worker pod took longer than the switchover, so it didn't actually stop before the switch and we again had stuck "created" notifications. Resetting database to 11.21 to try a different approach tomorrow.

Will also migrate staging tomorrow to allow us to start testing PostGreSQL 15.

sastels commented 9 months ago

Ben thinks something like this might bring down the beat worker faster:

kubectl scale --replicas 0 deployment/celery-beat -n notification-canada-ca
kubectl delete pod $(kubectl get pods -n notification-canada-ca | grep celery-beat | awk '{print $1}') --force --grace-period 0 -n notification-canada-ca

We shall test this morning.

sastels commented 9 months ago

Ben's suggestion insta-killed the beat worker, so we should be able to do this in the switch without adding more downtime (ie if we added a 2 minute sleep).

Started on staging with Step 1 (removing the rds proxy)

had a bunch of issues, mainly (all?) due to the tfvars in 1Password being out of date with the system :/
eventually got it done, kicked off rollercoaster tests to guage the impact of doing the preparation steps early (ex: an hour or so before the switch)
running this load test 2pm - 6 pm

We took a snapshot of staging before starting. We could revert to this snapshot and check that we can now run through the first step without issues. Will discuss tomorrow...

sastels commented 9 months ago

sent about 343K emails, 42K (internal #) sms
have 5 "created" emails all created 2:45. This coincided with a 1 minute drop in the # scalable celery pods from 30 at 2:44 to 9 and then back up to 30 at 2:46. There's nothing particularly strange in the database dashboard at that time, so I don't think this was rds related. We might want to investigate if this is something that has randomly happened before. If so we might see a 4hr15min difference in sent_at vs created_at.
There was one warning in #notification-staging-ops at 2:25, priority queue jumped up to a 16 s delay a few times close together.

Overall I think it's fair to say that staging survived the :roller_coaster: test without the proxy in place :tada:

System

Database

ben851 commented 9 months ago

Ben and Steve to re-run this today in staging!

sastels commented 9 months ago

Upgraded staging to 15.5. Had a few issues along the way, mainly:

still getting a few stuck "created" during the switchover
a reboot of the RDS writer instance was required during the blue/green creation. This required about 7 seconds of downtime. We probably will need to do this on prod as well, and can/should do independantly of the upgrade

A few other changes to the scripts / branches were needed - we made corresponding changes for prod. Also, as expected, we needed to do a terraform PR to set the parameter group for staging to 15 rather than 11. We'll need to do a corresponding PR after the prod release. Also, as noted yesterday, we'll have to make sure that the prod tfvars in 1Password are correct.

jimleroyer commented 9 months ago

We'll let the changes soak in staging to see how it goes. We will organize a bug bash this week to test the features all around.

sastels commented 9 months ago

Before the prod upgrade we need to:

[x] verify that our 1Password prod tfvars are correct: verify no terraform changes in plan for
- [x] rds
- [x] lambda
- [x] database-tools
[x] do 4 hours of load testing on staging
[ ] do a bug bash on staging

sastels commented 9 months ago

Had to add 4 vars to the prod tfvars file (added to 1Password as well). With these added, terragrunt plan (on main) reports "no changes" for rds, lambda-api, and database-tools

sastels commented 9 months ago

Ran 4 hour load test, looks normal.

a load balancer 502 and an admin 500 that probably aren't related to the RDS upgrade
overlapped with the daily perf test :joy:

ben851 commented 9 months ago

@ben851 to update the ADR today

ben851 commented 9 months ago

New ADR created specific to this DB Migration https://github.com/cds-snc/notification-adr/blob/1bbb9681c0386b66c3c9e21565e8c9a4b4c26a34/records/2023-12-19.upgrade-database-major-version.md

ben851 commented 9 months ago

Jimmy to review

sastels commented 8 months ago

LGTM

jimleroyer commented 8 months ago

ADRs were merged on Monday. 🎉

cds-snc / notification-planning-core

Develop migration plan for PostgreSQL 11 to 15 #238

Description

WHY are we building?

WHAT are we building?

VALUE created by our solution

Acceptance Criteria

QA Steps

Additional Information