cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Upgrade staging and prod databases to latest 11.x version #240

Open ben851 opened 6 months ago

ben851 commented 6 months ago

Description

As a developer of notify, I would like our database to be up to date so that we can ensure we have the latest security patches and support from AWS.

WHY are we building?

We currently have automatic updates disabled, so we are behind on minor versions of Postgres. We need to upgrade off of 11.17 before January 16th.

WHAT are we building?

Click-ops the upgrade in dev, while running a soak test to see if there is any downtime.

VALUE created by our solution

Increased stability and security. We will retain support from AWS.

Acceptance Criteria

QA Steps

ben851 commented 6 months ago

I've migrated the staging database to dev and have been running tests on how best to upgrade to 11.21 and 15.x

So far, blue/green will not work because it doesn't support RDS Proxy.

All zero-downtime migrations require a later version of Postgres, so we would have to schedule downtime to upgrade to a version that supports zero-downtime upgrades.

I'm going to run a soak test while upgrading to 11.21, then another while upgrading to 15.x, and compare the outage times. We may be better off just doing a one-time "big" upgrade (assuming it tests well in real staging).
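For comparing outage times between the two runs, here is a minimal sketch of how downtime could be computed from soak-test probe results. It assumes the soak test records one timestamped pass/fail sample per request; the `Probe` type and both function names are hypothetical and not taken from our actual tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Probe:
    """One soak-test sample (hypothetical shape)."""
    timestamp: float  # seconds since the test started
    ok: bool          # True if the request succeeded


def outage_windows(probes: Iterable[Probe]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, one per run of consecutive failures.

    An outage starts at the first failed probe and ends at the next
    successful one. If the run is still failing at the end of the data,
    the window is closed at the last probe's timestamp.
    """
    ordered = sorted(probes, key=lambda p: p.timestamp)
    windows: List[Tuple[float, float]] = []
    start = None
    for probe in ordered:
        if not probe.ok and start is None:
            start = probe.timestamp
        elif probe.ok and start is not None:
            windows.append((start, probe.timestamp))
            start = None
    if start is not None:
        windows.append((start, ordered[-1].timestamp))
    return windows


def total_downtime(probes: Iterable[Probe]) -> float:
    """Sum the lengths of all outage windows, in seconds."""
    return sum(end - start for start, end in outage_windows(probes))
```

Running this over the samples from each upgrade gives one number per run to compare. The resolution is limited by the probe interval, so a short outage between two probes would be missed.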

sastels commented 6 months ago

We might want to either

The first option would do a better job of mimicking prod

sastels commented 6 months ago

Note that we could disable RDS Proxy during the upgrade

jimleroyer commented 6 months ago

We should also raise this to AWS SMEs on RDS to have their take on our migration.

ben851 commented 6 months ago

Here are the preliminary results of doing a straight database upgrade.

https://docs.google.com/document/d/1X3ykvlqhdfVniU8LkN9drWar1NFgWGK62xDOniEDeAI/edit

Of particular note, the upgrade from 11 to 15 took 12 minutes of downtime.

Today I'm going to look into removing the proxy in dev

ben851 commented 6 months ago

I managed to do a blue/green switchover in dev using clickops. Originally it didn't look very promising, but then I realized that the initial switchover failed due to timeout. Doing it again with a longer timeout resulted in an upgrade with little downtime.
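The switchover above was done through the console. As a rough sketch of the same step done programmatically, assuming boto3 and an already-created blue/green deployment, the timeout can be passed explicitly. The function name, the 900-second default, and the deployment identifier in the usage note are placeholders, not values from our setup.

```python
from typing import Any


def switch_over(rds_client: Any, deployment_id: str, timeout_seconds: int = 900) -> Any:
    """Trigger a blue/green switchover with an explicit timeout.

    rds_client is expected to be a boto3 RDS client. The RDS API requires
    a SwitchoverTimeout of at least 30 seconds; a switchover that does not
    finish within the timeout is rolled back.
    """
    if timeout_seconds < 30:
        raise ValueError("SwitchoverTimeout must be at least 30 seconds")
    return rds_client.switchover_blue_green_deployment(
        BlueGreenDeploymentIdentifier=deployment_id,
        SwitchoverTimeout=timeout_seconds,
    )
```

Usage would look like `switch_over(boto3.client("rds"), "bgd-example")`, where `bgd-example` is a made-up identifier. Passing the client in keeps the function easy to exercise against a stub before touching a real environment.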

Unfortunately, while AWS supports blue/green with Aurora, the Terraform AWS provider does not: https://github.com/hashicorp/terraform-provider-aws/blob/main/docs/design-decisions/rds-bluegreen-deployments.md

I'm going to look into what we can do about this.

ben851 commented 6 months ago

I've created a set of scripts that will do the database migration. There's still some refining to do after the code freeze.

https://github.com/cds-snc/notification-attic/pull/47

ben851 commented 5 months ago

@sastels to review this PR and proceed with testing in dev.

ben851 commented 5 months ago

@sastels left some suggestions on the PR - I will work on implementing those at some point.

ben851 commented 5 months ago

@sastels will be doing a migration test in dev this morning.

ben851 commented 5 months ago

@sastels and I worked through the first step of the migration yesterday, tracking issues. We're currently debugging why patches don't work on his system.

ben851 commented 5 months ago

We will aim to do the 11.21 upgrade on Monday. We need to add the logical replication parameter to the parameter group before then so that we don't have to restart the database in the future.
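A minimal sketch of the parameter change, assuming boto3, an Aurora cluster parameter group, and that the parameter in question is `rds.logical_replication`. As I understand it, this is a static parameter, so it only takes effect at the next reboot, which is why setting it before the upgrade lets the upgrade's own restart pick it up. The function names and the group name in the usage note are hypothetical.

```python
from typing import Any, Dict, List


def logical_replication_parameters() -> List[Dict[str, str]]:
    """Build the parameter change that turns on logical replication.

    ApplyMethod is "pending-reboot" because static parameters cannot be
    applied immediately.
    """
    return [
        {
            "ParameterName": "rds.logical_replication",
            "ParameterValue": "1",
            "ApplyMethod": "pending-reboot",
        }
    ]


def enable_logical_replication(rds_client: Any, parameter_group_name: str) -> Any:
    """Apply the change to a DB cluster parameter group.

    rds_client is expected to be a boto3 RDS client. For a non-Aurora
    instance, the equivalent call would be modify_db_parameter_group
    with DBParameterGroupName instead.
    """
    return rds_client.modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=parameter_group_name,
        Parameters=logical_replication_parameters(),
    )
```

Usage would look like `enable_logical_replication(boto3.client("rds"), "example-cluster-params")`. In practice we would likely make this change in Terraform instead; the sketch only shows which parameter and apply method are involved.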

ben851 commented 5 months ago

Jimmy to inform notify team

jimleroyer commented 5 months ago

Upgrade happened yesterday evening around 21h26 EST for the minor version upgrade to 11.21 with a downtime of 55 seconds. 👏🎉