cds-snc / notification-planning-core

Project planning for GC Notify Core Team

Upgrade RDS Instance Types #387

Closed ben851 closed 2 days ago

ben851 commented 4 months ago

Description

As an operator of Notify, I would like our QuickSight data to stay up to date, so that end users can leverage the features of QuickSight. PLEASE read the acceptance criteria before doing this work. :)

WHY are we building?

The QuickSight data refresh jobs are failing and crashing the RDS instance that the refresh runs on, because there is too much data for the current instance size.

WHAT are we building?

Increase RDS instance size as a temporary measure to resolve the issue

VALUE created by our solution

Quicksight data works again!

Acceptance Criteria

QA Steps

sastels commented 4 months ago

Manual upgrades on staging while doing a 1 email / second soak test:

Started write upgrade 10:33 staging

One 500 error (API lambda SSL closing error). Locust got: HTTPError('500 Server Error: Internal Server Error for url: /v2/notifications/email')

10 Celery errors related to “SSL connection has been closed unexpectedly” at 10:35 (triggered an alarm :/)

No failed emails

Next modify a reader! 10:41

Done 10:49

No errors

Last one (writer) 10:53

Finished 11:00 No errors
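To make the timings in this log easier to compare, here is a minimal sketch that turns the timestamps above into durations. The helper name `minutes_between` is made up for illustration; the only inputs are the times recorded in this comment. The end time of the first step was not recorded, so only the reader step, the last writer step, and the overall window are computed.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the staging log above.
reader_minutes = minutes_between("10:41", "10:49")
last_writer_minutes = minutes_between("10:53", "11:00")
whole_window_minutes = minutes_between("10:33", "11:00")

print(f"reader step: {reader_minutes} min")
print(f"last (writer) step: {last_writer_minutes} min")
print(f"first start to last finish: {whole_window_minutes} min")
```

With the times logged here, that gives 8 minutes for the reader, 7 minutes for the last writer step, and 27 minutes from first start to last finish.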

TL;DR

sastels commented 4 months ago

Made a PR to update the staging terraform to be xlarge.

With the xlarge instances still in place in AWS, the terraform plan says it won't change anything. So we can presumably upgrade manually and then merge a corresponding terraform PR without anything breaking.

Next I will manually roll back and then merge this PR while a soak test runs, to see if we could just do it all with the PR.

sastels commented 4 months ago

PR to upgrade: https://github.com/cds-snc/notification-terraform/pull/1454

P0NDER0SA commented 4 months ago

Just updated the branch and we will see where this is tomorrow. Jimmy would like to discuss when we can push this to production. Let's discuss this!

P0NDER0SA commented 4 months ago

Steve is gonna do a smoke test on staging today and we will wait until next week to push this to production. Date to be confirmed.

sastels commented 4 months ago

Upgraded on staging via a terraform PR (setting the instance types and apply_immediately=true).

Recommend we do the upgrade with a terraform PR off hours.

sastels commented 3 months ago

Will do it tonight at 8 pm EST

sastels commented 3 months ago

Doing in prod! One reader took 6m 56s, the next 12m 49s, and the writer 19m 54s.

4 SES processing errors and one Pinpoint processing error.
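For planning future off-hours windows, a rough sketch of the total modification time, assuming the three instances were upgraded one after another (the comment does not say so explicitly). The helper names `to_seconds` and `format_duration` are hypothetical; the durations are the ones reported above.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse a duration written like '6m 56s' or '12m49s' into seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s", duration.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return minutes * 60 + seconds


def format_duration(total_seconds: int) -> str:
    """Format a number of seconds as 'Xm Ys'."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m {seconds}s"


# Durations reported for prod: first reader, next instance, writer.
reported = ["6m 56s", "12m49s", "19m54s"]
total_seconds = sum(to_seconds(d) for d in reported)

print(format_duration(total_seconds))  # 39m 39s
```

Under that sequential assumption the instance modifications add up to just under 40 minutes, not counting any gaps between steps.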

sastels commented 3 months ago

One note: our soak test should run at something more like 3 emails / second rather than 1 email / second, to allow us to better anticipate alerts.
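A back-of-the-envelope sketch of what the higher rate means in volume, using the 27-minute staging window (10:33 to 11:00) from the earlier soak test as an assumed test length. The function name `emails_sent` is made up for illustration, and the steady send rate is an idealisation.

```python
def emails_sent(rate_per_second: float, window_minutes: float) -> int:
    """Emails sent at a steady rate over a window of the given length."""
    return round(rate_per_second * window_minutes * 60)


# Assumption: same window as the staging run, first start to last finish.
WINDOW_MINUTES = 27

for rate in (1, 3):
    print(f"{rate} email(s)/second for {WINDOW_MINUTES} min: "
          f"{emails_sent(rate, WINDOW_MINUTES)} emails")
```

That is 1620 emails at 1 / second versus 4860 at 3 / second, so three times as many requests are exposed to each connection drop during the upgrade.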

sastels commented 3 months ago

Notifications dataset successfully refreshed last night :tada:

sastels commented 3 months ago

Free local memory now drops to 33.4G when doing the dataset refresh and does not crash the refresh :tada:
