Closed: ben851 closed this issue 2 days ago
Manual upgrades on staging while doing a 1 email / second soak test:
Started write upgrade at 10:33 on staging
One 500 error (API lambda SSL-closing error); Locust got: HTTPError('500 Server Error: Internal Server Error for url: /v2/notifications/email')
10 "SSL connection has been closed unexpectedly" related celery errors at 10:35 (triggered the alarm :/ )
78b009f6-ed85-4ac3-aaf6-3f56e4b503bc
6147850f-be62-4716-ac17-228eef9aa3fa
No failed emails
Next: modifying a reader at 10:41
Done 10:49
No errors
Last one (the writer) at 10:53
Finished at 11:00, no errors
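The soak test above was driven by Locust; as a rough stand-in, a minimal plain-Python sketch of the same 1 email/second constant pacing might look like this (the host, template id, recipient, and auth scheme are all hypothetical, not taken from the notes):

```python
import json
import time
import urllib.request

# Hypothetical values -- the real staging host, template, and key are not in the notes.
API_URL = "https://api.staging.example.com/v2/notifications/email"
API_KEY = "REPLACE_ME"

def next_send_time(last_send: float, rate_per_s: float) -> float:
    """Timestamp for the next request under constant pacing of rate_per_s req/s."""
    return last_send + 1.0 / rate_per_s

def send_email(template_id: str, to: str) -> int:
    """POST one email notification; return the HTTP status code."""
    body = json.dumps({"email_address": to, "template_id": template_id}).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey-v1 {API_KEY}",  # auth scheme assumed
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def soak(duration_s: float, rate_per_s: float = 1.0) -> None:
    """Send emails at a constant rate, logging any 5xx like the one Locust caught."""
    start = time.monotonic()
    last = start
    while time.monotonic() - start < duration_s:
        try:
            status = send_email("some-template-uuid", "test@example.com")
            if status >= 500:
                print("server error:", status)
        except Exception as exc:  # e.g. the 500 seen during the writer upgrade
            print("request failed:", exc)
        deadline = next_send_time(last, rate_per_s)
        time.sleep(max(0.0, deadline - time.monotonic()))
        last = deadline
```

Pacing off a fixed deadline (rather than sleeping a flat 1s after each send) keeps the rate constant even when individual requests are slow.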
TL;DR
Made a PR to update the staging terraform to be xlarge.
Since the xlarge instance is already there in AWS, the terraform plan says it won't change anything. So we can presumably upgrade manually and then merge a corresponding terraform PR without anything breaking.
Next, we will manually roll back and then merge this PR while a soak test runs, to see whether we could do the whole upgrade with just the PR.
PR to upgrade: https://github.com/cds-snc/notification-terraform/pull/1454
Just updated the branch and we will see where this is tomorrow. Jimmy would like to discuss when we can push this to production. Let's discuss this!
Steve is gonna do a smoke test on staging today and we will wait until next week to push this to production. Date to be confirmed.
Upgraded on staging via a terraform PR (setting the instance types and apply_immediately=true)
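The change in that PR amounts to something like the following sketch (resource names, engine, and the exact instance class are illustrative; only the xlarge bump and apply_immediately=true come from these notes):

```hcl
# Sketch of the staging RDS change -- names and classes are illustrative.
resource "aws_rds_cluster_instance" "notification_db" {
  count              = 3                              # e.g. one writer + two readers
  cluster_identifier = aws_rds_cluster.notification_db.id
  engine             = "aurora-postgresql"            # assumed engine
  instance_class     = "db.r5.xlarge"                 # bumped to xlarge (exact class assumed)
  apply_immediately  = true                           # apply now instead of waiting for the maintenance window
}
```

With apply_immediately = true, merging the PR triggers the instance modification right away, which is why the celery SSL errors show up during the apply rather than at the next maintenance window.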
"SSL connection has been closed unexpectedly" celery errors, but no delivery_receipt crashes this time (so no notifications were stuck in a sending state).
Recommend we do the upgrade with a terraform PR off hours.
Will do it tonight at 8 pm EST
Doing it in prod! One reader took 6m 56s, the next one 12m 49s, and the writer 19m 54s.
One note: our soak test should be more like 3 emails / second rather than 1 email / second, to allow us to better anticipate alerts
Notifications dataset successfully refreshed last night :tada:
Free local memory now only drops to 33.4G during the dataset refresh, and the refresh no longer crashes :tada:
Description
As an operator of Notify, I would like our quicksight data to stay up to date, so that end users can leverage the features of quicksight.
*PLEASE read the acceptance criteria before doing this work. :)
WHY are we building?
The quicksight data refresh jobs are failing and crashing the RDS instance that the refresh is running on due to too much data
WHAT are we building?
Increase RDS instance size as a temporary measure to resolve the issue
VALUE created by our solution
Quicksight data works again!
Acceptance Criteria
QA Steps