Closed: ben851 closed this issue 2 days ago
Manual upgrades on staging while doing a 1 email / second soak test:
Started write upgrade at 10:33 on staging
One 500 error (API lambda SSL-closing error); Locust got: HTTPError('500 Server Error: Internal Server Error for url: /v2/notifications/email')
10 "SSL connection has been closed unexpectedly" related celery errors at 10:35 (triggered the alarm :/ )
78b009f6-ed85-4ac3-aaf6-3f56e4b503bc
6147850f-be62-4716-ac17-228eef9aa3fa
No failed emails
Next: modifying a reader at 10:41
Done 10:49
No errors
Last one (the writer) at 10:53
Finished at 11:00, no errors
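The soak test above was driven by Locust; as a rough stand-in, a minimal plain-Python sketch of the same 1 email/second constant pacing might look like this (the host, template id, recipient, and auth scheme are all hypothetical, not taken from the notes):

```python
import json
import time
import urllib.request

# Hypothetical values -- the real staging host, template, and key are not in the notes.
API_URL = "https://api.staging.example.com/v2/notifications/email"
API_KEY = "REPLACE_ME"

def next_send_time(last_send: float, rate_per_s: float) -> float:
    """Timestamp for the next request under constant pacing of rate_per_s req/s."""
    return last_send + 1.0 / rate_per_s

def send_email(template_id: str, to: str) -> int:
    """POST one email notification; return the HTTP status code."""
    body = json.dumps({"email_address": to, "template_id": template_id}).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey-v1 {API_KEY}",  # auth scheme assumed
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def soak(duration_s: float, rate_per_s: float = 1.0) -> None:
    """Send emails at a constant rate, logging any 5xx like the one Locust caught."""
    start = time.monotonic()
    last = start
    while time.monotonic() - start < duration_s:
        try:
            status = send_email("some-template-uuid", "test@example.com")
            if status >= 500:
                print("server error:", status)
        except Exception as exc:  # e.g. the 500 seen during the writer upgrade
            print("request failed:", exc)
        deadline = next_send_time(last, rate_per_s)
        time.sleep(max(0.0, deadline - time.monotonic()))
        last = deadline
```

Pacing off a fixed deadline (rather than sleeping a flat 1s after each send) keeps the rate constant even when individual requests are slow.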
TL;DR
Made a PR to update the staging terraform to be xlarge.
Since the xlarge instance is already there in AWS, the terraform plan says it won't change anything. So we can presumably upgrade manually and then merge a corresponding terraform PR without anything breaking.
Next, we will manually roll back and then merge this PR while a soak test runs, to see whether we could do the whole upgrade with just the PR.
PR to upgrade: https://github.com/cds-snc/notification-terraform/pull/1454
Just updated the branch and we will see where this is tomorrow. Jimmy would like to discuss when we can push this to production. Let's discuss this!
Steve is gonna do a smoke test on staging today and we will wait until next week to push this to production. Date to be confirmed.
Upgraded on staging via a terraform PR (setting the instance types and apply_immediately=true)
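The change in that PR amounts to something like the following sketch (resource names, engine, and the exact instance class are illustrative; only the xlarge bump and apply_immediately=true come from these notes):

```hcl
# Sketch of the staging RDS change -- names and classes are illustrative.
resource "aws_rds_cluster_instance" "notification_db" {
  count              = 3                              # e.g. one writer + two readers
  cluster_identifier = aws_rds_cluster.notification_db.id
  engine             = "aurora-postgresql"            # assumed engine
  instance_class     = "db.r5.xlarge"                 # bumped to xlarge (exact class assumed)
  apply_immediately  = true                           # apply now instead of waiting for the maintenance window
}
```

With apply_immediately = true, merging the PR triggers the instance modification right away, which is why the celery SSL errors show up during the apply rather than at the next maintenance window.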
"SSL connection has been closed unexpectedly" celery errors, but no delivery_receipt crashes this time (so no notifications were stuck in a sending state).
Recommend we do the upgrade with a terraform PR off hours.
Will do it tonight at 8 pm EST
Doing it in prod! One reader took 6m 56s, the next one 12m 49s, and the writer 19m 54s.
One note: our soak test should be more like 3 emails / second rather than 1 email / second, to allow us to better anticipate alerts
Notifications dataset successfully refreshed last night :tada:
Free local memory now only drops to 33.4G during the dataset refresh, and the refresh no longer crashes :tada:
Description
As an operator of Notify, I would like our quicksight data to stay up to date, so that end users can leverage the features of quicksight.
*PLEASE read the acceptance criteria before doing this work. :)
WHY are we building?
The quicksight data refresh jobs are failing and crashing the RDS instance that the refresh is running on due to too much data
WHAT are we building?
Increase RDS instance size as a temporary measure to resolve the issue
VALUE created by our solution
Quicksight data works again!
Acceptance Criteria
QA Steps