Closed by laineymajor 11 months ago
I've gone through the vets-api endpoints with the highest latencies. While some of the latencies were rather high, none of the endpoints had spikes (of errors or latency) during the last PACT Act surge. For now, nothing else needs to be bulkheaded.
An InProgressFormCleanup job runs at 2am daily to clean up old forms. This is important so that we free up space in the PG db and don't leave unnecessary/old forms hanging around after they've been sent for processing and completed. The logs here show that this job is successfully running daily at 2am to clean up forms.
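For anyone reviewing the cleanup behavior, here is a minimal sketch of what a daily stale-form sweep does in principle. It is not the vets-api job itself: the `InProgressForm` fields, the `cleanup` helper, and the 60-day `RETENTION` window are all assumptions made for illustration, since this issue does not state the real threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical retention window; the real job's threshold is not stated in this issue.
RETENTION = timedelta(days=60)


@dataclass
class InProgressForm:
    # Hypothetical shape of a stored in-progress form row.
    form_id: str
    updated_at: datetime


def cleanup(forms, now, retention=RETENTION):
    """Drop forms last updated before the cutoff.

    Returns (kept_forms, removed_count), standing in for a daily
    DELETE of stale rows from the database.
    """
    cutoff = now - retention
    kept = [f for f in forms if f.updated_at >= cutoff]
    return kept, len(forms) - len(kept)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 2, 0)
    forms = [
        InProgressForm("old", now - timedelta(days=90)),
        InProgressForm("recent", now - timedelta(days=5)),
    ]
    kept, removed = cleanup(forms, now)
    print([f.form_id for f in kept], removed)
```

The point of the sweep is the one made above: rows that are past their useful life get removed on a schedule, so table size tracks active forms rather than total historical submissions.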
Potentially adding another endpoint to the forms bulkhead
We do not have autoscaling for Postgres, but it seems like this was a deliberate decision... due to performance issues.
We will close this on Thursday
PROBLEM STATEMENT
A number of issues occurred during the last PACT Act submission (Lainey to link threads). To prepare appropriately for the next cohort's submission, the following actions are being taken:
ADDITIONAL INFORMATION
Code Yellow (first time this is being done)
10-10 form & Healthcare special enrollment period
Endpoints might need to be bulkheaded
[x] confirm that endpoints from last cohort are still bulkheaded.
[x] do we need to bulkhead additional endpoints? We should hear from VFS teams over the next week or so about their tests. WE ARE AWARE OF WHAT NEEDS TO BE ISOLATED
[x] review the scaling of the last cohort to determine if we need to set up additional bulkheads for 1) feature toggles, 2) in progress forms, 3) backend statuses, and 4) user maintenance windows (these endpoints aren't really fully owned by anyone, and therefore fall to Platform for now)
We have determined that we do not need to do additional load testing as we have very significant data from the last cohort's submission
Chris keeping an eye on current 10-10 testing
[x] any other unowned forms/endpoints? search, etc?
[x] are we confident that these endpoints can handle the load, or should we load test them?
[x] review traffic/load patterns, specifically the load error rate for feature toggles and v0/search during the last PACT Act surge
[x] determine if we need bulkheads per ^^^ load patterns
[x] review the in progress forms cleanup jobs
[x] review current space in Postgres (currently at 60% used)
[x] Chris: does postgres have an auto upgrade for when we reach a certain capacity?
[x] review the monitors/alerts for PagerDuty
[x] Identify additional Endpoints on the critical path (e.g. /v0/search)
HCA endpoints for bulkheading
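
For readers new to the term, the bulkheading discussed in the checklist above can be illustrated with a small in-process sketch. This issue does not describe how the platform actually implements its bulkheads, so this is only the generic pattern: each endpoint group gets its own bounded pool of slots, and a saturated group sheds load instead of consuming capacity that other groups need. The `Bulkhead` class, the `BulkheadFull` exception, the pool sizes, and the `"feature_toggles"` key are all hypothetical.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    """Caps concurrent calls for one endpoint group (hypothetical helper)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately rather than queueing,
        # so a surge on one group cannot tie up shared workers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical pool sizes; real limits would come from observed traffic.
BULKHEADS = {
    "feature_toggles": Bulkhead("feature_toggles", max_concurrent=2),
    "/v0/search": Bulkhead("/v0/search", max_concurrent=2),
}


if __name__ == "__main__":
    result = BULKHEADS["/v0/search"].call(lambda q: f"results for {q}", "pact act")
    print(result)
```

The design choice worth noting is the non-blocking acquire: a request that finds its bulkhead full fails fast, which keeps a surge on one endpoint from turning into latency on every other endpoint.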