Open LindseySaari opened 1 day ago
PG query duration spiked yet again last night. I looked into job run times around the spike; here are some findings.
VBADocuments::UploadScanner
– Every 3 minutes
VBADocuments::UploadRemover
– Every 5 minutes
EVSS::DeleteOldClaims
– 2:00 AM
DeleteOldPiiLogsJob
– 2:20 AM
DecisionReview::FailureNotificationEmailJob
– 1:05 AM
Form526StatusPollingJob
– 3:00 AM
DeleteOldTransactionsJob
– 3:00 AM
Representatives::QueueUpdates
– 3:00 AM
VBADocuments::UploadRemover
– logs from last night
DeleteOldPiiLogsJob
– logs here from last night, from the 2:00-4:00 AM error window
Only DeleteOldPiiLogsJob has error logs; none of the other jobs do.
Looking at the DeleteOldPiiLogsJob job, I wonder if this is at play. I did a .count on that table just now and it returned 687,000 records. There is also an index on the created_at column, so the deletion could be taking longer because it has to update that index as well. I wonder if the deletion should be batched. A large delete could cause locking on that table, but that wouldn't relate to the VBA jobs.
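To make the batching idea concrete, here is a minimal sketch of the loop shape, not the job's actual code. Everything in it is hypothetical: `FakeLogTable` is an in-memory stand-in for the real table, and `delete_expired_in_batches`, `batch_size`, and `pause` are names made up for illustration. In a Rails app the same shape is commonly written with ActiveRecord's `in_batches(of: ...).delete_all`; whether that suits this job's model is an assumption to verify.

```ruby
# Hypothetical in-memory stand-in for the log table (NOT the real model).
class FakeLogTable
  Row = Struct.new(:id, :created_at)

  def initialize(rows)
    @rows = rows
  end

  def count
    @rows.size
  end

  # Up to `limit` ids of rows older than `cutoff`, lowest id first.
  def expired_ids(cutoff, limit)
    @rows.select { |r| r.created_at < cutoff }.map(&:id).sort.first(limit)
  end

  # Delete the given ids and return how many rows were removed.
  def delete_ids(ids)
    before = @rows.size
    @rows.reject! { |r| ids.include?(r.id) }
    before - @rows.size
  end
end

# Delete expired rows in small chunks instead of one large DELETE, so each
# statement touches fewer rows and index entries and finishes quickly.
def delete_expired_in_batches(table, cutoff:, batch_size: 1000, pause: 0)
  deleted = 0
  batches = 0
  loop do
    ids = table.expired_ids(cutoff, batch_size)
    break if ids.empty?

    deleted += table.delete_ids(ids)
    batches += 1
    # Optional breather between batches so other queries can get through.
    sleep(pause) if pause.positive?
  end
  { deleted: deleted, batches: batches }
end
```

Calling `delete_expired_in_batches(table, cutoff: some_time, batch_size: 1000)` returns a hash with the total rows deleted and the number of batches, which could be logged per run to see how long each night's cleanup really takes.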
Description:
As part of the actionable alerts investigation, our Datadog monitor flagged query duration spikes in Postgres during the 2:30-4:00 AM ET window (insert link to Slack conversation here). The Appeals team owns the related job. We found PG query statement timeouts in the logs, but after investigating theories around locking (autovacuum, DB cleanup tasks, etc.), locking does not appear to be the cause. RDS logs also don’t reveal issues, and the jobs run fine throughout the day. Adjustments were made to job intervals (link to PRs). We need to work with the Appeals team and keep monitoring the situation.
Acceptance Criteria: