Investigate PG Query Duration Spikes for VBA job

department-of-veterans-affairs / va.gov-team


Description:

As part of the actionable alerts investigation, our Datadog monitor surfaced Postgres query duration spikes during the 2:30-4:00 AM ET window (insert link to slack conversation here). The Appeals team owns the related job. We found PG statement timeouts in the logs, but after investigating locking theories (autovacuum, DB cleanup tasks, etc.), locking does not appear to be the cause. RDS logs also do not reveal any issues, and the jobs run fine throughout the day. Job intervals have been adjusted (link to PRs). We need to work with the Appeals team and keep monitoring the situation closely.

Acceptance Criteria:

[ ] Investigate further potential causes of query spikes beyond locking.

[ ] Coordinate with the Appeals team to understand job behavior.

[ ] Monitor Postgres query durations between 2:30-4:00 AM ET to track changes.

[ ] Ensure any job adjustments are documented and tracked.

The PG query duration spiked again last night. After looking into job run times, here are some findings.

VBA job runtime notes

VBADocuments::UploadScanner - Every 3 minutes
VBADocuments::UploadRemover - Every 5 minutes

Other jobs that run in that window

EVSS::DeleteOldClaims – 2:00 AM
DeleteOldPiiLogsJob – 2:20 AM
VBADocuments::UploadScanner – Every 3 minutes
VBADocuments::UploadRemover – Every 5 minutes
DecisionReview::FailureNotificationEmailJob – 1:05 AM
Form526StatusPollingJob – 3:00 AM
DeleteOldTransactionsJob – 3:00 AM
Representatives::QueueUpdates – 3:00 AM

Jobs with Error Logs

VBADocuments::UploadRemover - logs from last night
DeleteOldPiiLogsJob - logs from last night's 2:00-4:00 AM error window

None of the other jobs that run in that window have error logs (only DeleteOldPiiLogsJob does).

Investigation Notes

Looking at DeleteOldPiiLogsJob, I wonder if it is a factor. A .count on that table just now returned 687,000 records. There is also an index on the created_at column, so the deletion could be taking longer because it has to update that index as well. Maybe the deletion should be batched? This could cause table locking, but that wouldn't relate to those VBA jobs.
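
To make the batching idea concrete, here is a minimal sketch of what "delete in batches" could look like. It is not the job's actual code: the helper name `delete_in_batches`, the batch size of 500, and the 14-day retention cutoff are all assumptions for illustration, and an in-memory array stands in for the table so the loop can be run on its own.

```ruby
# Hypothetical helper (not from the codebase): repeatedly calls a block that
# deletes at most `batch_size` rows and returns how many it removed, stopping
# once a batch comes back smaller than the limit.
def delete_in_batches(batch_size:, pause: 0)
  raise ArgumentError, 'batch_size must be positive' unless batch_size.positive?

  total = 0
  loop do
    deleted = yield(batch_size)
    total += deleted
    break if deleted < batch_size # partial or empty batch: nothing left to delete

    sleep(pause) if pause.positive? # optional breather between batches
  end
  total
end

# In-memory demo: each array element stands in for one row's created_at value.
def demo_batched_delete
  now = Time.now
  rows = Array.new(2_500) { |i| now - (i * 3600) } # one row per hour, going back in time
  cutoff = now - (14 * 24 * 3600)                  # assumed retention window, not from the job

  batches = 0
  removed = delete_in_batches(batch_size: 500) do |limit|
    batches += 1
    victims = rows.select { |created_at| created_at < cutoff }.first(limit)
    rows -= victims
    victims.size
  end

  { removed: removed, batches: batches, remaining: rows.size }
end

puts demo_batched_delete.inspect
```

The point of the shape is that each batch is its own short statement, so no single DELETE has to touch every expired row and its index entries at once. In a Rails app, ActiveRecord's `in_batches(of: ...)` followed by `delete_all` on each batch gives a similar effect; whether that would actually reduce the spikes here is untested.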
