department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
281 stars 197 forks source link

Investigate PG Query times increasing #89310

Closed ericboehs closed 1 month ago

ericboehs commented 1 month ago

Since ~July 15th our DB latency has started to increase. There's also been a small spike at 2p CT since July 10th:

July 10th: Screenshot 2024-07-24 at 21 04 18@2x

Two weeks: Screenshot 2024-07-24 at 21 01 11@2x

July 15th: Screenshot 2024-07-24 at 21 02 40@2x

Tasks

Acceptance Criteria

ericboehs commented 1 month ago

We're scaling puma threads down to 8 today to see if this helps.

Notes

Timeline: July 11th: EKS Cluster Nodes: 45 bin/is_ready is deployed initial delay on readiness decreased 45->20; failure threshold increased 3->5

July 16th: EKS Cluster Nodes: 45->35 (afternoon)

July 17th: EKS Cluster Nodes: 35->30 (morning) EKS Cluster Nodes: 30->25 (afternoon)

July 18th: EKS Cluster Nodes: 25->20 (morning) EKS Cluster Nodes: 20->15 (afternoon)

July 19th: EKS Cluster Nodes: 15->10 (afternoon)

We went up with Puma threads on the 16th as. Maybe we went too high with thread count. We also increased AR and Redis connection pooling around that time.

There's a 3pm ET spike we need to look into. It's been spiking our latency every day since July 10th. I'm guessing a job was merged/deployed then. It could have also been enabled via feature flag.

ericboehs commented 1 month ago

Starting around the time (July 16th) we increased the connection pool/max threads to 12 (from 5), we started seeing problems with DB latency. As we tuned our nodes and pods down in size, the problems got worse.

image

Today we decreased threads and that helped us signficiantly. We're furthering tuning these values down to help with db latency.

image

Related: https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3021

ericboehs commented 1 month ago

It looks like a select few queries are skewing our metrics. When deselecting the top 4 offenders, our metrics don't look so bad. Not sure what to conclude from this:

Screenshot 2024-07-29 at 12 53 21@2x

The SELECT "vba_documents_upload_submissions"... show up in PgHero as a slow query but not the 526 submissions. Perhaps it's not slow enough on average to populate in PgHero.

Further investigation into these queries needs to be done (e.g. index/explain).

ericboehs commented 1 month ago

It appears the form526 queries are sporadic and skewing our metrics. Benefits Disablity is going to take a look. Closing for now.