department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Adjust breakers #79152

Open rmtolmach opened 6 months ago

rmtolmach commented 6 months ago

Problem

Our breakers aren't sensitive enough and don't trip as fast as we need them to. There were some incidents over the weekend (link to #oncall thread). The alerts could have been prevented if the breakers had tripped faster.

Solution

Make breakers trip faster by lowering the failure-rate threshold (20%? 40%?), or adjust the seconds before retry?

LindseySaari commented 6 months ago

Scenario: EVSS (and a few other services) went down. The breakers config trips (by default) at a 50% failure rate. We were seeing latency for Vets API because there was a puma backlog, and we couldn't scale because our readiness probe hits /v0/healthcheck. Since the probe hits an application endpoint and not the puma stats endpoint, the healthcheck was not successful because the readiness probe timeout was reached. We think the EVSS failure rate may have been bouncing around the 50% threshold, which kept tripping and releasing the breaker. As a result, we never burned down the backlog, and the healthchecks could not complete successfully (which blocked scaling).

We want to see whether we should decrease the threshold for some of the offending services that frequently have issues. We could potentially start with 40% and go from there. This would let us trip the breaker sooner, avoid a cascading latency event caused by the puma backlog, and successfully scale based on our HPA metrics.
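To make the flapping concern concrete, here is a minimal sketch in plain Ruby. It is not the breakers gem's actual logic: the `breaker_states` helper, the per-window sampling, and the sample numbers are all assumptions made for illustration. It only shows how an error rate hovering around 50% crosses a 50% threshold back and forth, while the same traffic stays above a 40% threshold.

```ruby
# Hypothetical samples: [failed requests, total requests] per measurement window.
# These numbers are invented to mimic an error rate hovering near 50%.
WINDOWS = [[48, 100], [52, 100], [47, 100], [51, 100], [45, 100]].freeze

# Simplified model (an assumption, not the gem's implementation): the breaker is
# "open" in a window when the error percentage meets or exceeds the threshold.
def breaker_states(windows, threshold)
  windows.map do |errors, total|
    rate = total.zero? ? 0.0 : errors * 100.0 / total
    rate >= threshold ? 'open' : 'closed'
  end
end

[50, 40].each do |threshold|
  puts "threshold #{threshold}%: #{breaker_states(WINDOWS, threshold).join(' ')}"
end
# threshold 50%: closed open closed open closed
# threshold 40%: open open open open open
```

In this toy model the 50% threshold flips state every window, while 40% keeps the breaker open throughout. Real behavior also depends on the retry interval and the window over which the gem measures errors, which this sketch ignores.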

rmtolmach commented 3 hours ago

The breakers gem makes this easy to change. Modify this line.
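For readers without access to the linked line, the change would likely look something like the sketch below. This is an illustration, not the actual Vets API code: the service name and host are placeholders, and it assumes the gem's `Breakers::Service` accepts `error_threshold` (a percentage) and `seconds_before_retry` options, so check the gem's README for the version in use.

```ruby
require 'breakers'

# Hypothetical service definition; 'EVSS' and the host below are placeholders.
evss_service = Breakers::Service.new(
  name: 'EVSS',
  # The arguments passed to the matcher differ between gem versions;
  # a single request_env argument is assumed here.
  request_matcher: proc { |request_env| request_env.url.host == 'evss.example.com' },
  seconds_before_retry: 60, # how long to wait before letting a test request through
  error_threshold: 40       # trip at a 40% failure rate instead of the default 50%
)
```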