Open rmtolmach opened 6 months ago
Scenario: EVSS (and a few other services) went down. The breakers config trips (by default) at a 50% failure rate. We saw latency for Vets API because there was a Puma backlog, and we couldn't scale because our readiness probe hits /v0/healthcheck. Since the probe hits an application endpoint rather than the Puma stats endpoint, the healthcheck failed once the readiness probe timeout was reached. We think the EVSS failure rate may have been bouncing around the 50% threshold, which kept tripping and releasing the breaker. That never let us burn down the backlog, so healthchecks could not complete successfully and we could not scale. We want to evaluate decreasing the threshold for some of the offending services that frequently have issues, potentially starting at 40% and going from there. Tripping the breaker sooner would avoid a cascading latency event caused by the Puma backlog and let us scale successfully on our HPA metrics.
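To make the flapping behavior concrete, here is a minimal self-contained Ruby sketch (not the Breakers gem itself; class and method names are illustrative) of a percentage-based breaker. It shows how a failure rate hovering just under the threshold keeps the breaker closed, so traffic keeps flowing to the failing service until the rate crosses the line:

```ruby
# Minimal sketch of a failure-rate circuit breaker (illustrative only).
class SketchBreaker
  def initialize(error_threshold:)
    @error_threshold = error_threshold # percent of failed requests that trips the breaker
    @successes = 0
    @failures = 0
  end

  def record(success)
    success ? @successes += 1 : @failures += 1
  end

  def failure_rate
    total = @successes + @failures
    return 0.0 if total.zero?
    @failures * 100.0 / total
  end

  def open?
    failure_rate >= @error_threshold
  end
end

breaker = SketchBreaker.new(error_threshold: 50)
10.times { breaker.record(true)  }  # 10 successes
9.times  { breaker.record(false) }  # 9 failures -> ~47.4%, just under the threshold
puts breaker.open?                  # => false: still sending traffic to the failing service
breaker.record(false)               # one more failure -> exactly 50.0%
puts breaker.open?                  # => true: breaker finally trips
```

A rate that oscillates between roughly 47% and 50% would toggle `open?` back and forth in this model, which matches the trip-and-release pattern described above; a lower threshold keeps the breaker firmly open during such an event.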
Problem
Our breakers aren't sensitive enough and don't trip as fast as we need them to. There were several incidents over the weekend (link to #oncall thread); the alerts could have been prevented if the breakers had tripped faster.
Solution
Make the breakers trip faster (at 20%? 40%?) and/or adjust the seconds-before-retry window.
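As a starting point, the threshold could be lowered per service in the Breakers configuration. A hedged sketch follows: the service name, host pattern, and values are illustrative, and the exact `Breakers::Service` keyword arguments and `request_matcher` arity should be verified against the gem version vets-api is pinned to.

```ruby
require 'breakers'

# Hypothetical per-service override: trip at 40% failures instead of the
# default 50%, and keep the breaker open longer before retrying.
evss_service = Breakers::Service.new(
  name: 'evss',
  request_matcher: proc { |request_env| request_env.url.host =~ /evss/ },
  error_threshold: 40,       # percent of failed requests that trips the breaker
  seconds_before_retry: 120  # how long the breaker stays open once tripped
)
```

Starting at 40% only for the services that recur in incidents keeps the blast radius small, and we can tune downward (or adjust the retry window) based on what we observe in subsequent outages.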