Backpressure: overly susceptible to temporary issues

Environment

Steps to Reproduce

Over the last 30 days, we've experienced ~265 instances where backpressure was marked as unhealthy due to a connection timeout while checking the health of a Redis or RabbitMQ cluster: https://cloudlogging.app.goo.gl/KNZDAduqrHWQn5At7

Each of these comes with a corresponding pause and delay in ingestion:

A single timeout seems to trigger about 15s of ingestion latency.

Multiple timeouts can also occur in quick succession, which seems to be enough to build up a backlog large enough to page SRE while it burns down.

Expected Result

Some possible improvements we can make:

Add some retry functionality to avoid flakes
Require multiple events in a row to trigger the unhealthy state
Have backpressure fail open instead of closed (this could have a negative impact if the failures are caused by a real outage of a cluster).

I would probably start with adding retries on failure as it seems like the simplest thing that can work.
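
To make the first two options concrete, here is a rough sketch rather than a proposal for the actual implementation. All names (`check_with_retries`, `ConsecutiveFailureGate`, `retry_on`, the default attempt count, delay, and threshold) are hypothetical and do not refer to existing Sentry code; the real health checks would likely raise client-specific exception types that would need to be passed in.

```python
import time
from typing import Callable, Tuple, Type


def check_with_retries(
    check: Callable[[], None],
    attempts: int = 3,
    delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> bool:
    """Return True if `check` succeeds within `attempts` tries.

    `check` is a hypothetical health probe that raises on failure.
    Only exceptions listed in `retry_on` are treated as retryable.
    """
    for attempt in range(attempts):
        try:
            check()
            return True
        except retry_on:
            # Sleep between attempts, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


class ConsecutiveFailureGate:
    """Report unhealthy only after `threshold` failed checks in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result and return the resulting health state."""
        # Any success resets the streak; a failure extends it.
        self.failures = 0 if healthy else self.failures + 1
        return self.failures < self.threshold
```

The two pieces are independent: retries absorb a one-off timeout inside a single check, while the gate tolerates a failed check as long as the next one succeeds. Either one would still report unhealthy during a sustained outage, which is the concern with failing open.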

Actual Result

Backpressure pauses ingestion after a single failed health check.

Product Area

Ingestion and Filtering

Link

No response

DSN

No response

Version

No response

getsentry / sentry