Cron alerting based on successive failures that also accommodates cron retries

jvineet commented 1 year ago

Problem Statement

We have been using the crons beta for our self-hosted sentry deployment, and I should start by saying it is a great addition to the sentry suite of services.

We have noticed a few minor issues with alerting, however. We have a lot of internal crons using the new Sentry cron monitors, and some of them run at very short intervals (15 minutes or less). We have also set up automatic retries for most of our crons, so they are retried at least twice on failure. The existing alerting for cron monitors all revolves around check-in failure errors, which are reported for either missed or failed check-ins. This causes a lot of noise in our alerting system: we often get alerts for crons that failed a first attempt (because of network issues or other causes outside of our control) but succeeded on a retry.

It would be great if we could get some cron-specific alerting based on things like n successive failures. This would not only help with tuning out failures that are later fixed by a retry, but would also let us set somewhat relaxed alerting for crons that run every 15 minutes or more often. Since they run so frequently, we are often OK with them failing for 2 or 3 successive execution intervals as long as they recover automatically, and we do see that happen often.
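A minimal sketch of the "alert only after n successive failures" rule described above. The `ConsecutiveFailureAlert` class and its `record` method are hypothetical names used for illustration, not part of Sentry; the status strings are plain labels chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureAlert:
    """Hypothetical helper: fires once a monitor fails `threshold` times in a row."""

    threshold: int
    streak: int = 0

    def record(self, status: str) -> bool:
        """Record one check-in status; return True if an alert should fire now."""
        if status == "ok":
            # Any success resets the streak, so isolated failures never alert.
            self.streak = 0
            return False
        # Anything other than "ok" (e.g. "error", "missed") counts as a failure.
        self.streak += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.streak == self.threshold


alert = ConsecutiveFailureAlert(threshold=3)
statuses = ["error", "ok", "error", "error", "error", "error"]
fired = [alert.record(s) for s in statuses]
```

With a threshold of 3, the single failure followed by a success is ignored, and the alert fires only on the third failure of the later streak.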

Solution Brainstorm

I don't have any good ideas on how to solve this, but I can share some of the workarounds we have put in place for now.

As a workaround to mimic alerts on successive failures, we have tried metric-based alerting, where we pick an evaluation period for the error count so that the cron can only run roughly that many times within the period, e.g. "failure count >= 3 within a 1 hour evaluation period" for a 15 minute cron interval. This is cumbersome to set up individually for each cron, and it works for some but not all cron schedules because of the limited evaluation interval options in the metric-based alert config. It is also hard to check for successive failures right now, because an error event is created on cron failure but no info-level event is created on a successful cron execution. If we had those, we could maybe set up metric-based alerting with something like ">= 3 failures and success events = 0" for some evaluation period, which might help.
To reduce noise from retries, our current workaround uses Redis to store the check-in ID with a TTL just under the cron schedule's interval. We use a common harness for running most of our crons, and this harness is responsible for sending cron check-in events to Sentry. The harness also knows which retry attempt it is running. It first checks whether there is a check-in ID for the cron in Redis. If it finds one, it reuses that check-in ID; otherwise it starts a new check-in and adds its ID to Redis. It skips sending a failure event if there are retries left and sends a failure only on the last attempt. This solution is not very extensible, especially for crons that run outside of this harness, and we do have some of those. It also means we can't use sentry-cli or sentry-sdk directly to send check-in events for our crons. Maybe the cron monitor config on the Sentry server could let us specify the max retries. Sentry could then keep track of the failure check-ins it receives within a schedule interval and compare that against the max retry count before it marks a cron execution as a failure.
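The harness logic described above could be sketched roughly as follows. This is an illustration under stated assumptions, not our actual harness: `TTLStore` is an in-memory stand-in for Redis, `run_attempt` is a hypothetical function name, and `send_checkin` is a hypothetical callback standing in for whatever actually sends the check-in to Sentry.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class TTLStore:
    """In-memory stand-in for a Redis key with expiry (assumption for this sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]


def run_attempt(
    monitor: str,
    job: Callable[[], None],
    attempt: int,
    max_attempts: int,
    interval_seconds: float,
    store: TTLStore,
    send_checkin: Callable[[str, str, str], None],  # hypothetical sender
) -> bool:
    """Run one attempt of a cron; report failure only on the last attempt."""
    key = f"checkin:{monitor}"
    checkin_id = store.get(key)
    if checkin_id is None:
        # First attempt in this schedule interval: start a new check-in and
        # remember its ID with a TTL just under the interval.
        checkin_id = uuid.uuid4().hex
        store.set(key, checkin_id, ttl_seconds=interval_seconds - 1)
        send_checkin(monitor, checkin_id, "in_progress")
    try:
        job()
    except Exception:
        if attempt >= max_attempts:
            send_checkin(monitor, checkin_id, "error")
        return False  # earlier attempts stay silent while retries remain
    send_checkin(monitor, checkin_id, "ok")
    return True


def flaky() -> None:
    raise RuntimeError("transient network issue")


sent = []
store = TTLStore()
record = lambda monitor, checkin_id, status: sent.append(status)
run_attempt("nightly-sync", flaky, 1, 3, 900, store, record)
run_attempt("nightly-sync", lambda: None, 2, 3, 900, store, record)
```

In this run the first attempt fails silently and the retry succeeds, so only an in-progress and an OK check-in are sent, both under the same check-in ID.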

Product Area

Crons

getsantry[bot] commented 1 year ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-crons for triage ⏲️

rjo100 commented 1 year ago

Thanks for the feedback! We've been working on this on our end and this is super helpful. Will keep this updated as we roll out relevant features.
