getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Cron alerting based on successive failures and also accommodating of cron retries #52551

Open jvineet opened 1 year ago

jvineet commented 1 year ago

Problem Statement

We have been using the Crons beta for our self-hosted Sentry deployment, and I should start by saying it is a great addition to the Sentry suite of services.

We have noticed a few minor issues with alerting, however. We have a lot of internal crons using the new Sentry cron monitor, and some of them run at a very short interval (15 minutes or less). We have also set up automatic retries for most of our crons, so they are retried at least twice on failure. The existing alerting for cron monitors all revolves around check-in failure errors, which are reported for either missed or failed check-ins. This causes a lot of noise in our alerting system: we often get alerts for crons that failed a first attempt (because of network issues or other causes outside our control) but succeeded on a retry.

It would be great if we could get some cron-specific alerting based on things like n successive failures. This would not only help with tuning out failed attempts that later succeed on a retry, but would also let us set somewhat relaxed alerting for crons that run every 15 minutes or more often. Since they run so frequently, we are often OK with them failing for 2 or 3 successive execution intervals if they recover automatically, which we do see happen often.
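
To make the request concrete, here is a rough sketch of the kind of rule described above. It is only an illustration: `ConsecutiveFailureAlert`, `run_succeeded`, and the `threshold` parameter are hypothetical names, not part of Sentry's API. It assumes a scheduled run counts as failed only when every retry attempt failed, and that an alert fires once a configurable number of runs have failed in a row.

```python
from dataclasses import dataclass
from typing import Iterable


def run_succeeded(attempts: Iterable[bool]) -> bool:
    """Collapse the retry attempts of one scheduled run into a single result.

    Assumption: the run is OK if any attempt (first try or retry) succeeded.
    """
    return any(attempts)


@dataclass
class ConsecutiveFailureAlert:
    """Hypothetical rule: alert only after `threshold` failed runs in a row."""

    threshold: int = 3
    streak: int = 0

    def record(self, ok: bool) -> bool:
        """Record one run result; return True when an alert should fire."""
        if ok:
            # Any successful run resets the failure streak.
            self.streak = 0
            return False
        self.streak += 1
        # Fire once, exactly when the streak reaches the threshold.
        return self.streak == self.threshold


if __name__ == "__main__":
    rule = ConsecutiveFailureAlert(threshold=3)
    # Each inner list holds the attempts (first try plus retries) of one run.
    runs = [[False, True], [False, False, False], [True]]
    for attempts in runs:
        print(rule.record(run_succeeded(attempts)))
```

With a rule like this, a run that fails its first attempt but passes on a retry never adds to the streak, and a frequently scheduled cron can miss one or two intervals without triggering an alert.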

Solution Brainstorm

I don't have any good ideas on how to circumvent this issue. I can add some of the workarounds we have put in place for now.

Product Area

Crons

getsantry[bot] commented 1 year ago

Assigning to @getsentry/support for routing ⏲️

getsantry[bot] commented 1 year ago

Routing to @getsentry/product-owners-crons for triage ⏲️

rjo100 commented 1 year ago

Thanks for the feedback! We've been working on this on our end and this is super helpful. Will keep this updated as we roll out relevant features.