getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Add PagerDuty auto resolution for Issue Alerts #55745

Open cseas opened 10 months ago

cseas commented 10 months ago

Problem Statement

Metric alerts can auto-resolve on PagerDuty when the error volume goes down, but the same isn't available for Issue Alerts. This creates a lot of extra work for our on-call engineers because they have to manually resolve each Issue Alert triggered by Sentry, even after the issue's volume has dropped back below the threshold that triggered it.

Solution Brainstorm

When the issue's volume goes back below the threshold that triggered the alert, the alert (and the PagerDuty incident it created) should get auto-resolved.

I gather from this comment https://github.com/getsentry/sentry/issues/34372#issuecomment-1121348864 that the suggestion is to use Metric Alerts instead, but that doesn't work for our use case. Metric alerts report on a collection of issues, so it's hard to spot a new high-volume issue when other high-volume issues already exist. Issue Alerts solve that by triggering on specific issues, but right now they create a lot of work for our on-call engineers since they never get auto-resolved.

Product Area

Alerts

getsantry[bot] commented 10 months ago

Assigning to @getsentry/support for routing ⏲️

rachrwang commented 10 months ago

Hi @cseas - I'm Rachel, PM for the Issues product.

In Project -> General Settings, we support configuring "Auto Resolve" as the amount of time after which an issue is automatically resolved once we stop receiving errors for it (see screenshot of the Auto Resolve setting).
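If it helps to apply this across many projects at once, the same setting can also be written through the project update endpoint. The sketch below assumes the `resolveAge` field (in hours) that the settings page appears to use, so please verify the field name and required token scope against your Sentry version before relying on it:

```python
import requests

SENTRY_API = "https://sentry.io/api/0"
AUTH_TOKEN = "<token-with-project:write-scope>"  # placeholder

def set_auto_resolve(org_slug: str, project_slug: str, hours: int) -> None:
    # PUT to the project details endpoint; resolveAge is assumed to be the
    # number of quiet hours after which an issue is auto-resolved.
    resp = requests.put(
        f"{SENTRY_API}/projects/{org_slug}/{project_slug}/",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json={"resolveAge": hours},
        timeout=10,
    )
    resp.raise_for_status()

set_auto_resolve("my-org", "my-project", 24)  # auto-resolve after 24 quiet hours
```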

Would this address your use case, or is there some other definition that you'd like to use for "auto-resolve"?

I'd also like to better understand what you mean by "they're a lot of trouble for our on-call" -> is that related to receiving the notification, looking at the issue in the product, or something else? We're actively trying to improve the alerts/notifications experience and would appreciate any feedback you have for us!

cseas commented 10 months ago

Hi @rachrwang, thanks for the quick response. I can see that the "Auto Resolve" setting was disabled for our projects; I've enabled it now.

Can you please confirm whether this also auto-resolves the PagerDuty incidents that Issue Alerts created for the issue?

I'd also like to better understand what you mean by "they're a lot of trouble for our on-call"

As part of our on-call process, we track the "Time to Resolve" for each incident triggered by Sentry on PagerDuty. For metric alerts, this is not a problem because they automatically get resolved both on Sentry and PagerDuty when the error volume comes down. So on-call engineers don't need to worry about manually resolving them on PagerDuty.

For Issue Alerts, however, once an alert is triggered on PagerDuty, it never resolves automatically, even if the volume of the issue that triggered it drops to zero. So for every PagerDuty incident triggered by an Issue Alert, the on-call engineer has to resolve it manually.

Now imagine this scenario: an Issue Alert fires, creates a PagerDuty incident, and the on-call engineer deploys a fix for the issue. They expect the PagerDuty incident to resolve automatically once the error volume comes down, but it doesn't. So, only for Sentry's Issue Alerts, we have to educate all our on-call engineers that these incidents need to be resolved manually; otherwise the "Time to Resolve" metric we track for incidents becomes hard to determine.

Incidents resolving automatically after a fix has been deployed is the general expectation since our other PagerDuty integrations (e.g. Grafana) work the same way.
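To make the ask concrete: on the PagerDuty side this boils down to sending a follow-up resolve event with the same dedup_key the trigger used. A rough sketch against the PagerDuty Events API v2 (the integration key is a placeholder, and keying resolves off the Sentry issue ID is just one possible scheme):

```python
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "<pagerduty-integration-key>"  # placeholder

def trigger_incident(dedup_key: str, summary: str) -> None:
    # Opens (or re-triggers) a PagerDuty incident keyed by dedup_key.
    requests.post(
        EVENTS_API,
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": dedup_key,
            "payload": {"summary": summary, "source": "sentry", "severity": "error"},
        },
        timeout=10,
    )

def resolve_incident(dedup_key: str) -> None:
    # Resolves whatever incident was opened with the same dedup_key.
    requests.post(
        EVENTS_API,
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "resolve",
            "dedup_key": dedup_key,
        },
        timeout=10,
    )

# Metric alerts already do the equivalent of resolve_incident() when the volume
# drops back under the threshold; the ask is for Issue Alerts to do the same,
# e.g. keyed by the Sentry issue ID (hypothetical keying scheme).
```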

getsantry[bot] commented 10 months ago

Routing to @getsentry/product-owners-issues for triage ⏲️

rachrwang commented 10 months ago

@cseas - thank you for the detailed explanation, it's very helpful context.

Regarding what you wrote here: "So, only for Sentry's Issue Alerts, we have to educate all our on-call engineers that these incidents need to be resolved manually." -> to clarify, you're referring to manual resolution in PagerDuty?

Would you be open to connecting live with me to chat more about your use case, especially on how you've configured your issue & metric alerts? You can schedule with me directly here: https://calendly.com/rachel-wang-sentry/30min?month=2023-09, or feel free to email me at rachel.wang@sentry.io. My team is working on projects related to alerts & notifications, and I'd like to learn more about your use case to help inform our upcoming roadmap.

cseas commented 10 months ago

you're referring to manual resolution in PagerDuty?

Yes, the "time to resolve" of PagerDuty incidents is the only metric we track. We just use Sentry to funnel issues into PagerDuty. If the issue only auto-resolves on Sentry but the related Issue Alert triggered by it remains open on PagerDuty, that needs manual resolution and that's what we're looking for help on to automate it.

Happy to connect for more feedback and explain our setup. I've sent you an email.

jvineet commented 2 months ago

@rachrwang Any update on where this feature request stands? We recently upgraded Sentry from v23.3.1 to v24.4.2 and find ourselves in a similar predicament with Sentry cron monitor alerts. The addition of failure and recovery tolerances in the cron monitor configuration has been a huge improvement: Sentry automatically creates a "monitor failure" issue after a configured number of consecutive cron failures and then automatically resolves that issue after a certain number of successful executions.
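For concreteness, this is roughly the monitor configuration I'm describing. It's a sketch with the Python SDK; the `failure_issue_threshold` / `recovery_threshold` names follow the Crons docs and should be verified against your SDK and server version:

```python
import sentry_sdk
from sentry_sdk.crons import monitor

sentry_sdk.init(dsn="<project-dsn>")  # placeholder DSN

# Upsert-style monitor config; the two threshold fields are the failure and
# recovery tolerances mentioned above (exact names may vary by version).
monitor_config = {
    "schedule": {"type": "crontab", "value": "0 * * * *"},
    "checkin_margin": 5,           # minutes of grace before a check-in counts as missed
    "max_runtime": 30,             # minutes before an in-progress check-in is marked failed
    "failure_issue_threshold": 3,  # consecutive failed check-ins before an issue is created
    "recovery_threshold": 2,       # consecutive successful check-ins before it auto-resolves
    "timezone": "Etc/UTC",
}

@monitor(monitor_slug="nightly-report", monitor_config=monitor_config)
def run_nightly_report():
    ...  # actual cron job body
```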

This works great from the cron observability perspective. We have set up Issue-based alerts on failure events for some crons, which send notifications to Slack and PagerDuty. But once the issue is automatically resolved by Sentry after subsequent successful cron executions, the related alerts in Slack and PagerDuty are not automatically resolved. This makes things somewhat cumbersome for the on-call engineers, who need to go and resolve these alerts manually.

It would be great if there were an option to auto-resolve Issue-based alerts when the underlying issue gets resolved automatically.

For more context, we can't use metric-based alerting here, for two broad reasons:

  1. It looks like the monitor failure issues belong to the issuePlatform dataset, and metric-based alerts don't work on that dataset.
  2. We have hundreds of crons with a wide variety of intervals, so metric-based alerting is just not a good fit for alerting on cron monitor failures.