Per-alert descriptions fed to alerting integrations

philpennock commented 1 year ago

Is your feature request related to a problem? Please describe.

When a Checkly alert reaches us via Opsgenie, it takes a lot of clicking through the dashboard, and editing the alert, to see what conditions led to the failure. Only then can we go checking internal playbooks for what went wrong.

Describe the solution you'd like

Per-alert "descriptions", as a freeform text field, which can be populated with whatever we see fit, and which are passed to the API of the alerting systems. E.g., in OpsGenie the new alert API has a "description" field with a 15,000 character limit. We'd like to be able to populate that with something other than "View your failing check in the Checkly dashboard".

We'd then populate it with a concise human summary of what the alert is checking for, and a link to playbook documentation, to speed the time to comprehension when someone is woken up at 3am.

We might want multiple paragraphs for some situations, to have initial process checks be in the description. But we could live with a single paragraph if absolutely necessary. I've seen multiple paragraphs work very well in other monitoring systems.
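To make the request concrete, here is a minimal sketch of an alert body carrying such a description. It assumes a JSON payload with `message` and `description` fields and uses the 15,000 character limit quoted above. The helper name `build_opsgenie_alert`, the check name, and the runbook URL are hypothetical, and nothing here reflects how Checkly actually builds its Opsgenie requests.

```python
import json

# Limit quoted in this request for the Opsgenie "description" field.
DESCRIPTION_LIMIT = 15_000


def build_opsgenie_alert(check_name: str, description: str) -> dict:
    """Build an alert body with a freeform, possibly multi-paragraph description.

    Hypothetical helper: the field names are assumed, not taken from Checkly.
    """
    if len(description) > DESCRIPTION_LIMIT:
        # Truncate rather than fail, so the alert is still delivered.
        description = description[:DESCRIPTION_LIMIT]
    return {
        "message": f"Check failed: {check_name}",
        "description": description,
    }


# Illustrative description: summary, initial process checks, playbook link.
description = (
    "Verifies that the login page returns HTTP 200 within 2 seconds.\n\n"
    "First checks: confirm the load balancer is healthy.\n\n"
    "Playbook: https://example.com/runbooks/login.md"
)
payload = build_opsgenie_alert("login-page", description)
print(json.dumps(payload, indent=2))
```

Paragraph breaks are kept as plain newlines in the string, so a multi-paragraph description needs no special handling on the sending side.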

Describe alternatives you've considered

Decorating elsewhere, which requires multiple sources of truth and a lot of complexity.

Additional context

Two existing issues relate to having descriptions, but not to being able to pass them through to the alerting system; they are more concerned with dashboard clutter. Those sound like short descriptions, not multiple paragraphs, else they'd just be a tab. So not the same. Those issues are: #241 and #133.

tnolet commented 1 year ago

@philpennock thanks for mentioning this. Putting my research hat on: have you considered or are you using tools that do this "as-a-service", i.e. Firehydrant? Just very curious about how deep your integration needs go and where your runbooks live?

philpennock commented 1 year ago

No, didn't know of Firehydrant. Generally we want as few moving parts as possible between "thing which monitors" and "thing which alerts based on rules".

Our runbooks are in a git repo, on GitHub. (They could as easily be on a GitHub wiki, but being in the core repo makes it easier to handle some related tasks.) We require that recovery runbooks be something which can be sync'd to laptops and grep'd over. There are all sorts of nifty features out there, but if a tool lacks revision control with an audit trail, depends upon something else being up, or blocks us from editing/searching as freely as text, then it has compromised a "critical feature" for the sake of a "nice feature".

tnolet commented 1 year ago

@philpennock sorry for the late response. This is a great insight regarding how you use runbooks (version control, decentralized etc.).

ghost commented 1 year ago

We have a similar process to @philpennock. Our runbooks are stored on Confluence or GitHub (depending on the team). We have a Checkly account per product; each product has services with dedicated teams, and we organize each product's services via a group.

We would like to add a description to the Opsgenie integration with a link to a runbook. Then, if a check fails, the integration can forward the runbook to our support and incident response teams, who receive the Opsgenie alerts. I imagine the description field may be static at first, and we would have to create multiple Opsgenie alert channels to accommodate the variation of runbooks per service (e.g. opsgenie-service1, opsgenie-service2, etc.).

We hope the description field will be dynamic, just like the webhook alert channel integration. That way we can use tags to interpolate the runbook link into the description. This issue would have to be resolved as well to make that dream work.
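The tag-based interpolation described above could be sketched roughly as follows. This uses Python's `string.Template` purely for illustration and is not Checkly's webhook template syntax. The `runbook=<url>` tag convention, the `render_description` helper, and the example URL are all hypothetical.

```python
from string import Template

# Hypothetical convention: a check tag of the form "runbook=<url>".
RUNBOOK_PREFIX = "runbook="


def render_description(template: str, check_name: str, tags: list[str]) -> str:
    """Fill a description template from the check name and its tags."""
    # Take the first tag that follows the runbook convention, if any.
    runbook = next(
        (t[len(RUNBOOK_PREFIX):] for t in tags if t.startswith(RUNBOOK_PREFIX)),
        "no runbook tagged",
    )
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template).safe_substitute(check_name=check_name, runbook=runbook)


text = render_description(
    "Check $check_name failed.\nRunbook: $runbook",
    "service1-health",
    ["team-a", "runbook=https://example.com/runbooks/service1.md"],
)
print(text)
```

With something like this, a single alert channel could serve every service, since the runbook link varies with the check's tags rather than with the channel.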

tnolet commented 1 year ago

Thanks for the ultra clear feedback!

tnolet commented 1 year ago

https://github.com/checkly/public-roadmap/issues/241

ghost commented 1 year ago

Currently using tags to pass runbooks to an Opsgenie alert as a workaround, but a description would still be preferred.

ghost commented 1 year ago

Any word on this? We realized that tags on Opsgenie have a character limit, and if the URL is too long then our response team won't know the full endpoint.

checkly / public-roadmap

Per-alert descriptions fed to alerting integrations #256