always let handlers process recovery messages

Yelp / sensu_handlers

Custom Sensu Handlers to support a multi-tenant environment, allowing checks themselves to emit the type of handler behavior they need in the event json

Apache License 2.0

75 stars 31 forks source link

always let handlers process recovery messages #78

Closed somic closed 8 years ago

somic commented 8 years ago

Internally please see OPS-8850.

We currently suppress some recoveries from reaching jira and pagerduty (and all other handlers too) when they are filtered as disabled, silenced or dependency exists. This is not good.

Let's make sure all recoveries always reach our handlers.

Upstream filter method is at https://github.com/sensu-plugins/sensu-plugin/blob/master/lib/sensu-handler.rb#L29

solarkennedy commented 8 years ago

You can't really do this, otherwise you will silence an alert for say a day, and all day you will get IRC/Emails for "resolves" from nowhere.

somic commented 8 years ago

I see your point, good catch. I can think of several things we could do.

We could narrow down whitelisting recoveries only for jira and pagerduty handlers, where it won't be as annoying.

Or we could specifically target filter_dependencies method - and whitelist recoveries even if there is an active dependency.

I think the latter is cleaner and more philosophically correct - if an event has already been triggered, and its recovery message comes in, it shouldn't matter whether a dependency exists or doesn't exist, we should just pass recovery message to all relevant handlers.

solarkennedy commented 8 years ago

In some sense, this isn't a bug. The filter is correctly filtering events for things that should not be handled because a dependent thing is already failing.

So check_remote_desktop depends on check_vpn. If check_vpn is down, and check_remote_desktop flaps a billion times, why should any handler do anything about it?

I guess it depends on who you ask which is the more "surprising" behavior. To me it is surprising that any events are handled for a check that should be suppressed by a active dependency.

somic commented 8 years ago

I see your point. Reaping stale pagerduty incidents (ones that are open but whose corresponding events in sensu have already been resolved) is probably a better approach to what I was trying to accomplish.

My situation using your example is this. check_remote_desktop is detected first and it pages, then check_vpn is detected and it pages. If recovery of check_remote_desktop is detected and/or received before recovery of check_vpn, with current code check_remote_desktop page will remain open in pagerduty and will become stale.

I agree though this PR should not be merged. Discarding.

solarkennedy commented 8 years ago

Yea, this happens in nagios as well. To me the solution is proper alert_afters on the secondary check. (I needs to give the dependent check enough "time to fail")

So like, in this example we need to run check_vpn rapidly, and check_remote_desktop can run less rapidly, but needs a bit of an alert_after to give check_vpn enough time to go critical "first".

somic commented 8 years ago

@solarkennedy - came back here with the context of discussion on internal OPSIMP-499. What if we implement this logic only in pagerduty and jira handlers? In these two handlers, resolves to issues/incidents that do not exist in upstream system, will be noops and they won't annoy humans.

solarkennedy commented 8 years ago

Yea I guess that is fine. Just seems a bit strange that the silencing behavior acts one way for some handlers and a different way for others. As long as it is documented maybe?