Yelp / sensu_handlers

Custom Sensu handlers to support a multi-tenant environment, allowing checks themselves to specify, in the event JSON, the type of handler behavior they need.
Apache License 2.0

filter alerts with "Execution timed out" in pagerduty and jira handlers #110

Open: somic opened this issue 7 years ago

somic commented 7 years ago

I would like to open this up for discussion.

If a check runs longer than its configured timeout, it is killed and reported as exiting 2 (critical) with the output "Execution timed out".

This comes from the sensu-spawn gem: https://github.com/sensu/sensu-spawn/blob/master/lib/sensu/spawn.rb#L163
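For anyone unfamiliar with where the string comes from, here is a rough Ruby sketch of the behavior, using only the standard library (`Open3`, `Timeout`): run the check command, and if it does not finish in time, kill it and report a fixed result of ("Execution timed out", 2). `run_check` is a hypothetical name; this is an illustration under those assumptions, not sensu-spawn's actual code or API.

```ruby
require "open3"
require "timeout"

# Hypothetical helper (not the sensu-spawn API): run a check command and, if it
# exceeds timeout_seconds, kill it and return the same fixed result that
# sensu-spawn reports for a timed-out check.
def run_check(command, timeout_seconds)
  # pgroup: true puts the child in its own process group, so one signal
  # reaches the command and anything it forked.
  Open3.popen2e(command, pgroup: true) do |stdin, out, wait_thr|
    stdin.close
    begin
      Timeout.timeout(timeout_seconds) do
        output = out.read
        [output, wait_thr.value.exitstatus]
      end
    rescue Timeout::Error
      begin
        # A negative pid signals the whole process group.
        Process.kill("KILL", -wait_thr.pid)
      rescue Errno::ESRCH
        # The process exited between the timeout and the kill.
      end
      wait_thr.value # reap the killed child
      # Whatever the check printed is discarded; only this fixed result remains.
      ["Execution timed out", 2]
    end
  end
end

p run_check("echo ok", 5)  # => ["ok\n", 0]
p run_check("sleep 10", 1) # => ["Execution timed out", 2]
```

Because the status is a plain 2, a handler can only tell a timeout apart from a genuine critical by looking at the output string.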

What if we filtered these out of the pagerduty and jira handlers? When we get this, we can't be certain that the check itself failed; in fact, it is almost never the actual check but a bad or hung EC2 instance or similar.
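A minimal Ruby sketch of what such a filter could decide, assuming the event arrives as JSON (as it does on STDIN for Sensu pipe handlers) and that a timeout is recognized by status 2 plus the "Execution timed out" output. The constant and method names (`FILTERED_HANDLERS`, `execution_timed_out?`, `suppress?`) are hypothetical and not part of sensu_handlers; how this would hook into the existing handler classes is left open.

```ruby
require "json"

# Hypothetical names; this is not an existing sensu_handlers or Sensu API.
TIMEOUT_OUTPUT = "Execution timed out"
CRITICAL = 2
# Handlers that would skip timed-out events under this proposal.
FILTERED_HANDLERS = %w[pagerduty jira].freeze

# True when the event looks like a sensu-spawn timeout rather than a
# failure reported by the check itself.
def execution_timed_out?(event)
  check = event.fetch("check", {})
  check["status"] == CRITICAL && check["output"].to_s.include?(TIMEOUT_OUTPUT)
end

# Decide whether the named handler should drop this event.
def suppress?(handler_name, event)
  FILTERED_HANDLERS.include?(handler_name) && execution_timed_out?(event)
end

event = JSON.parse('{"check": {"status": 2, "output": "Execution timed out"}}')
p suppress?("pagerduty", event) # => true
p suppress?("mailer", event)    # => false
```

In this sketch only the two named handlers skip the event, so other handlers would still see it, and a real critical whose output happens not to mention the timeout is never dropped.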

A positive outcome is that we would cut down on (frequently) unactionable tickets and pages. Also, if we ever decide to do more auto-remediation on these, we could be more confident that auto-remediation is attempted only when it's needed, since running auto-remediation in response to "Execution timed out" (essentially an unknown exit code from a check) may not always be desirable.

A negative outcome is that we would lose the implicit information about some problems that on-call often derives from seeing "Execution timed out".

Discuss.

@solarkennedy @bobtfish

bobtfish commented 7 years ago

I'm pro this. Almost always, when we get a non-transient "Execution timed out", we also get keepalive alerts at the same time (or shortly afterwards).

solarkennedy commented 7 years ago

I'm down.

cabecada commented 7 years ago

Yes, we have been bitten by the same issue :) and have had similar discussions.

One case was mitigated by increasing the spawn limit: https://github.com/sensu/sensu-puppet/issues/727

The other was using https://github.com/sensu-extensions/sensu-extensions-check-dependencies so that dependent services do not alert if the primary service (e.g. network) already has an issue and has been silenced.
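The idea behind dependency-based suppression can be sketched in a few lines of Ruby. This is a hypothetical illustration of the concept only, not the extension's code or its configuration format: it assumes a map from each check to the checks it depends on, plus the set of checks that currently have an unresolved event, and it looks at direct dependencies only.

```ruby
require "set"

# Hypothetical sketch; not sensu-extensions-check-dependencies' implementation.
# dependencies: check name => names of checks it depends on
# unresolved:   names of checks that currently have an open (unresolved) event
def suppressed_by_dependency?(check_name, dependencies, unresolved)
  # Array(nil) is [], so checks with no declared dependencies are never suppressed.
  Array(dependencies[check_name]).any? { |dep| unresolved.include?(dep) }
end

dependencies = { "nginx" => ["network"], "app" => ["nginx"] }
unresolved = Set["network"]

p suppressed_by_dependency?("nginx", dependencies, unresolved) # => true
p suppressed_by_dependency?("app", dependencies, unresolved)   # => false (only direct dependencies are checked)
```

Whether suppression should also follow dependencies transitively (so "app" is quiet too) is a design choice this sketch deliberately leaves out.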