Open somic opened 7 years ago
I'm in favor of this. When we get a non-transient "Execution timed out", we almost always get keepalive alerts at the same time (or shortly afterwards).
I'm down.
Yes, we have been bitten by the same issue :) and have had similar discussions.
One case was mitigated by increasing the spawn limit: https://github.com/sensu/sensu-puppet/issues/727
The other was handled by using https://github.com/sensu-extensions/sensu-extensions-check-dependencies so that dependent services do not alert if the primary service (e.g. network) already has an issue and it has been silenced.
I would like to open this up for discussion.
If a check takes longer to run than expected, it often exits 2 (critical) with the output "Execution timed out".
This comes from the sensu-spawn gem: https://github.com/sensu/sensu-spawn/blob/master/lib/sensu/spawn.rb#L163
What if we filter these out of the PagerDuty and Jira handlers? After all, when we get this we can't be certain that the check itself failed. In fact, it is almost always not the actual check but a bad or hung EC2 instance, etc.
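To make the proposal concrete, here is a rough sketch of the filtering decision. It is written in Python for illustration only (a real Sensu handler or filter would more likely be Ruby), and it assumes the event carries the result under `check.status` and `check.output`. The function names `is_execution_timeout` and `should_notify` are hypothetical, not part of Sensu or sensu-spawn.

```python
import json

# Output text that sensu-spawn reports when a check exceeds its timeout.
TIMEOUT_MARKER = "Execution timed out"


def is_execution_timeout(event):
    """Return True if the event looks like a check execution timeout.

    Assumes the event is a dict with a "check" key holding "status"
    (exit code) and "output" (text), which is an assumption about the
    event shape rather than something taken from this thread.
    """
    check = event.get("check") or {}
    output = check.get("output") or ""
    return check.get("status") == 2 and TIMEOUT_MARKER in output


def should_notify(event):
    """Hypothetical gate for the PagerDuty and Jira handlers."""
    return not is_execution_timeout(event)


if __name__ == "__main__":
    sample = json.loads(
        '{"check": {"name": "check_disk", "status": 2,'
        ' "output": "Execution timed out"}}'
    )
    # Report whether this sample event would be passed on to the handlers.
    print(should_notify(sample))
```

Matching on both the exit status and the output text keeps genuine critical results flowing to the handlers. The trade-off is that a check which legitimately prints the same string would also be suppressed.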
A positive outcome is that we would cut down on (frequently) unactionable tickets and pages. Also, if we ever decide to do more auto-remediation on these, we could be more confident that auto-remediation is attempted only when it's needed, since running it in response to "Execution timed out" (essentially an unknown exit code from a check) may not always be desirable.
A negative outcome is that we would lose the implicit information about some problems that on-call often derives from seeing "Execution timed out".
Discuss.
@solarkennedy @bobtfish