arachnys / cabot

Self-hosted, easily-deployable monitoring and alerts service - like a lightweight PagerDuty
MIT License

Option to automatically retry certain http status codes (Graphite) #524

Open Exocomp opened 7 years ago

Exocomp commented 7 years ago

I have some rare cases where Graphite returns a 500 Internal Server Error. It is intermittent and happens roughly once a day. The problem is that Cabot treats this as a failure and triggers an alert (Warning, Error or Critical, depending on the check). That is a false positive, because the metric itself has not failed: the next attempt works and all is fine again. It wastes time and is annoying, especially when a critical alert arrives in off hours.

In a perfect world there would never be an HTTP 500 error, but it is unavoidable. Could a feature be added to Cabot to automatically retry in such cases instead of giving up right away?

Exocomp commented 7 years ago

Perhaps something like the following:

dbuxton commented 7 years ago

The more general solution (albeit check-level) to this is that individual checks have the option to debounce based on whether or not the metric itself might be transient. I'd be inclined to use this in your situation.
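As a rough illustration of the debounce idea mentioned above, here is a simplified model, not Cabot's actual code. The function name `is_passing` and its arguments are hypothetical; the assumption is that a check with `debounce = N` is only treated as failing once its most recent `N + 1` results have all failed.

```python
def is_passing(recent_results, debounce=0):
    """Return True if a check should still be treated as passing.

    recent_results: list of booleans, newest first; True means that run succeeded.
    debounce: number of consecutive failures tolerated before the check
              is treated as failing (hypothetical parameter for this sketch).
    """
    # Only look at the newest debounce + 1 results.
    window = recent_results[:debounce + 1]
    # No history yet, or at least one success in the window: still passing.
    return not window or any(window)
```

With `debounce=1`, a single failed run (for example one caused by a stray HTTP 500) followed by a successful history would not flip the check to failing; two failed runs in a row would.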

This sounds more like an issue with graphite than anything else (and your graphite install in particular; I'm not sure we've observed the same problem).

In combination with what we want to do eventually with check plugins I think this functionality would be straightforward to add (see https://github.com/arachnys/cabot/pull/187 for some of the discussion/background to that)

All that said, I don't have a problem with looking at a PR for some general retry logic for the GraphiteCheck code...

Exocomp commented 7 years ago

> The more general solution (albeit check-level) to this is that individual checks have the option to debounce based on whether or not the metric itself might be transient. I'd be inclined to use this in your situation.

I see what you're saying; however, in this case the HTTP status codes are independent of the metric. As I understand it, debounce works off the check frequency. You could use that here, but in some cases it is not desirable to wait for the next check.

Right, it is totally a Graphite hiccup. I send 12M+ requests to it a day, so it throwing a 500 or two is not bad, but I haven't found the reason so far.

dbuxton commented 7 years ago

Sure, as I say, I don't think it's a bad idea to wrap this bit in a retry: https://github.com/arachnys/cabot/blob/2258511c18906afce81f20acc3c930242753d920/cabot/cabotapp/graphite.py#L13
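A minimal sketch of what such a retry wrapper might look like, for illustration only. It is not code from Cabot: the names `get_with_retry` and `RETRYABLE_STATUS_CODES`, the default attempt count, and the backoff values are all assumptions. The only requirement is that `fetch` is a zero-argument callable returning an object with a `status_code` attribute, as a `requests` response has.

```python
import time

# Assumption: which status codes count as transient is a judgment call.
RETRYABLE_STATUS_CODES = frozenset({500, 502, 503, 504})


def get_with_retry(fetch, max_attempts=3, backoff_seconds=0.5,
                   retry_on=RETRYABLE_STATUS_CODES, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    Returns the last response received, so the caller still sees the
    failing status if every attempt was unsuccessful.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        # Stop on any status we do not retry, or when attempts are exhausted.
        if response.status_code not in retry_on or attempt == max_attempts:
            return response
        # Linear backoff: wait a little longer before each new attempt.
        sleep(backoff_seconds * attempt)
```

With `requests`, the call could be wrapped as `get_with_retry(lambda: requests.get(url, params=params))`, where `url` and `params` stand for whatever the Graphite request already uses. Retrying inside a single check run means an intermittent 500 is absorbed without waiting for the next scheduled check.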

Exocomp commented 7 years ago

I was able to resolve my Graphite 500 issue; no errors for the past few days :). This would still be nice to have sometime in the future, if you guys get to it.