AcalephStorage / consul-alerts

A simple daemon to send notifications based on Consul health checks
GNU General Public License v2.0
827 stars 191 forks

Enhance PagerDuty notifier with retry configuration #241

Open StephenWeber opened 5 years ago

StephenWeber commented 5 years ago

Current behavior: when a request to the PagerDuty API fails, the client by default does not retry it; consul-alerts simply logs the error.

Preferred behavior: the underlying library that provides PagerDuty support has built-in logic to retry requests when an error occurs. By default it performs no retries, but it can be configured to do so. We would like to provide a way to optionally configure the client with these settings.

Primary benefit for consul-alerts users: PagerDuty returns an error when they rate-limit your service key. Their documentation says that the correct thing to do is to retry the request after a period of time.

The client fields in question are:

client.MaxRetry = 5 // set max retries to 5 before failing. Defaults to 0.
client.RetryBaseInterval = 5 // set first retry to 5s. Defaults to 10s.

We suggest two optional fields in the PagerDuty notifier configs that would look like this in Consul:

consul-alerts/config/notifiers/pagerduty/max-retry
consul-alerts/config/notifiers/pagerduty/retry-base-interval
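
For illustration, here is one way those keys might be seeded once the feature exists, using the official Consul Go API client (github.com/hashicorp/consul/api); the key paths are the ones proposed above, the values are placeholders, and none of this is part of the consul-alerts change itself.

// Hypothetical example: writing the proposed keys to Consul's KV store.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	// Placeholder values: retry up to 5 times, starting with a 10s delay.
	pairs := map[string]string{
		"consul-alerts/config/notifiers/pagerduty/max-retry":           "5",
		"consul-alerts/config/notifiers/pagerduty/retry-base-interval": "10",
	}
	for key, value := range pairs {
		if _, err := kv.Put(&api.KVPair{Key: key, Value: []byte(value)}, nil); err != nil {
			log.Fatalf("failed to write %s: %v", key, err)
		}
	}
}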

Since existing customer configurations don't have those keys, the code's default should be to retain the existing behavior. That means that if those keys are not present, the PagerDuty client should not retry on errors.

The basic components of our suggested change are:

  1. adding the stated keys as optional parts of the PagerDutyNotifier struct and its unmarshaler
  2. setting either of those values on the client if their key is present
  3. adding a test for PagerDuty configuration with and without those values present.

I have some example code that I could also paste here to illustrate these steps, or could supply as a PR.
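
To make those steps concrete, here is a minimal sketch of what the change could look like. The struct fields, JSON tags, and the gopherduty constructor used here are assumptions based on the snippets in this issue, not the project's actual source:

// Hypothetical sketch of steps 1-2: two optional fields on the notifier
// struct, applied to the gopherduty client only when they were configured.
package notifier

import "github.com/darkcrux/gopherduty"

type PagerDutyNotifier struct {
	ServiceKey        string `json:"service-key"`
	Enabled           bool   `json:"enabled"`
	MaxRetry          int    `json:"max-retry"`           // 0 (key absent) keeps the current no-retry behavior
	RetryBaseInterval int    `json:"retry-base-interval"` // seconds; 0 (key absent) keeps the library default
}

// newClient builds the gopherduty client, opting into retries only when
// the corresponding keys were present in the config.
func (pd *PagerDutyNotifier) newClient() *gopherduty.PagerDuty {
	client := gopherduty.NewClient(pd.ServiceKey)
	if pd.MaxRetry > 0 {
		client.MaxRetry = pd.MaxRetry
	}
	if pd.RetryBaseInterval > 0 {
		client.RetryBaseInterval = pd.RetryBaseInterval
	}
	return client
}

A test for step 3 would then parse configurations with and without the two keys and assert that the client fields stay at their defaults in the first case and take the configured values in the second.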

fusiondog commented 5 years ago

Sounds like a good improvement. I would appreciate the PR, and it seems like it should be straightforward to review and approve.

StephenWeber commented 5 years ago

@fusiondog comprehensive PR attached -- in addition to unit tests, I've also used a local instance to try out variations on set/unset values. I'm up for any adjustments to better match local source or config conventions.

nh2 commented 5 years ago

Is it possible to make it retry infinitely?

StephenWeber commented 5 years ago

@nh2 the way the underlying gopherduty client is written, it cannot retry indefinitely. However, you can set a high max-retry and a reasonable retry-base-interval, and the delays will back off exponentially.

For instance, max-retry=15 and retry-base-interval=10 will start with a 10-second delay for the first retry, and will end up with 3.8 days between the 14th and 15th retry.

My own experiments show that as long as a retry happens after 60 seconds, it's likely to succeed. It's hard to determine what an appropriate number of retries would be, though the current rate limit is about 150 requests per minute - so as long as you keep retrying for more than (number of queued requests / 150) minutes, you'll likely be successful. For example, with 600 backed-up requests you'd want the retry schedule to stretch past 4 minutes.
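
As a rough illustration of how such a schedule grows, the snippet below assumes the delay simply doubles from retry-base-interval on each attempt; gopherduty's actual backoff formula may differ, and the day figures quoted earlier come from its implementation rather than from this sketch.

// Rough illustration only: prints an exponential backoff schedule,
// assuming each retry waits twice as long as the previous one, starting
// from retry-base-interval. Not gopherduty's actual implementation.
package main

import (
	"fmt"
	"time"
)

func main() {
	base := 10 * time.Second // retry-base-interval
	maxRetry := 15           // max-retry

	delay := base
	var total time.Duration
	for attempt := 1; attempt <= maxRetry; attempt++ {
		total += delay
		fmt.Printf("retry %2d after %v (total elapsed %v)\n", attempt, delay, total)
		delay *= 2
	}
}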

StephenWeber commented 5 years ago

@fusiondog @darkcrux there is also a PR against the gopherduty library that will allow 403 (the rate-limit response) to be treated as a retryable error: https://github.com/darkcrux/gopherduty/pull/3

nh2 commented 5 years ago

@StephenWeber Thanks for the detail, that is good to know; infinite retry would still be best for simplicity so that we don't have to rely on specific timings or logic.