AxtonGrams / terraform-provider-wiz

Terraform provider for managing Wiz resources
Mozilla Public License 2.0
25 stars 18 forks source link

Introduce an backoff and retry mechanism #96

Open jschoombee opened 1 year ago

jschoombee commented 1 year ago

Description

When using the wiz_users data source and querying for a large number of users (>10000) more often than not, Wiz would return a handful of errors as example below. The users search for would be affected at random.

Believe it would be good practice to introduce some kind of exp. backoff/retry, possible a package like github.com/hashicorp/go-retryablehttp. It could be configured to start with, to retry twice before returning an error Diagnostic condition.

{"message": "oops! an internal error has occurred. for reference purposes, this is your request id: 9aa5f320-f080-4b62-a8c0-c58c05224704","extensions": {"code": "DOWNSTREAM_SERVICE_ERROR","exception": {"message": "oops! an internal error has occurred. for reference purposes, this is your request id: 9aa5f320-f080-4b62-a8c0-c58c05224704"}}}

Potential Provider Configuration

backoff := retryablehttp.DefaultBackoff
backoff.MaxRetries = 2
backoff.Backoff = func(min, max time.Duration, attemptNum int, resp *http.Response) time.Duration {
  // exponential backoff
  backoff := time.Duration(math.Pow(2, float64(attemptNum-1))) * time.Second
  if backoff > max {
      return max
  }
  return backoff
}
retryClient := retryablehttp.NewClient()
retryClient.RetryMax = 2
retryClient.Backoff = backoff

// call the api with the retry handler
resp, err := retryClient.Do(request)

References

Community Note

gramsa49 commented 1 year ago

github.com/hashicorp/go-retryablehttp is currently in use by the provider: https://github.com/AxtonGrams/terraform-provider-wiz/blob/main/internal/config/config.go#L78-L87

I return a standard client, but the client is actually the Retryable HTTP Client.

Early on I had to do this because Wiz would return an HTTP 429 (too many requests).

We can look at additional configuration parameters for the client to possibly handle the condition you outlined in this issue.

The retryable http client is limitedin what error codes it handles. I suspect above you are getting a HTTP 500.

jschoombee commented 1 year ago

I do recall now the 'too many requests' were indeed handled when debugging a while back, didn't check the config, thanks. I did work around the issue by dropping TFE_PARALLELISM down to 3 from 10. But will keep an eye on it and see if it can be scoped to just this condition, if needed.

gramsa49 commented 1 year ago

I haven't explicitly configured a Backoff mechanism in the provider config; you could explore doing that. I don't know what, if any, default

For the data sources, I have not implemented pagination; that could be another strategy to help with this. Smaller chunks, more requests/responses, versus large responses. Many of the APIs have limits on how many results they will return.

jschoombee commented 1 year ago

In chatting with Wiz support, the internal error received by the provider is caused by a 429 downstream on Wiz's backend and they're working on reducing this outlying condition/error rate.

I do agree with you, pagination is probably the first course of action with a performance gain too.