buildkite / terraform-provider-buildkite

Terraform provider for Buildkite
https://registry.terraform.io/providers/buildkite/buildkite/latest
MIT License

Add rate limit for calls to Buildkite's REST API #477

Open rjackson90 opened 8 months ago

rjackson90 commented 8 months ago

My organization uses Buildkite quite extensively, and we have hundreds of pipelines. Last November, Buildkite announced a new policy of rate-limiting requests against the REST API to 200 requests/minute, as documented here. Our use of this provider to manage Buildkite resources causes us to exceed this rate limit. We've looked at re-organizing our terraform configuration to reduce the number of resources affected by any particular plan operation, but there's only so much we can do without arbitrarily carving up our configuration in awkward ways.

I would like to request that the terraform provider add support for this rate-limit. As a user, I want the provider to comply with Buildkite's policies for API usage so that I don't have to worry about accidentally exceeding limits.

In researching this issue, I found a proposal for core Terraform that has an interesting discussion on this kind of problem. In that issue discussion, the participants appear to agree that rate limits are best handled by the providers in accordance with the policies of the underlying service.

james2791 commented 8 months ago

Hey @rjackson90 👋

Thanks for bringing this up, and that proposal from the Terraform folks is certainly an interesting read! In our provider, the only calls to the REST API are made through the test suite and pipeline (for provider_settings) resource actions, and through the meta data source (if you are using it in your Terraform files/modules).

The same premise also applies to the GraphQL API, albeit with limits based on complexity point calculation rather than requests per minute as on the REST side; the majority of the provider's API requests go to the former. That said, we'll see what is possible with our provider!

mcncl commented 7 months ago

Hey @rjackson90! Could you share a stripped-down version of your config? Are you using the timeouts field? The resources which use the REST API should be more manageable with a custom timeout set, such as:


timeouts {
  read = "60m"
}
BoraxTheClean commented 7 months ago

@mcncl We don't have a timeouts {} block on our buildkite_pipeline resource. That's a good suggestion.

Given that we have around 250 pipelines and our rate limit as of 3/31 will be 200, would you think a read timeout of 90s would be sufficient?

BoraxTheClean commented 7 months ago

I just tried upgrading our terraform provider to your latest version and added a timeout block to my buildkite_pipeline resource, and I got a pre-plan error: Blocks of type "timeouts" are not expected here.

  timeouts {
    read = "90s"
  }
mcncl commented 7 months ago

Hey @BoraxTheClean!

Apologies for the confusion; the timeouts attribute is set on the provider itself, as shown in this example: https://registry.terraform.io/providers/buildkite/buildkite/latest/docs/guides/upgrade-v1#configurable-api-retry-timeouts

provider "buildkite" {
  organization = "buildkite"
  timeouts = {
    read = "90s"
  }
}
BoraxTheClean commented 6 months ago

I've fully set this up on my end, I'll let you know if I have any issues after the 3/31 rate limit drops!

BoraxTheClean commented 5 months ago

I didn't have issues right after 3/31, but I am now getting rate limit issues. I bumped the timeouts to 600s across the board. My plan of about 440 resources is tripping the rate limit after 7 minutes of clock time.

provider "buildkite" {
  organization = "curative-inc"
  timeouts = {
    read   = "600s"
    create = "600s"
    delete = "600s"
    update = "600s"
  }
}

This makes my project virtually unplannable without targeting specific resources.

mcncl commented 5 months ago

Hey @BoraxTheClean!

Yeah, that rate limit will be due to the number of resources being created/planned. The only workaround at the moment is to use the -target flag and name your resource(s). I'd imagine this could be scripted too.

resource "buildkite_cluster" "test" {
  name        = "Primary Cluster"
  description = "Runs the monolith build and deploy"
  emoji       = "🚀"
  color       = "#bada55"
}

resource "buildkite_cluster_queue" "default" {
  cluster_id = buildkite_cluster.test.id
  key        = "macos"
}
terraform plan -target=buildkite_cluster.test -target=buildkite_cluster_queue.default

There is an open issue on terraform to accept globs, but it's 9 years old now so I don't think an alternative solution is going to be available any time soon.

apparentlymart commented 5 months ago

Hello! I ended up here by following a backlink from https://github.com/hashicorp/terraform/issues/31094, where I was discussing the general problem of rate limits and some specific strategies that the Azure provider might follow.

(Also :wave: Hello Buildkite folks! I used to be a happy customer at my old employer, many years ago, and wrote my own Terraform provider while there. I archived it today after learning that this official one exists!)


The main reason I'm here is that I wanted to try to clarify how I'd interpret the general idea I was discussing in the other issue for Buildkite's API design in particular. I see that the provider is primarily using the GraphQL API rather than the REST API, and so I assume that it's the GraphQL resource limits that are in play here, rather than the REST API limits.

I'm referring to Accessing limit details and understand from this that Buildkite uses a time-window-based rate limit strategy, in which case the only two options for reacting to limit exhaustion would be to either fail with an error -- which I assume is how the provider is currently behaving, based on this issue -- or to sleep for RateLimit-Reset seconds and then retry the request.

If this provider were to switch to the second strategy of sleeping until the limit resets, I guess that would mean a worst-case delay of five minutes, which is quite a long time by typical Terraform request standards, but probably still better than an immediate hard failure. Would you agree?

If so, I wonder what you think about altering the provider's API client to notice when the response is a rate limit error, and to sleep until the limit resets and then retry. Does that seem feasible?

(This provider also seems like it would benefit a lot from a means for Terraform to batch together provider requests so that the provider can perform fewer total GraphQL queries, which is something I've wanted to enable for a long time but is tricky with Terraform's execution model. However, since the rate limit model for GraphQL is measured in complexity points rather than as a request count I assume batch requests would only potentially help with performance -- performing fewer API round-trips -- and would not help to avoid hitting the rate limits.)

I hope this is helpful! I am of course not trying to compel you to do anything in particular with this issue, but since much of the discussion about rate limits so far was focused on Microsoft Azure I was curious to see what it might look like to handle this for a different API with a different rate limit strategy.

bill-scalapay commented 4 months ago

Hello there, we've also started running into this issue when trying to apply changes to the Buildkite pipelines for one of our Terraform monorepos. Unfortunately, using the timeouts block in the Buildkite provider configuration does not seem to help as the provider just generates an error when the API returns an HTTP 429 response. Also, the suggestion to target resources does not always help in our case—the resources managed by our root module include a master pipeline that triggers sub-pipelines based on filename diffs, so planning changes for the master pipeline brings the entire Terraform configuration into the plan.

In line with what @apparentlymart said, the AWS Go SDK (used by the AWS Terraform provider) automatically handles HTTP 429 responses (among other errors) by waiting and retrying with an exponential backoff algorithm. It would be great if the Buildkite provider implemented similar functionality, even if much simpler, as currently the provider is unusable for larger Terraform configurations.

BoraxTheClean commented 4 months ago

I just want to second this. Targeting is a partial solution for us, but having to do that is not compatible with our tooling, so it creates more operational burden. It would be ideal and expected for the provider to handle the rate limits for us.

mcncl commented 4 months ago

@bill-scalapay is this the exponential backoff retry that you're referring to: https://github.com/hashicorp/terraform-provider-aws/tree/main/internal/retry?

bill-scalapay commented 4 months ago

@mcncl The doc I read mentioned that the AWS Go SDK automatically handles retries for certain error codes, but maybe they also implement something in the provider. Here's the doc I read: https://github.com/hashicorp/terraform-provider-aws/blob/main/docs/retries-and-waiters.md#default-aws-go-sdk-retries

And here's what looks like the retry code from the AWS Go SDK: https://github.com/aws/aws-sdk-go/blob/v1.53.9/aws/request/retryer.go https://github.com/aws/aws-sdk-go/blob/v1.53.9/aws/client/default_retryer.go

EDIT: The RetryRules function in the second link from the AWS Go SDK code looks to be where the backoff magic happens

apparentlymart commented 4 months ago

The AWS provider's handling of this is rather complicated because it has to deal with the huge variety of different strategies across many different AWS services.

So while I don't mean to say it isn't an interesting reference, I would hope that something for the Buildkite API could be considerably simpler because there is exactly one well-defined rate limiting model. In particular, I wouldn't expect exponential backoff to be needed here because the API already indicates how long the client should wait for the rate limit to have been reset.

One potential way to complicate it would be to write the "requested complexity" rules as code and have the provider notice that a particular request is likely to breach the limit and do some clever request prioritization, like preferring smaller requests over larger ones, but it's not clear to me that this would have any significant advantage over just trying requests as soon as they become pending, noticing a rate limit error, and then waiting RateLimit-Reset seconds before trying again. :thinking:

There is of course some risk here that tightly coupling the provider behavior to the current rate limit model would make it harder to evolve the rate limit model in future, but I trust that you all know better than I do how likely it is that the Buildkite GraphQL API would adopt a different model in future. :grinning: