aws-cloudformation / cloudformation-coverage-roadmap

The AWS CloudFormation Public Coverage Roadmap
https://aws.amazon.com/cloudformation/
Creative Commons Attribution Share Alike 4.0 International

AWS::CloudFormation - General Capability: Better handling of API limits and throttling. #573

Open Ricapar opened 4 years ago

Ricapar commented 4 years ago

1. AWS::CloudFormation - General Capability: Better handling of API limits and throttling.

This is a general feature/capability request, and not limited to any specific resource type.

2. Scope of request

CloudFormation supports up to 200 resources per Stack under the normal AWS account limits. It is possible to perform a stack update where a large majority (or all) of the resources in the stack have an update that needs to be applied.

Presently, depending on the types of resources being updated, it's possible that CloudFormation will fail to update one or more resources due to self-inflicted API throttling and result in rolling back the entire stack.

Samples:

AWS::SSM::Parameter

I could have a programmatically generated CloudFormation stack that creates up to 200 AWS::SSM::Parameter resources based on output from a CI/CD process. One of the fields in each parameter's value may be a last-updated timestamp or something to that effect:

  ExampleParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Type: String
      Name: !Sub "/${AWS::StackName}/db-connection"
      Value: !Sub |
        {
            "last_updated": "${timestamp}",
            "host": "db.example.com",
            "port": "5432"
        }

# (repeat above x200)

AWS::ServiceCatalog::CloudFormationProvisionedProduct

I could have a stack with a large number of AWS::ServiceCatalog::CloudFormationProvisionedProduct resources that all have an update to a parameter or two, or perhaps all share a common parameter from the stack's input that is changing.

3. Expected behavior

When CloudFormation is scheduled to perform a large number of resource updates, especially when they are all resources of the same type, it should be aware of API limits and throttling and self-manage the rate at which the resources are updated, so that a stack update does not fail because a service returns throttling errors.

In the AWS::SSM::Parameter example above, none of the parameters have explicit (DependsOn) or implicit (!Ref and similar) dependencies on each other. If the ${timestamp} parameter changes, CloudFormation should be smart enough to realize that it shouldn't make 200 calls to the SSM APIs at the same time, as that would cause throttling.

3.1 Current Behavior

I experienced this with the AWS::ServiceCatalog::CloudFormationProvisionedProduct resource most recently, but it has also affected others as well.

CloudFormation will see that all of the resources need updating and proceed to update all of them at the same time (in parallel), as they do not have inter-dependencies. This results in API throttling from the service that provides those resources. CloudFormation and the APIs seem to have their own incremental back-off/retry logic and will continue to try to update those resources. Depending on how long the resources take to update, the throttling may not resolve itself fast enough, and CloudFormation will mark all of the resources as UPDATE_FAILED and then proceed to roll back the rest of the stack.

The failure would not happen if CloudFormation would self-throttle before the backend API throttling even becomes an issue.
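
To illustrate the kind of self-throttling meant here, below is a minimal sketch of pacing independent updates instead of starting them all at once. It is not how CloudFormation works internally; `paced_schedule` and the per-second limit are hypothetical names and values chosen for illustration.

```python
from typing import Dict, Iterable


def paced_schedule(resource_ids: Iterable[str], max_calls_per_second: int) -> Dict[str, int]:
    """Map each resource to the whole second at which its update may start.

    At most max_calls_per_second updates are started within any one second,
    which keeps the caller under a (hypothetical) service rate limit.
    """
    if max_calls_per_second < 1:
        raise ValueError("max_calls_per_second must be at least 1")
    schedule: Dict[str, int] = {}
    for index, resource_id in enumerate(resource_ids):
        # Resources 0..k-1 start in second 0, k..2k-1 in second 1, and so on.
        schedule[resource_id] = index // max_calls_per_second
    return schedule
```

With 200 parameters and an assumed limit of 10 calls per second, the last update would start at second 19 rather than all 200 starting at second 0.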

4. Suggest specific test cases

Make a stack with 50 AWS::ServiceCatalog::CloudFormationProvisionedProduct resources. Trigger a stack update that forces all of them to update.

Make a stack with 200 AWS::SSM::Parameter resources. Trigger a stack update that forces them all to update.

6. Category

Management - CloudFormation

7. Any additional context (optional)

I'm well aware I can work around this issue by setting up a bunch of DependsOn conditions to "trick" CloudFormation into batching together updates of resources that would otherwise be done in bulk. Likewise I could refactor the stacks (easier said than done for resources that can't be imported because they don't support drift detection) into smaller stacks. However, regardless of the work-around options available I don't think these are the right solution.
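
For reference, the DependsOn workaround can be sketched as a small template generator. This is only an illustration under assumptions: `build_template`, the logical IDs, the parameter names, and the batch size are all made up for the example. Each resource depends on the one `batch_size` positions before it, so at most `batch_size` resources can be updated in parallel.

```python
import json
from typing import Any, Dict


def build_template(count: int, batch_size: int, timestamp: str) -> Dict[str, Any]:
    """Build a template dict with `count` SSM parameters in `batch_size` serial lanes."""
    if count < 1 or batch_size < 1:
        raise ValueError("count and batch_size must be at least 1")
    resources: Dict[str, Any] = {}
    for i in range(count):
        resource: Dict[str, Any] = {
            "Type": "AWS::SSM::Parameter",
            "Properties": {
                "Type": "String",
                # Doubled braces emit a literal ${AWS::StackName} for Fn::Sub.
                "Name": {"Fn::Sub": f"/${{AWS::StackName}}/param-{i:03d}"},
                "Value": json.dumps({"last_updated": timestamp, "index": i}),
            },
        }
        if i >= batch_size:
            # Artificial dependency: wait for the resource one "lane step" back.
            resource["DependsOn"] = f"Param{i - batch_size:03d}"
        resources[f"Param{i:03d}"] = resource
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
```

Calling `json.dumps(build_template(200, 10, "2024-01-01T00:00:00Z"))` would give a template whose artificial DependsOn links have nothing to do with real dependencies, which is exactly the drawback of this approach.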

When a stack is well within the out-of-box resource limits for a single stack, CloudFormation should behave properly so as not to self-inflict throttling issues that cause a rollback.

Similarly, as tools like CDK evolve and mature more, having a for-loop that generates a ton of resources won't be unheard of, and the risk of creating a situation where a ton of resources of the same type update simultaneously becomes a lot more common.

glb commented 3 years ago

I've run into this problem as well with Route53. The recommended workaround is to create DependsOn links between the resources to prevent them from being created in parallel. It would be significantly better if CloudFormation limited its parallelism with what it knows about rate limits, so that customers would not need to introduce additional complexity into their stacks to work around incorrect behaviour.

Having DependsOn links for resources that don't actually have any dependencies makes it difficult for people new to the project to understand what the real dependencies are.

benbridts commented 3 years ago

@glb not addressing the core issue here, but for Route53 you could use a ~ResourceRecordSet~ AWS::Route53::RecordSetGroup, that should only be one API call

(disclaimer: I didn't verify this - but it should be straightforward to test)

benkehoe commented 3 years ago

This issue is somewhat related to https://github.com/aws-cloudformation/aws-cloudformation-resource-schema/issues/79 for resources that inherently must have their operations serialized, but is distinct for resources subject to non-inherent account limits.

glb commented 3 years ago

Thanks @benbridts ... we're creating a bunch of hosted zones (AWS::Route53::HostedZone) and resource records (AWS::Route53::RecordSet) within them; there is of course a natural dependency between zones and records, but we're still hitting rate limits (sometimes) when CloudFormation tries to create all the independent things in parallel.

benbridts commented 3 years ago

@glb I wrote that from memory and of course got it wrong. AWS::Route53::RecordSetGroup should batch requests so you don't get rate limited as early, but you might still run into the 5 requests/second rate limit if you have multiple of those running at once.
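
A minimal sketch of what grouping records into a single AWS::Route53::RecordSetGroup could look like when the template is generated from code. The helper `record_set_group` and the record values are hypothetical, and whether this actually reduces the number of API calls is unverified, as noted above.

```python
from typing import Any, Dict, List, Tuple


def record_set_group(
    hosted_zone_logical_id: str,
    records: List[Tuple[str, str, str]],
    ttl: int = 300,
) -> Dict[str, Any]:
    """Return one RecordSetGroup resource holding all (name, type, value) records."""
    return {
        "Type": "AWS::Route53::RecordSetGroup",
        "Properties": {
            # Ref on an AWS::Route53::HostedZone resolves to the hosted zone ID.
            "HostedZoneId": {"Ref": hosted_zone_logical_id},
            "RecordSets": [
                {
                    "Name": name,
                    "Type": record_type,
                    "TTL": str(ttl),  # TTL is a string in the template
                    "ResourceRecords": [value],
                }
                for name, record_type, value in records
            ],
        },
    }
```

This replaces many separate AWS::Route53::RecordSet resources with one resource per hosted zone, so there are fewer independent resources for CloudFormation to process in parallel.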

The original point of "CloudFormation could retry this (more), or serialize" still stands of course. And no workaround will completely solve the issue.

hsiaoa commented 3 years ago

We ran into this issue with AWS::Serverless::HttpApi when our stack is trying to update a good portion of over 150 lambdas, each with their own endpoints. The HTTP API rate limit was hit.

Luckily it doesn't result in a failed deployment, but it somehow got stuck in retry/throttled mode. Our last deploy took almost 3 hours to complete.

The API rate limit was 5 per second for createApiKey & createResource, meaning that if well-coordinated our stack should not take more than 40 seconds in updating HttpApi.
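
The arithmetic behind that estimate, as a rough sketch. It assumes one rate-limited call per endpoint and perfect pacing at the limit, which is a simplification; `min_update_seconds` is a name made up for the example.

```python
import math


def min_update_seconds(api_calls: int, calls_per_second: float) -> int:
    """Lower bound on wall-clock seconds needed to issue api_calls at a fixed rate limit."""
    if api_calls < 0 or calls_per_second <= 0:
        raise ValueError("api_calls must be >= 0 and calls_per_second must be > 0")
    return math.ceil(api_calls / calls_per_second)
```

`min_update_seconds(150, 5)` gives 30 and `min_update_seconds(200, 5)` gives 40, which is orders of magnitude below the almost 3 hours observed.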

Kintar commented 2 years ago

Just commenting to say we're running into the same issue with a stack that sets multiple SSM Parameter Store values. It's intermittent, but very annoying when it happens. CFN should definitely be aware of throttling on these resources and perform its own falloff and retry.
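
As an illustration of the retry behavior being asked for, here is a generic sketch of exponential backoff with jitter. It is not CloudFormation's implementation; `ThrottledError`, `call_with_backoff`, and the delay values are hypothetical stand-ins.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a service's throttling error."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Retry operation on ThrottledError, sleeping a random fraction of a doubling cap."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            # The cap doubles each attempt (0.5, 1, 2, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

The random factor spreads retries out so that many throttled callers do not all retry at the same instant.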

iaroslav-ai commented 2 years ago

Running into this issue when creating log filters.

masgustavos commented 1 year ago

Running into this issue when creating Control Tower Controls. I'm limited to 10 at a time.

RichardBradley commented 11 months ago

I had the same issue

I was able to work around by releasing with the rollback option set to "preserve successfully provisioned resources" and just releasing the same changes multiple times until it succeeded.
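
A small model of why this works, as a sketch only. It does not call CloudFormation; `redeploy_until_done` and `try_update` are hypothetical. Because resources that succeeded are preserved, each repeated release only has to retry the ones that were throttled, so the remaining set can only shrink.

```python
from typing import Callable, Iterable


def redeploy_until_done(
    resource_ids: Iterable[str],
    try_update: Callable[[str], bool],
    max_rounds: int = 10,
) -> int:
    """Repeat a deployment until every resource has succeeded; return the rounds used.

    try_update(resource_id) returns True on success and False if throttled.
    """
    pending = set(resource_ids)
    for round_number in range(1, max_rounds + 1):
        # Keep only the resources that failed this round; successes are preserved.
        pending = {rid for rid in pending if not try_update(rid)}
        if not pending:
            return round_number
    raise RuntimeError(f"{len(pending)} resources still failing after {max_rounds} rounds")
```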

But I agree that this ought to be fixed inside CloudFormation