Kotaimen / awscfncli

Friendly AWS CloudFormation CLI
MIT License

Executing changeset fails due to throttling #68

Open mkielar opened 5 years ago

mkielar commented 5 years ago

This is an effect of #59. I have around 20 stacks that are currently run in parallel in a Jenkins Pipeline. They have no cross-dependencies, so it's much faster to run them concurrently. This means they all start at more or less the same time, and they start failing because of throttling on the CloudFormation API.

The example log:

Uploading to <edited-s3-key>/8eace00f35e63622adf76d987e6389a6.template  5527 / 5527.0  (100.00%)
Uploading <edited-s3-key>/24060b87943ee095687aa3fa75a6c1c2.template  4704 / 4704.0  (100.00%)Successfully packaged artifacts and uploaded to s3://dev-spdra-cf-templates.
ChangeSet Name: <edited-changeset-name>
ChangeSet Type: UPDATE
ChangeSet ARN: arn:aws:cloudformation:eu-west-2:<edited-account-id>:changeSet/<edited-changeset-name>
ChangeSet create complete.
An error occurred (Throttling) when calling the DescribeChangeSet operation (reached max retries: 4): Rate exceeded
Aborted!

It seems cfn-cli handles throttling when DescribeStackEvents is called for logging, but that's it. To make matters worse, this exception is thrown by botocore after it has already attempted max_retries times, with an exponential delay handler (I think), and all of the attempts failed.
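
For illustration, an extra retry layer around a throttled call could look roughly like the sketch below. It is not taken from cfn-cli or botocore: `call_with_backoff` and `is_throttling_error` are hypothetical helpers, and the only AWS-specific assumption is that botocore's `ClientError` exposes the error code under `response["Error"]["Code"]`, with `Throttling` being the code shown in the log above.

```python
import random
import time


def call_with_backoff(operation, is_throttled, max_attempts=8,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff and full jitter
    while is_throttled(exc) is true. Re-raises once max_attempts is used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts or not is_throttled(exc):
                raise
            # The delay cap doubles per attempt: base, 2*base, 4*base, ...
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads out parallel jobs that were throttled together.
            sleep(random.uniform(0, cap))


def is_throttling_error(exc):
    """Hypothetical check: read the AWS error code from a ClientError-like object."""
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    return code in ("Throttling", "ThrottlingException")
```

A caller would wrap the failing request, for example `call_with_backoff(lambda: client.describe_change_set(ChangeSetName=arn), is_throttling_error)`. The random jitter matters here because 20 pipelines that start together would otherwise also retry together.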

There seems to be no proper way out of this, but I'd like you to consider three options:

  1. Allow for parameterization of the max_retries value (example)
  2. Allow disabling DescribeStackEvents calls, while still waiting for the stack to finish (although not sure if that would actually reduce the number of CF API Requests)
  3. Use SNS/SQS for tailing events, instead of DescribeStackEvents. This is a larger topic, but:
     3.1. cfn-cli could either use an SQS address provided by a parameter (assuming users set up everything themselves), or
     3.2. better, cfn-cli could actually provision SNS/SQS for itself and then use it (assuming it's run with a profile that allows for this).
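
To make option 3 a bit more concrete, here is a rough sketch of the consuming side. It assumes the stack is created with a notification topic (the `NotificationARNs` parameter) and that each notification body is a series of `Key='value'` lines such as `ResourceStatus='UPDATE_COMPLETE'`. That format is an assumption worth verifying against a real message. Both function names are hypothetical, and polling SQS and unwrapping the SNS envelope are left out.

```python
import re

# One field per line, e.g. ResourceStatus='UPDATE_COMPLETE' (assumed format).
_FIELD = re.compile(r"^(\w+)='(.*)'$")


def parse_stack_event(message):
    """Parse the Key='value' lines of a CloudFormation notification body."""
    event = {}
    for line in message.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            event[match.group(1)] = match.group(2)
    return event


def is_terminal_for(event, stack_name):
    """True when the event reports a final status for the stack itself,
    rather than for one of the resources inside it."""
    status = event.get("ResourceStatus", "")
    return (
        event.get("ResourceType") == "AWS::CloudFormation::Stack"
        and event.get("StackName") == stack_name
        and event.get("LogicalResourceId") == stack_name
        # *_IN_PROGRESS statuses do not end with either suffix.
        and status.endswith(("_COMPLETE", "_FAILED"))
    )
```

With something like this, waiting for a deployment becomes a matter of reading queue messages until `is_terminal_for` returns true, which costs no CloudFormation API calls at all.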

What do you think?

Kotaimen commented 5 years ago

Hi @mkielar,

There is backoff in tail_stack_events, so cfn-cli won't be throttled when it's working on a reasonably sized "nested stack". Are you deploying 20 stacks in a single account? AWS applies a lot of throttling here and there at the account level. Even if calls to DescribeStackEvents are disabled, you are very likely to hit another invisible wall somewhere else (e.g. running into instance type limits, or being throttled when creating too many DynamoDB tables too fast).

Still, I think 2 is a reasonable workaround, as disabling the stack events will greatly reduce calls to the CloudFormation API. But it may not work as expected, as the "wait until stack deployment complete" feature internally uses a Waiter, which polls the CloudFormation API:

    def wait(self, **kwargs):
        ...
        while True:
            response = self._operation_method(**kwargs)
            num_attempts += 1
            ...
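
Since the waiter polls in a loop, one way to reduce its share of API calls is to poll less often. The sketch below assumes boto3's `WaiterConfig` argument (`Delay` and `MaxAttempts`) and the `stack_update_complete` waiter. The helper names and the default numbers are made up for illustration, and this is not how cfn-cli currently configures its waiters.

```python
import math


def waiter_config(timeout_seconds, delay_seconds):
    """Build a WaiterConfig dict that polls every delay_seconds and
    gives up after roughly timeout_seconds."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return {
        "Delay": delay_seconds,
        "MaxAttempts": max(1, math.ceil(timeout_seconds / delay_seconds)),
    }


def wait_for_update(stack_name, timeout_seconds=3600, delay_seconds=60):
    """Block until the stack update finishes, polling at a slower rate."""
    import boto3  # imported lazily so waiter_config works without boto3

    client = boto3.client("cloudformation")
    waiter = client.get_waiter("stack_update_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig=waiter_config(timeout_seconds, delay_seconds),
    )
```

Doubling the delay halves the number of polling requests per stack, at the cost of noticing completion a little later.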

What kind of resources are you creating in the template (e.g. DDB tables, EC2 instances, etc.)?

mkielar commented 5 years ago

Hi,

what I deploy is more or less:

This is one main stack that consists of 7 to 10 nested stacks, depending on configuration (some nested stacks are only deployed under specific Conditions). This, times 20. As of now, the only throttling we observe is caused by the CloudFormation API. Once started, all the stacks deploy properly.

The 20 is increasing, as the nature of the platform I'm building is to allow standardized deployment of tools that serve different business logic but have the same standard APIs. This means we're going to go from 20 to many more over time.

What I'd appreciate, though, is pt.1 with either pt.2 or pt.3, as the first one actually gives me control over the number of retries, and the other two minimize the risk of throttling.
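
As a sketch of what pt.1 could look like, the retry budget can be passed to the client through botocore's `Config` object. The environment variable `CFN_MAX_ATTEMPTS` and both function names are hypothetical. The `total_max_attempts` and `mode` keys are botocore retry settings, but they need a reasonably recent botocore, so treat this as an assumption about the installed version.

```python
import os


def retry_settings(env=None, default_attempts=10):
    """Read the retry budget from a hypothetical CFN_MAX_ATTEMPTS variable."""
    env = os.environ if env is None else env
    attempts = int(env.get("CFN_MAX_ATTEMPTS", default_attempts))
    if attempts < 1:
        raise ValueError("CFN_MAX_ATTEMPTS must be at least 1")
    # total_max_attempts counts the initial request as well as the retries.
    return {"total_max_attempts": attempts, "mode": "standard"}


def make_cloudformation_client(env=None):
    """Create a CloudFormation client that uses the configured retry budget."""
    import boto3  # imported lazily so retry_settings works without boto3
    from botocore.config import Config

    return boto3.client(
        "cloudformation",
        config=Config(retries=retry_settings(env)),
    )
```

A CLI flag could feed the same dictionary. An environment variable is used here only because it is easy to set per Jenkins job.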

Alternatively, I could use the Jenkins retry step if I could tell whether cfn-cli failed due to throttling when trying to execute the stack, or failed due to the CF stack failing. That should be possible if I could differentiate cfn-cli exit codes on those occasions. Do you have any docs on the exit codes of cfn-cli?
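
Until exit codes are documented, a wrapper could classify the failure from the command output instead. The sketch below keys off the error text seen in the log earlier in this issue ("(Throttling)" and "Rate exceeded"). The exit code 75 and the function names are arbitrary choices for this example, not something cfn-cli defines.

```python
import subprocess
import sys

# Hypothetical exit code reserved for "retry me"; 75 is EX_TEMPFAIL in sysexits.h.
EXIT_THROTTLED = 75

# Substrings taken from the botocore error message quoted in this issue.
THROTTLE_MARKERS = ("(Throttling)", "Rate exceeded")


def classify(returncode, output):
    """Map a finished command to an exit code the pipeline can branch on."""
    if returncode == 0:
        return 0
    if any(marker in output for marker in THROTTLE_MARKERS):
        return EXIT_THROTTLED
    return returncode


def run_and_classify(argv):
    """Run the wrapped command, echo its output, and return the mapped code.
    Output is buffered until the command exits, not streamed."""
    proc = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    sys.stdout.write(proc.stdout)
    return classify(proc.returncode, proc.stdout)
```

A small script could call `sys.exit(run_and_classify(sys.argv[1:]))`, and the pipeline could then read the status (for example with the `sh` step's `returnStatus` option) and retry only when it equals 75. Matching on message text is fragile, so distinct exit codes from cfn-cli itself would still be the better fix.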