aws / aws-toolkit-azure-devops

AWS Toolkit for Azure DevOps
Other
235 stars 100 forks source link

CloudFormationCreateOrUpdateStack intermittently reports that the update failed even though the stack update is still in progress #423

Open Slooz opened 2 years ago

Slooz commented 2 years ago

CloudFormationCreateOrUpdateStack intermittently reports that the update failed even though the stack update is still in progress

Screenshots Compare the timestamps of the Azure Pipelines logs and of the CloudFormation stack screenshots:

2021-07-16T00:04:25.1669564Z ##[section]Starting: Deploy CloudFormation template artifact 2021-07-16T00:04:25.1682103Z ============================================================================== 2021-07-16T00:04:25.1682446Z Task : AWS CloudFormation Create/Update Stack 2021-07-16T00:04:25.1682789Z Description : Creates a new AWS CloudFormation stack or updates the stack if it exists. 2021-07-16T00:04:25.1683056Z Version : 1.11.0 2021-07-16T00:04:25.1683264Z Author : Amazon Web Services 2021-07-16T00:04:25.1684784Z Help : Please refer to AWS CloudFormation User Guide for more details on working with AWS CloudFormation.

More information on this task can be found in the task reference.

Task Permissions

This task requires permissions to call the following AWS service APIs (depending on selected task options, not all APIs may be used):

  • cloudformation:CreateChangeSet
  • cloudformation:CreateStack
  • cloudformation:DeleteChangeSet
  • cloudformation:DescribeChangeSet
  • cloudformation:DescribeStacks
  • cloudformation:DescribeStackResources
  • cloudformation:ExecuteChangeSet
  • cloudformation:UpdateStack

The task may also require permissions to upload your application template to the specified Amazon S3 bucket. 2021-07-16T00:04:25.1686419Z ============================================================================== 2021-07-16T00:04:25.7914108Z Configuring credentials for task 2021-07-16T00:04:25.7917842Z ...configuring AWS credentials from service endpoint 'service_endpoint_id' 2021-07-16T00:04:25.7921389Z ...endpoint defines role-based credentials for role . 2021-07-16T00:04:25.7938858Z Configuring region for task 2021-07-16T00:04:25.7941522Z ...configured to use region us-east-2, defined in task. 2021-07-16T00:04:25.8059834Z Configuring credentials for task 2021-07-16T00:04:25.8105840Z ...configuring AWS credentials from service endpoint 'service_endpoint_id' 2021-07-16T00:04:25.8106878Z ...endpoint defines role-based credentials for role . 2021-07-16T00:04:25.8107542Z Configuring region for task 2021-07-16T00:04:25.8108460Z ...configured to use region us-east-2, defined in task. 2021-07-16T00:04:25.8225072Z Checking existence for stack stack_name 2021-07-16T00:04:26.1316421Z Stack exists, switching to update stack mode 2021-07-16T00:04:26.1322549Z Updating stack with template file 2021-07-16T00:04:26.1909771Z Generated url to template: url 2021-07-16T00:04:26.1914834Z Loading template parameters from task definition 2021-07-16T00:04:26.1968920Z Successfully loaded template parameters 2021-07-16T00:04:26.1985316Z Setting capability CAPABILITY_AUTO_EXPAND for stack 2021-07-16T00:04:26.8707078Z Waiting for stack stack_name to reach update complete status 2021-07-16T00:08:28.5549623Z Stack update request failed with error: 'Stack stack_name failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete'' Error: Stack stack_name failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete' 2021-07-16T00:08:28.5604512Z ##[error]Error: Stack stack_name failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete' 2021-07-16T00:08:28.5620817Z at Object. (/home/vsts/work/_tasks/CloudFormationCreateOrUpdateStack_7ef7cdfa-aa45-42c5-93c8-d7603643dd99/1.11.0/CloudFormationCreateOrUpdateStack.js:55:3093243) 2021-07-16T00:08:28.5621443Z at throw (native) 2021-07-16T00:08:28.5622282Z at n (/home/vsts/work/_tasks/CloudFormationCreateOrUpdateStack_7ef7cdfa-aa45-42c5-93c8-d7603643dd99/1.11.0/CloudFormationCreateOrUpdateStack.js:1:781) 2021-07-16T00:08:28.5623677Z at process._tickDomainCallback (internal/process/next_tick.js:135:7) 2021-07-16T00:08:28.5628101Z ##[section]Finishing: Deploy CloudFormation template artifact

image image image

Your Environment

Slooz commented 2 years ago

Another example in a different AWS account:

2021-07-16T16:06:32.4405816Z ##[section]Starting: Deploy CloudFormation template artifact 2021-07-16T16:06:32.4424947Z ============================================================================== 2021-07-16T16:06:32.4425352Z Task : AWS CloudFormation Create/Update Stack 2021-07-16T16:06:32.4425752Z Description : Creates a new AWS CloudFormation stack or updates the stack if it exists. 2021-07-16T16:06:32.4426117Z Version : 1.11.0 2021-07-16T16:06:32.4426405Z Author : Amazon Web Services 2021-07-16T16:06:32.4427982Z Help : Please refer to AWS CloudFormation User Guide for more details on working with AWS CloudFormation.

More information on this task can be found in the task reference.

Task Permissions

This task requires permissions to call the following AWS service APIs (depending on selected task options, not all APIs may be used):

  • cloudformation:CreateChangeSet
  • cloudformation:CreateStack
  • cloudformation:DeleteChangeSet
  • cloudformation:DescribeChangeSet
  • cloudformation:DescribeStacks
  • cloudformation:DescribeStackResources
  • cloudformation:ExecuteChangeSet
  • cloudformation:UpdateStack

The task may also require permissions to upload your application template to the specified Amazon S3 bucket. 2021-07-16T16:06:32.4429681Z ============================================================================== 2021-07-16T16:06:32.9894627Z Configuring credentials for task 2021-07-16T16:06:32.9896507Z ...configuring AWS credentials from service endpoint 'service_endpoint_id' 2021-07-16T16:06:32.9897962Z ...endpoint defines role-based credentials for role . 2021-07-16T16:06:32.9910212Z Configuring region for task 2021-07-16T16:06:32.9911200Z ...configured to use region us-east-2, defined in task. 2021-07-16T16:06:33.0024566Z Configuring credentials for task 2021-07-16T16:06:33.0055970Z ...configuring AWS credentials from service endpoint 'service_endpoint_id' 2021-07-16T16:06:33.0057056Z ...endpoint defines role-based credentials for role . 2021-07-16T16:06:33.0057719Z Configuring region for task 2021-07-16T16:06:33.0058623Z ...configured to use region us-east-2, defined in task. 2021-07-16T16:06:33.0158285Z Checking existence for stack stack_name 2021-07-16T16:06:33.2776309Z Stack exists, switching to update stack mode 2021-07-16T16:06:33.2779890Z Updating stack with template file 2021-07-16T16:06:33.3272588Z Generated url to template: url 2021-07-16T16:06:33.3277687Z Loading template parameters from task definition 2021-07-16T16:06:33.3311495Z Successfully loaded template parameters 2021-07-16T16:06:33.3312107Z Setting capability CAPABILITY_AUTO_EXPAND for stack 2021-07-16T16:06:33.9641445Z Waiting for stack stack_name to reach update complete status 2021-07-16T16:08:04.6043805Z Stack update request failed with error: 'Stack stack_name failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete'' Error: Stack stack_name failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete' 2021-07-16T16:08:04.6075566Z ##[error]Error: Stack stack_name failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete' 2021-07-16T16:08:04.6090336Z at Object. (/home/vsts/work/_tasks/CloudFormationCreateOrUpdateStack_7ef7cdfa-aa45-42c5-93c8-d7603643dd99/1.11.0/CloudFormationCreateOrUpdateStack.js:55:3093243) 2021-07-16T16:08:04.6090982Z at throw (native) 2021-07-16T16:08:04.6091902Z at n (/home/vsts/work/_tasks/CloudFormationCreateOrUpdateStack_7ef7cdfa-aa45-42c5-93c8-d7603643dd99/1.11.0/CloudFormationCreateOrUpdateStack.js:1:781) 2021-07-16T16:08:04.6092582Z at process._tickDomainCallback (internal/process/next_tick.js:135:7) 2021-07-16T16:08:04.6115151Z ##[section]Finishing: Deploy CloudFormation template artifact

image image image

bryceitoc9 commented 2 years ago

Hi @Slooz

A few questions:

It looks like we're not setting the timeout delay check for stack updates for some reason (only for create) but this means that this should check every 30 seconds for an hour. We can try adding an explicit extended timeout if necessary.

Slooz commented 2 years ago

Hi @bryceitoc9,

We're setting an increased timeout on the Azure Pipelines job from the default of one hour to eight hours via the timeoutInMinutes parameter.

We've also tried to increase the timeout on the task via the task's timeoutInMins parameter, but we're no longer setting this parameter because it doesn't work as described in #374.

15 minutes is typical for our stack updates. We have a lot of Fargate applications, and it takes time to deploy new versions of our applications. However, I was also able to reproduce this with a stack update that only took 30 seconds. I even observed that the task reported a failure shortly after CloudFormation had reported a successful stack update.

Yes, most of our other runs succeed. We noticed that this is a recent issue that has started within the last two weeks. We've been updating stacks for two years, and haven't had issues with stack updates that took up to an hour. It appears that these failures cluster around points in time. For example, we'll see a bunch of failures within the span of five minutes, but otherwise we don't see any issues.

It'd be great if the timeout could be extended. The CloudFormation timeout for ECS deployments is three hours, but the CloudFormationCreateOrUpdateStack limits us to one hour. Sometimes, the CloudFormationCreateOrUpdateStack task will time out and report a failure after one hour, even though though the CloudFormation stack update is still in progress.

I actually thought that the timeout is limited by the wait.stack-update-complete command, which says:

Wait until stack status is UPDATE_COMPLETE. It will poll every 30 seconds until a successful state has been reached. This will exit with a return code of 255 after 120 failed checks.

It doesn't seem like this timeout can be customized.

However, the task timeout should be greater than the max CloudFormation update timeout. Otherwise, the task won't be able to accurately report and track the actual state of the CloudFormation stack, which isn't really acceptable when the goal is to have robust CI/CD pipelines.

Slooz commented 2 years ago

Since reporting this issue, we've also done some investigating.

It looks like the wait.stack-update-complete command calls the DescribeStacks API: image

Additionally, a few weeks ago, we deployed a Lambda to monitor and apply tags to our stacks and resources. This Lambda calls the resource-groups:SearchResources API, which calls the DescribeStacks API. This Lambda runs infrequently, but when it does, it indirectly makes a lot of calls to the DescribeStacks API, which can be seen in CloudTrail.

I checked the timestamps of some of the task failures in CloudTrail, and each time there was a failure, there were a lot of calls to the DescribeStacks API by this Lambda.

I had a pipeline where the task reported a failure three times in a row even though the stack updated successfully each time. Each stack update only took 30 seconds. I looked up the failure timestamps in CloudTrail, and each time, there were a lot of calls by the Lambda to the DescribeStacks API. On the fourth try, when the task finally succeeded, there were no calls from the Lambda to the DescribeStacks API.

It looks like this Lambda might be overloading the DescribeStacks API, which eventually times out or reports a failure to the wait.stack-update-complete command, which causes the task to fail.

I'm not sure if the CloudFormationCreateOrUpdateStack has any control over this. It looks like the task is just receiving and showing the error that comes from AWS, and there doesn't appear to be a distinction between a true stack update failure and a timeout or failure of the DescribeStacks API.

Slooz commented 2 years ago

The stack failure timestamps occur exactly 30 seconds after the task calls the DescribeStacks API, so it really just looks like the DescribeStacks API is timing out.

For the first example in my original post, call to DescribeStacks at 2021-07-16T00:07:58Z, failure at 2021-07-16T00:08:28Z:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "",
        "arn": "",
        "accountId": "",
        "accessKeyId": "",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "",
                "arn": "",
                "accountId": "",
                "userName": ""
            },
            "webIdFederationData": {},
            "attributes": {
                "mfaAuthenticated": "",
                "creationDate": ""
            }
        }
    },
    "eventTime": "2021-07-16T00:07:58Z",
    "eventSource": "cloudformation.amazonaws.com",
    "eventName": "DescribeStacks",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "",
    "userAgent": "",
    "requestParameters": {
        "stackName": "`stack_name`"
    },
    "responseElements": null,
    "requestID": "",
    "eventID": "",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "eventCategory": "Management",
    "recipientAccountId": ""
}

Second example, call to DescribeStacks at 2021-07-16T16:07:34Z, failure at 2021-07-16T16:08:04Z

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "",
        "arn": "",
        "accountId": "",
        "accessKeyId": "",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "",
                "arn": "",
                "accountId": "",
                "userName": ""
            },
            "webIdFederationData": {},
            "attributes": {
                "mfaAuthenticated": "false",
                "creationDate": ""
            }
        }
    },
    "eventTime": "2021-07-16T16:07:34Z",
    "eventSource": "cloudformation.amazonaws.com",
    "eventName": "DescribeStacks",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "",
    "userAgent": "",
    "requestParameters": {
        "stackName": "`stack_name`"
    },
    "responseElements": null,
    "requestID": "",
    "eventID": "",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "eventCategory": "Management",
    "recipientAccountId": ""
}