Open Slooz opened 3 years ago
Another example in a different AWS account:
2021-07-16T16:06:32.4405816Z ##[section]Starting: Deploy CloudFormation template artifact 2021-07-16T16:06:32.4424947Z ============================================================================== 2021-07-16T16:06:32.4425352Z Task : AWS CloudFormation Create/Update Stack 2021-07-16T16:06:32.4425752Z Description : Creates a new AWS CloudFormation stack or updates the stack if it exists. 2021-07-16T16:06:32.4426117Z Version : 1.11.0 2021-07-16T16:06:32.4426405Z Author : Amazon Web Services 2021-07-16T16:06:32.4427982Z Help : Please refer to AWS CloudFormation User Guide for more details on working with AWS CloudFormation.
More information on this task can be found in the task reference.
Task Permissions
This task requires permissions to call the following AWS service APIs (depending on selected task options, not all APIs may be used):
- cloudformation:CreateChangeSet
- cloudformation:CreateStack
- cloudformation:DeleteChangeSet
- cloudformation:DescribeChangeSet
- cloudformation:DescribeStacks
- cloudformation:DescribeStackResources
- cloudformation:ExecuteChangeSet
- cloudformation:UpdateStack
The task may also require permissions to upload your application template to the specified Amazon S3 bucket. 2021-07-16T16:06:32.4429681Z ============================================================================== 2021-07-16T16:06:32.9894627Z Configuring credentials for task 2021-07-16T16:06:32.9896507Z ...configuring AWS credentials from service endpoint '
service_endpoint_id
' 2021-07-16T16:06:32.9897962Z ...endpoint defines role-based credentials for role . 2021-07-16T16:06:32.9910212Z Configuring region for task 2021-07-16T16:06:32.9911200Z ...configured to use region us-east-2, defined in task. 2021-07-16T16:06:33.0024566Z Configuring credentials for task 2021-07-16T16:06:33.0055970Z ...configuring AWS credentials from service endpoint 'service_endpoint_id
' 2021-07-16T16:06:33.0057056Z ...endpoint defines role-based credentials for role . 2021-07-16T16:06:33.0057719Z Configuring region for task 2021-07-16T16:06:33.0058623Z ...configured to use region us-east-2, defined in task. 2021-07-16T16:06:33.0158285Z Checking existence for stackstack_name
2021-07-16T16:06:33.2776309Z Stack exists, switching to update stack mode 2021-07-16T16:06:33.2779890Z Updating stack with template file 2021-07-16T16:06:33.3272588Z Generated url to template:url
2021-07-16T16:06:33.3277687Z Loading template parameters from task definition 2021-07-16T16:06:33.3311495Z Successfully loaded template parameters 2021-07-16T16:06:33.3312107Z Setting capability CAPABILITY_AUTO_EXPAND for stack 2021-07-16T16:06:33.9641445Z Waiting for stackstack_name
to reach update complete status 2021-07-16T16:08:04.6043805Z Stack update request failed with error: 'Stackstack_name
failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete'' Error: Stackstack_name
failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete' 2021-07-16T16:08:04.6075566Z ##[error]Error: Stackstack_name
failed to reach update completion status, error: 'Resource is not in the state stackUpdateComplete' 2021-07-16T16:08:04.6090336Z at Object.(/home/vsts/work/_tasks/CloudFormationCreateOrUpdateStack_7ef7cdfa-aa45-42c5-93c8-d7603643dd99/1.11.0/CloudFormationCreateOrUpdateStack.js:55:3093243) 2021-07-16T16:08:04.6090982Z at throw (native) 2021-07-16T16:08:04.6091902Z at n (/home/vsts/work/_tasks/CloudFormationCreateOrUpdateStack_7ef7cdfa-aa45-42c5-93c8-d7603643dd99/1.11.0/CloudFormationCreateOrUpdateStack.js:1:781) 2021-07-16T16:08:04.6092582Z at process._tickDomainCallback (internal/process/next_tick.js:135:7) 2021-07-16T16:08:04.6115151Z ##[section]Finishing: Deploy CloudFormation template artifact
Hi @Slooz
A few questions:
It looks like we're not setting the timeout delay check for stack updates for some reason (only for create) but this means that this should check every 30 seconds for an hour. We can try adding an explicit extended timeout if necessary.
Hi @bryceitoc9,
We're setting an increased timeout on the Azure Pipelines job from the default of one hour to eight hours via the timeoutInMinutes
parameter.
We've also tried to increase the timeout on the task via the task's timeoutInMins
parameter, but we're no longer setting this parameter because it doesn't work as described in #374.
15 minutes is typical for our stack updates. We have a lot of Fargate applications, and it takes time to deploy new versions of our applications. However, I was also able to reproduce this with a stack update that only took 30 seconds. I even observed that the task reported a failure shortly after CloudFormation had reported a successful stack update.
Yes, most of our other runs succeed. We noticed that this is a recent issue that has started within the last two weeks. We've been updating stacks for two years, and haven't had issues with stack updates that took up to an hour. It appears that these failures cluster around points in time. For example, we'll see a bunch of failures within the span of five minutes, but otherwise we don't see any issues.
It'd be great if the timeout could be extended. The CloudFormation timeout for ECS deployments is three hours, but the CloudFormationCreateOrUpdateStack
limits us to one hour. Sometimes, the CloudFormationCreateOrUpdateStack
task will time out and report a failure after one hour, even though though the CloudFormation stack update is still in progress.
I actually thought that the timeout is limited by the wait.stack-update-complete
command, which says:
Wait until stack status is UPDATE_COMPLETE. It will poll every 30 seconds until a successful state has been reached. This will exit with a return code of 255 after 120 failed checks.
It doesn't seem like this timeout can be customized.
However, the task timeout should be greater than the max CloudFormation update timeout. Otherwise, the task won't be able to accurately report and track the actual state of the CloudFormation stack, which isn't really acceptable when the goal is to have robust CI/CD pipelines.
Since reporting this issue, we've also done some investigating.
It looks like the wait.stack-update-complete
command calls the DescribeStacks
API:
Additionally, a few weeks ago, we deployed a Lambda to monitor and apply tags to our stacks and resources. This Lambda calls the resource-groups:SearchResources
API, which calls the DescribeStacks
API. This Lambda runs infrequently, but when it does, it indirectly makes a lot of calls to the DescribeStacks
API, which can be seen in CloudTrail.
I checked the timestamps of some of the task failures in CloudTrail, and each time there was a failure, there were a lot of calls to the DescribeStacks
API by this Lambda.
I had a pipeline where the task reported a failure three times in a row even though the stack updated successfully each time. Each stack update only took 30 seconds. I looked up the failure timestamps in CloudTrail, and each time, there were a lot of calls by the Lambda to the DescribeStacks
API. On the fourth try, when the task finally succeeded, there were no calls from the Lambda to the DescribeStacks
API.
It looks like this Lambda might be overloading the DescribeStacks
API, which eventually times out or reports a failure to the wait.stack-update-complete
command, which causes the task to fail.
I'm not sure if the CloudFormationCreateOrUpdateStack
has any control over this. It looks like the task is just receiving and showing the error that comes from AWS, and there doesn't appear to be a distinction between a true stack update failure and a timeout or failure of the DescribeStacks
API.
The stack failure timestamps occur exactly 30 seconds after the task calls the DescribeStacks
API, so it really just looks like the DescribeStacks
API is timing out.
For the first example in my original post, call to DescribeStacks
at 2021-07-16T00:07:58Z
, failure at 2021-07-16T00:08:28Z
:
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AssumedRole",
"principalId": "",
"arn": "",
"accountId": "",
"accessKeyId": "",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "",
"arn": "",
"accountId": "",
"userName": ""
},
"webIdFederationData": {},
"attributes": {
"mfaAuthenticated": "",
"creationDate": ""
}
}
},
"eventTime": "2021-07-16T00:07:58Z",
"eventSource": "cloudformation.amazonaws.com",
"eventName": "DescribeStacks",
"awsRegion": "us-east-2",
"sourceIPAddress": "",
"userAgent": "",
"requestParameters": {
"stackName": "`stack_name`"
},
"responseElements": null,
"requestID": "",
"eventID": "",
"readOnly": true,
"eventType": "AwsApiCall",
"managementEvent": true,
"eventCategory": "Management",
"recipientAccountId": ""
}
Second example, call to DescribeStacks
at 2021-07-16T16:07:34Z
, failure at 2021-07-16T16:08:04Z
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AssumedRole",
"principalId": "",
"arn": "",
"accountId": "",
"accessKeyId": "",
"sessionContext": {
"sessionIssuer": {
"type": "Role",
"principalId": "",
"arn": "",
"accountId": "",
"userName": ""
},
"webIdFederationData": {},
"attributes": {
"mfaAuthenticated": "false",
"creationDate": ""
}
}
},
"eventTime": "2021-07-16T16:07:34Z",
"eventSource": "cloudformation.amazonaws.com",
"eventName": "DescribeStacks",
"awsRegion": "us-east-2",
"sourceIPAddress": "",
"userAgent": "",
"requestParameters": {
"stackName": "`stack_name`"
},
"responseElements": null,
"requestID": "",
"eventID": "",
"readOnly": true,
"eventType": "AwsApiCall",
"managementEvent": true,
"eventCategory": "Management",
"recipientAccountId": ""
}
CloudFormationCreateOrUpdateStack
intermittently reports that the update failed even though the stack update is still in progressScreenshots Compare the timestamps of the Azure Pipelines logs and of the CloudFormation stack screenshots:
Your Environment