Open vvigilante opened 4 weeks ago
Hi @vvigilante, thanks for reaching out. As explained in "How custom resources work", here are my two cents:
Since the update happens while the other event is still in progress, there is no way to send the signal to CFN. So it's waiting for a response and then fails, hence the conflict.
CDK has a CfnResponse module, a library that simplifies sending responses to the custom resource that invoked your Lambda function. The module has a send method that sends a response object to a custom resource by way of an Amazon S3 presigned URL (the ResponseURL). But there is no such module for async operations as far as I know.
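For reference, the response that such a helper sends is a JSON document PUT to the presigned ResponseURL. The following is a minimal sketch of that shape, not the actual module source; `buildResponseBody`, `sendResponse`, and the injected `HttpPut` transport are hypothetical names used for illustration, and the fallback physical ID is an assumption.

```typescript
// Subset of the custom resource request fields needed to build a response.
interface CustomResourceRequest {
  ResponseURL: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  PhysicalResourceId?: string;
}

type Status = 'SUCCESS' | 'FAILED';

// Hypothetical transport: performs an HTTP PUT of `body` to `url`.
type HttpPut = (url: string, body: string) => Promise<void>;

// Build the JSON body that CloudFormation expects at the ResponseURL.
function buildResponseBody(
  event: CustomResourceRequest,
  status: Status,
  data: Record<string, string> = {},
  reason: string = '',
): string {
  return JSON.stringify({
    Status: status,
    // A reason is required when the status is FAILED.
    Reason: reason || (status === 'FAILED' ? 'See the function logs' : 'OK'),
    // Assumption: fall back to the logical ID when no physical ID exists yet.
    PhysicalResourceId: event.PhysicalResourceId ?? event.LogicalResourceId,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
    Data: data,
  });
}

// Send the response through the injected transport.
async function sendResponse(
  event: CustomResourceRequest,
  status: Status,
  put: HttpPut,
  data: Record<string, string> = {},
): Promise<void> {
  await put(event.ResponseURL, buildResponseBody(event, status, data));
}
```

The transport is injected so the sketch runs without network access; in a Lambda function it would be whatever HTTP client you already use to issue the PUT.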
Please feel free to share your thoughts on this and correct me if anything is misunderstood.
cc: @pahud
We have noticed that when we cancel a CloudFormation stack update while an on-going WaiterStateMachine is running, the on-going WaiterStateMachine is not cancelled/stopped and it continues to call our IsComplete handler.
OK so sounds like:
Is it correct?
How did you cancel that update? From the console?
I tried to simulate your use case using the PoC below, which would immediately create/delete the CR but would not complete on resource update.
import * as cdk from 'aws-cdk-lib';
import { CustomResource } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cr from 'aws-cdk-lib/custom-resources';

export class MyStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Define the custom resource provider
    const provider = new cr.Provider(this, 'MyProvider', {
      onEventHandler: new lambda.Function(this, 'OnEventHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Generate a physical resource ID
            const physical_id = 'my-custom-resource-id';
            return { 'PhysicalResourceId': physical_id };
          };
        `),
      }),
      isCompleteHandler: new lambda.Function(this, 'IsCompleteHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Always return false on update, so the update never completes
            const is_ready = event.RequestType !== 'Update';
            return { 'IsComplete': is_ready };
          };
        `),
      }),
      // Keep polling isComplete for up to one hour
      totalTimeout: cdk.Duration.hours(1),
    });

    new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        foo: 'bar',
      },
    });
  }
}
I cancelled the update from the console and got
But the custom resource update operation is actually not cancelled
This is because of the CloudFormation Custom Resource behavior: When you cancel a stack update, CloudFormation will attempt to roll back to the previous stable state. For custom resources, this means:
If the custom resource was being created, CloudFormation will send a "Delete" request to your custom resource provider.
If the custom resource was being updated, which is your use case, CloudFormation will send another "Update" request to revert to the previous state.
If the custom resource was being deleted, CloudFormation will send a "Create" request to recreate the resource.
This means your custom resource has to handle the Update signal from CloudFormation.
Now, how can you tell whether the Update signal is actually meant to "revert" the update? You need to tell from the ResourceProperties.
For example, if I am updating my custom resource from
new CustomResource(this, 'MyResource', {
  serviceToken: provider.serviceToken,
  properties: {},
});
to
new CustomResource(this, 'MyResource', {
  serviceToken: provider.serviceToken,
  properties: {
    foo: 'bar',
  },
});
by adding { foo: 'bar' } to the properties, I will see this event log:
Received event: {
  RequestType: 'Update',
  ServiceToken: ...
  ResponseURL: ...
  LogicalResourceId: 'MyResource',
  PhysicalResourceId: 'my-custom-resource-id',
  ResourceType: 'AWS::CloudFormation::CustomResource',
  ResourceProperties: {
    ServiceToken: ...
    foo: 'bar'
  },
  OldResourceProperties: {
    ServiceToken: ...
  }
}
Now, if you receive the Update signal that CloudFormation sends when the update is cancelled, the two property sets are swapped:
  ResourceProperties: {
    ServiceToken: ...
  },
  OldResourceProperties: {
    ServiceToken: ...
    foo: 'bar'
  }
Your isComplete handler needs to figure out from there whether it's a cancelling Update, and yes, that could be challenging.
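To make the comparison concrete, here is a small sketch that lists which property keys were added, removed, or changed between OldResourceProperties and ResourceProperties. `diffProperties` is a hypothetical helper, not part of the CDK provider framework, and it only compares top-level keys.

```typescript
type Props = Record<string, unknown>;

interface PropertyDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

// Compare top-level keys of the old and new properties.
function diffProperties(oldProps: Props, newProps: Props): PropertyDiff {
  // ServiceToken is present on both sides and says nothing about the change.
  const ignore = new Set<string>(['ServiceToken']);
  const diff: PropertyDiff = { added: [], removed: [], changed: [] };

  for (const key of Object.keys(newProps)) {
    if (ignore.has(key)) continue;
    if (!(key in oldProps)) {
      diff.added.push(key);
    } else if (JSON.stringify(oldProps[key]) !== JSON.stringify(newProps[key])) {
      diff.changed.push(key);
    }
  }
  for (const key of Object.keys(oldProps)) {
    if (!ignore.has(key) && !(key in newProps)) {
      diff.removed.push(key);
    }
  }
  return diff;
}

// Forward update from the example above: foo is reported as added.
console.log(diffProperties({ ServiceToken: 'arn' }, { ServiceToken: 'arn', foo: 'bar' }));
// Cancelled update: the same key is now reported as removed.
console.log(diffProperties({ ServiceToken: 'arn', foo: 'bar' }, { ServiceToken: 'arn' }));
```

The diff alone cannot distinguish a rollback from a genuine update that removes `foo`, which is why the comparison is not sufficient by itself.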
One possible trick is to define a custom revision prop and increment it by 1 every time you update the properties.
For example:
Old properties:
new CustomResource(this, 'MyResource', {
  serviceToken: provider.serviceToken,
  properties: {
    foo: 'bar',
    revision: 1
  },
});
New properties:
new CustomResource(this, 'MyResource', {
  serviceToken: provider.serviceToken,
  properties: {
    foo: 'what ever new value',
    revision: 2
  },
});
Now, if your isComplete handler receives an Update event in which the revision value of OldResourceProperties is greater than the one in ResourceProperties, you know it's a cancelling Update and you should simply return { 'IsComplete': true }.
Check out my PoC:
import * as cdk from 'aws-cdk-lib';
import { CustomResource } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cr from 'aws-cdk-lib/custom-resources';

export class MyStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Define the custom resource provider
    const provider = new cr.Provider(this, 'MyProvider', {
      onEventHandler: new lambda.Function(this, 'OnEventHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Generate a physical resource ID
            const physical_id = 'my-custom-resource-id';
            return { 'PhysicalResourceId': physical_id };
          };
        `),
      }),
      isCompleteHandler: new lambda.Function(this, 'IsCompleteHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Always return true on resource create and delete
            let is_ready = true;
            if (event.RequestType === 'Update') {
              const oldRevision = event.OldResourceProperties.revision;
              const newRevision = event.ResourceProperties.revision;
              if (oldRevision !== undefined && newRevision !== undefined) {
                // Complete only when oldRevision > newRevision (a cancelling Update);
                // a forward update never completes in this PoC
                is_ready = parseFloat(oldRevision) > parseFloat(newRevision);
              }
            }
            return { 'IsComplete': is_ready };
          };
        `),
      }),
      // Keep polling isComplete for up to one hour
      totalTimeout: cdk.Duration.hours(1),
    });

    new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        revision: 1,
        foo: 'bar',
      },
    });
  }
}
and update the properties with
  properties: {
    revision: 2,
    foo: 'whatever newer value',
  },
On cancelling the update from the console or CLI, your custom resource would revert as expected and your whole stack would enter the UPDATE_ROLLBACK_COMPLETE state.
This is very tricky, but since this is how CloudFormation is designed, we need to work around it with tricks like this. Let me know if it works for you.
I'll need a few weeks to verify if the suggestion works, thanks for the help
Hello, I verified, and this doesn't address the problem.
The problem is that, even after the 1->2 update is cancelled and the 2->1 rollback is executed, the "isComplete" handler for the 1->2 update keeps running, well after CloudFormation marks the deployment as cancelled and the rollback as complete.
I attach the logs that show the result of your toy example: You can see that the update never stops running, even after the rollback runs (you can identify the rollback because oldResource revision and resource revision are inverted, it only runs once).
CloudWatch Logs Insights
region: eu-west-1
log-group-names: /aws/lambda/MyStack2-IsCompleteHandler7073F4DA-POOWk0PpCxs3
start-time: -3600s
end-time: 0s
query-string:
filter @message like "Received event"
| fields @timestamp, RequestType,RequestId, OldResourceProperties.revision, ResourceProperties.revision
@timestamp | RequestType | RequestId | OldResourceProperties.revision | ResourceProperties.revision |
---|---|---|---|---|
2024-09-27 13:04:57.583 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:52.376 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:36.651 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:31.439 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:26.142 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:20.939 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:15.670 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:10.391 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:05.231 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:04:00.012 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:54.783 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:49.521 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:44.290 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:39.027 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:33.742 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:28.493 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:23.248 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:18.042 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:17.386 | Update | afb3071b-d366-4359-afe2-e5007954cbf8 | 1727442032391 | 1 |
2024-09-27 13:03:12.751 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:07.509 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:03:02.267 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:57.014 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:51.702 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:46.490 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:41.217 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:35.918 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:30.683 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:25.469 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:20.220 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:14.952 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:09.709 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:02:04.435 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:59.165 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:53.870 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:48.572 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:43.303 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:38.062 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:32.758 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:27.451 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:22.084 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:16.809 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:11.540 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:06.250 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:01:00.957 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:00:55.630 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
2024-09-27 13:00:49.883 | Update | e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 | 1 | 1727442032391 |
It is still running as I type this.
Since CFN actually marks the deployment as complete, I must assume this is a bug, because we shouldn't be running stuff after the deployment is complete.
OK I guess you are right.
When the initial UPDATE happens, a state machine invokes a Lambda function every queryInterval until totalTimeout. CFN is not aware of this, and nothing stops it until the invoked Lambda function returns IsComplete: true. We can consider this a forked async process that periodically checks an external state until success or timeout.
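As a rough illustration of how long such a waiter can keep going, the sketch below computes an upper bound on the number of isComplete invocations, assuming one invocation per query interval and no early success. `maxIsCompleteInvocations` is a hypothetical helper; the 5-second interval is an assumption read off the spacing of the log timestamps earlier in the thread, and the 1-hour timeout is the totalTimeout used in the PoC.

```typescript
// Upper bound on isComplete invocations for a waiter that polls once per
// interval until the total timeout elapses.
function maxIsCompleteInvocations(
  totalTimeoutSeconds: number,
  queryIntervalSeconds: number,
): number {
  if (totalTimeoutSeconds <= 0 || queryIntervalSeconds <= 0) {
    throw new RangeError('timeout and interval must be positive');
  }
  return Math.ceil(totalTimeoutSeconds / queryIntervalSeconds);
}

// PoC settings: 1-hour timeout, polls roughly every 5 seconds.
console.log(maxIsCompleteInvocations(3600, 5)); // 720
// A 10-minute timeout with the same interval.
console.log(maxIsCompleteInvocations(600, 5)); // 120
```

Under these assumptions, shortening the timeout bounds how long an orphaned waiter keeps polling, but it does not stop it early.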
I have 3 solutions off the top of my head
Define a smaller totalTimeout. The default is 30 minutes, which means you can’t stop it until it gives up after 30 minutes. Making it smaller, like 10 minutes, could help, though you still can’t stop it immediately.
Stop it yourself when you detect a cancelling update. I think this should be handled by the framework, not user code, and it could be a p2 feature request that the team might not be able to address immediately. You will need to find out which state machine execution is running and stop it yourself when you detect that.
Since you cancelled that CloudFormation update through human intervention, I believe it's a development or testing operation, which means you could also stop the running state machine from the Step Functions console if it really bothers you. Ideally, I agree it would be nice if the provider framework cleaned it up, but I am not sure this feature would be prioritized.
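The second and third options both come down to finding the running execution and stopping it. Below is a minimal sketch of that logic with the Step Functions calls abstracted behind a hypothetical `StepFunctionsLike` interface so it can run without AWS access; in real code the two methods would presumably wrap the ListExecutions (filtered to RUNNING) and StopExecution API operations. `stopRunningExecutions` and its `keep` parameter are illustrative names, not part of the CDK provider framework.

```typescript
// Hypothetical abstraction over the two Step Functions operations needed.
interface StepFunctionsLike {
  // Returns the ARNs of executions currently in the RUNNING state.
  listRunningExecutions(stateMachineArn: string): Promise<string[]>;
  // Stops a single execution, recording why it was stopped.
  stopExecution(executionArn: string, cause: string): Promise<void>;
}

// Stop every running execution of the waiter state machine, except those
// listed in `keep`. Returns the ARNs that were stopped.
async function stopRunningExecutions(
  sfn: StepFunctionsLike,
  stateMachineArn: string,
  keep: ReadonlySet<string> = new Set<string>(),
): Promise<string[]> {
  const running = await sfn.listRunningExecutions(stateMachineArn);
  const stopped: string[] = [];
  for (const executionArn of running) {
    if (keep.has(executionArn)) continue;
    await sfn.stopExecution(executionArn, 'Stack update was cancelled');
    stopped.push(executionArn);
  }
  return stopped;
}
```

The `keep` set exists so that an execution belonging to the rollback itself is not stopped by mistake; how to identify that execution reliably is left open here.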
I am making it a p2 feat and please help us prioritize with 👍
According to this
The onEvent should have the state machine ARN via WAITER_STATE_MACHINE_ARN_ENV, so technically your onEvent handler would be able to stop it?
I confirmed that onEvent would have the state machine ARN, but it's actually the isComplete handler that needs it.
I guess a small PR that uses addEnvironment() to pass the ARN to the isComplete handler might be enough; the user could then stop the state machine from there.
In respect to the three options above:
@pahud what is the expected timelines for a P2 feature request?
In the meantime, if I understand you correctly, the State Machine ARN is already available to the user OnEvent function, correct? In that case, when the CFN stack rollback runs and calls the user OnEvent function, we can find the running State Machine execution (from the roll-forward CFN stack update) and cancel it. Does that make sense? And can you confirm that the State Machine is started after the user OnEvent function runs? Otherwise we would also need to know the current State Machine execution to avoid cancelling the wrong one.
what is the expected timelines for a P2 feature request?
p2 means the team can't get to it immediately but we welcome PRs.
At the same time, we welcome upvotes 👍 to help us prioritize. Any P2 issue with 20 or more +1s will be automatically upgraded from P2 to P1.
Check here for more details.
Describe the bug
My team developed a custom resource that runs in asynchronous mode. We have noticed that when we cancel a CloudFormation stack update while an on-going WaiterStateMachine is running, the on-going WaiterStateMachine is not cancelled/stopped and it continues to call our IsComplete handler.
In the meantime the Rollback deployment starts and a new instance of the custom resource is called, and this leads to conflicts because the IsComplete handler for the cancelled deployment is still being called.
Regression Issue
Last Known Working CDK Version
No response
Expected Behavior
What we ask is for a change such that all stack resources –including the WaiterStateMachine– stop on a stack cancel.
Current Behavior
When we cancel a CloudFormation stack update the on-going WaiterStateMachine is not cancelled/stopped and it continues to call our IsComplete handler.
Reproduction Steps
Create asynchronous custom resource in a stack, deploy the stack, cancel the deployment.
Possible Solution
What we ask is for a change such that all stack resources –including the WaiterStateMachine– stop on a stack cancel.
This behavior can be enabled by a flag for backwards compatibility.
Alternatively, you can pass a flag that says if the deployment is cancelled into the event for the IsComplete handler, so that the implementer can decide to check that flag and always return "true" if that flag is set (or keep running if that's their jam).
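To illustrate the second proposal, here is a sketch of how an implementer's IsComplete handler might use such a flag. The `DeploymentCancelled` field does not exist in the provider framework today; it is a hypothetical name for the requested feature, as is the `checkExternalState` callback.

```typescript
interface IsCompleteEvent {
  RequestType: 'Create' | 'Update' | 'Delete';
  // Hypothetical: would be set by the framework once the deployment is cancelled.
  DeploymentCancelled?: boolean;
}

interface IsCompleteResponse {
  IsComplete: boolean;
}

// IsComplete handler that stops polling as soon as the (hypothetical)
// cancellation flag is set, and otherwise defers to the real readiness check.
async function isComplete(
  event: IsCompleteEvent,
  checkExternalState: () => Promise<boolean>,
): Promise<IsCompleteResponse> {
  if (event.DeploymentCancelled === true) {
    // The deployment was cancelled: report completion so the waiter ends.
    return { IsComplete: true };
  }
  return { IsComplete: await checkExternalState() };
}
```

An implementer who prefers to keep running after a cancel would simply ignore the flag.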
Additional Information/Context
No response
CDK CLI Version
2.154.1
Framework Version
No response
Node.js Version
18
OS
AL2
Language
TypeScript
Language Version
No response
Other information
No response