aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

CustomResource Provider: WaiterStateMachine doesn't stop when stack deployment is cancelled. #31541

Open vvigilante opened 4 weeks ago

vvigilante commented 4 weeks ago

Describe the bug

My team developed a custom resource that runs in asynchronous mode. We have noticed that when we cancel a CloudFormation stack update while an on-going WaiterStateMachine is running, the on-going WaiterStateMachine is not cancelled/stopped and it continues to call our IsComplete handler.

In the meantime the Rollback deployment starts and a new instance of the custom resource is called, and this leads to conflicts because the IsComplete handler for the cancelled deployment is still being called.

Regression Issue

Last Known Working CDK Version

No response

Expected Behavior

What we ask is for a change such that all stack resources –including the WaiterStateMachine– stop on a stack cancel.

Current Behavior

When we cancel a CloudFormation stack update the on-going WaiterStateMachine is not cancelled/stopped and it continues to call our IsComplete handler.

Reproduction Steps

Create an asynchronous custom resource in a stack, deploy the stack, then cancel the deployment.

Possible Solution

What we ask is for a change such that all stack resources –including the WaiterStateMachine– stop on a stack cancel.

This behavior can be enabled by a flag for backwards compatibility.

Alternatively, you can pass a flag that says if the deployment is cancelled into the event for the IsComplete handler, so that the implementer can decide to check that flag and always return "true" if that flag is set (or keep running if that's their jam).

Additional Information/Context

No response

CDK CLI Version

2.154.1

Framework Version

No response

Node.js Version

18

OS

AL2

Language

TypeScript

Language Version

No response

Other information

No response

khushail commented 3 weeks ago

Hi @vvigilante, thanks for reaching out. As explained in "How custom resources work", here are my 2 cents:

  1. The custom resource provider processes the AWS CloudFormation request and returns a response of SUCCESS or FAILED to the pre-signed URL. AWS CloudFormation waits and listens for a response in the pre-signed URL location.
  2. After getting a SUCCESS response, AWS CloudFormation proceeds with the stack operation. If a FAILED response or no response is returned, the operation fails. So the isComplete handler is called to check whether the async operation has completed.

Since the update happens while the other event is still in progress, there is no way to send the signal to CFN. So it's waiting for a response and then fails, hence the conflict.

CDK has a CfnResponse module, a library that simplifies sending responses to the custom resource that invoked your Lambda function. The module has a send method that sends a response object to a custom resource by way of an Amazon S3 presigned URL (the ResponseURL). But as far as I know there is no such module for async operations.
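For reference, the response described above is a JSON document sent with an HTTP PUT to the ResponseURL. Below is a minimal TypeScript sketch of that shape. The helper names (buildResponse, sendResponse, HttpPut) are made up for illustration, and the HTTP transport is injected rather than tied to a specific client. With the CDK provider framework, the framework sends this response on your behalf, so this only illustrates the protocol.

```typescript
// Fields of the incoming request that the response has to echo back.
interface CustomResourceRequest {
  ResponseURL: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
}

// Body that CloudFormation expects at the pre-signed URL.
interface CustomResourceResponse {
  Status: 'SUCCESS' | 'FAILED';
  Reason?: string; // expected when Status is FAILED
  PhysicalResourceId: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  Data?: Record<string, string>;
}

// Hypothetical helper: builds the response body for a given request.
function buildResponse(
  request: CustomResourceRequest,
  status: 'SUCCESS' | 'FAILED',
  physicalResourceId: string,
  reason?: string,
): CustomResourceResponse {
  return {
    Status: status,
    Reason: reason,
    PhysicalResourceId: physicalResourceId,
    StackId: request.StackId,
    RequestId: request.RequestId,
    LogicalResourceId: request.LogicalResourceId,
  };
}

// Hypothetical transport type: any function that performs an HTTP PUT.
type HttpPut = (url: string, body: string) => Promise<void>;

// Sends the response to the pre-signed URL using the injected transport.
async function sendResponse(
  request: CustomResourceRequest,
  response: CustomResourceResponse,
  put: HttpPut,
): Promise<void> {
  await put(request.ResponseURL, JSON.stringify(response));
}
```

If CloudFormation never receives such a PUT for a given RequestId, it keeps waiting until its own timeout, which is why an orphaned async operation has no way to report back.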

Please feel free to share your thoughts on this and correct me if anything is misunderstood.

cc: @pahud

pahud commented 3 weeks ago

We have noticed that when we cancel a CloudFormation stack update while an on-going WaiterStateMachine is running, the on-going WaiterStateMachine is not cancelled/stopped and it continues to call our IsComplete handler.

OK so sounds like:

  1. You have a stack that has a custom resource using provider framework with isComplete handler.
  2. You have created it with no error.
  3. When you update this stack and cancel the update, the WaiterStateMachine is not cancelled/stopped.

Is it correct?

How did you cancel that update? From the console?

image

pahud commented 3 weeks ago

I tried to simulate your use case using the PoC below, which immediately creates/deletes the CR but never completes on resource update.

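import * as cdk from 'aws-cdk-lib';
import { CustomResource } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cr from 'aws-cdk-lib/custom-resources';

export class MyStack extends cdk.Stack {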
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Define the custom resource provider
    const provider = new cr.Provider(this, 'MyProvider', {
      onEventHandler: new lambda.Function(this, 'OnEventHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Generate a physical resource ID
            const physical_id = 'my-custom-resource-id';
            return { 'PhysicalResourceId': physical_id };
          };
        `),
      }),
      isCompleteHandler: new lambda.Function(this, 'IsCompleteHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Never report completion for an Update, so the waiter keeps polling
            const is_ready = event.RequestType !== 'Update';
            return { 'IsComplete': is_ready };
          };
        `),
      }),
      totalTimeout: cdk.Duration.hours(1),
    });

    new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        foo: 'bar',
      },
    });
  }
}

I cancelled the update from the console and got

image

But the custom resource update operation is not actually cancelled:

image

This is because of the CloudFormation Custom Resource behavior: When you cancel a stack update, CloudFormation will attempt to roll back to the previous stable state. For custom resources, this means:

If the custom resource was being created, CloudFormation will send a "Delete" request to your custom resource provider.

If the custom resource was being updated, which is your use case, CloudFormation will send another "Update" request to revert to the previous state.

If the custom resource was being deleted, CloudFormation will send a "Create" request to recreate the resource.

This means your custom resource has to handle the Update signal from CloudFormation.

Now, how do you tell whether the Update signal is actually a "revert" of the update? You need to compare ResourceProperties with OldResourceProperties.

For example, if I am updating my custom resource from

new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {},
    });

to

new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        foo: 'bar',
      },
    });

by adding { foo: 'bar' } to the properties, I will see this event log:

Received event: {
  RequestType: 'Update',
  ServiceToken: ...
  ResponseURL: ...
  LogicalResourceId: 'MyResource',
  PhysicalResourceId: 'my-custom-resource-id',
  ResourceType: 'AWS::CloudFormation::CustomResource',
  ResourceProperties: {
    ServiceToken: ...
    foo: 'bar'
  },
  OldResourceProperties: {
    ServiceToken: ...
  }
}

Now if the Update signal from CloudFormation is caused by the cancellation, you will receive this instead:

  ResourceProperties: {
    ServiceToken: ...
  },
  OldResourceProperties: {
    ServiceToken: ...
    foo: 'bar'
  }

Your isComplete handler needs to figure out if it's a cancelling Update from there and yes it could be challenging.

One possible trick is to define a custom revision prop and increment it by 1 every time you update the properties.

For example:

Old properties:

new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        foo: 'bar',
        revision: 1
      },
    });

New properties

new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        foo: 'what ever new value',
        revision: 2
      },
    });

Now if your isComplete handler receives an Update event in which the revision value in OldResourceProperties is greater than the one in ResourceProperties, you know it's a cancelling Update and you should simply return { 'IsComplete': true }.

Check out my PoC:

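import * as cdk from 'aws-cdk-lib';
import { CustomResource } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cr from 'aws-cdk-lib/custom-resources';

export class MyStack extends cdk.Stack {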
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Define the custom resource provider
    const provider = new cr.Provider(this, 'MyProvider', {
      onEventHandler: new lambda.Function(this, 'OnEventHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);
            // Generate a physical resource ID
            const physical_id = 'my-custom-resource-id';
            return { 'PhysicalResourceId': physical_id };
          };
        `),
      }),
      isCompleteHandler: new lambda.Function(this, 'IsCompleteHandler', {
        runtime: lambda.Runtime.NODEJS_LATEST,
        handler: 'index.handler',
        code: lambda.Code.fromInline(`
          exports.handler = async (event) => {
            console.log('Received event:', event);

            // always return true on resource create and delete
            let is_ready = true;

            if (event.RequestType === 'Update') {
              const oldRevision = event.OldResourceProperties.revision;
              const newRevision = event.ResourceProperties.revision;

              if (oldRevision !== undefined && newRevision !== undefined) {
                // always true when oldRevision > newRevision
                is_ready = parseFloat(oldRevision) > parseFloat(newRevision);
              }
            }

            return { 'IsComplete': is_ready };
          };
        `),
      }),
      totalTimeout: cdk.Duration.hours(1),
    });

    new CustomResource(this, 'MyResource', {
      serviceToken: provider.serviceToken,
      properties: {
        revision: 1,
        foo: 'bar',
      },
    });
  }
}

and update the property with

properties: {
        revision: 2,
        foo: 'whatever newer value',
      },

On cancelling the update from console or CLI, your custom resource would revert as expected and your whole stack would enter the UPDATE_ROLLBACK_COMPLETE state.

This is very tricky, but as this is how CloudFormation is designed, we need to work around it with tips like that. Let me know if it works for you.

vvigilante commented 3 weeks ago

I'll need a few weeks to verify if the suggestion works, thanks for the help

vvigilante commented 3 weeks ago

Hello, I verified, and this doesn't address the problem.

The problem is that, even after the update 1->2 is cancelled and the rollback 2->1 is executed, the isComplete handler for the 1->2 update keeps running, well after CloudFormation marks the deployment as cancelled and the rollback as complete.

I attach the logs that show the result of your toy example: You can see that the update never stops running, even after the rollback runs (you can identify the rollback because oldResource revision and resource revision are inverted, it only runs once).

CloudWatch Logs Insights
region: eu-west-1
log-group-names: /aws/lambda/MyStack2-IsCompleteHandler7073F4DA-POOWk0PpCxs3
start-time: -3600s
end-time: 0s
query-string:

  filter @message like "Received event"
| fields @timestamp, RequestType,RequestId, OldResourceProperties.revision, ResourceProperties.revision

@timestamp RequestType RequestId OldResourceProperties.revision ResourceProperties.revision
2024-09-27 13:04:57.583 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:52.376 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:36.651 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:31.439 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:26.142 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:20.939 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:15.670 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:10.391 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:05.231 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:04:00.012 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:54.783 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:49.521 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:44.290 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:39.027 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:33.742 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:28.493 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:23.248 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:18.042 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:17.386 Update afb3071b-d366-4359-afe2-e5007954cbf8 1727442032391 1
2024-09-27 13:03:12.751 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:07.509 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:03:02.267 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:57.014 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:51.702 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:46.490 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:41.217 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:35.918 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:30.683 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:25.469 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:20.220 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:14.952 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:09.709 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:02:04.435 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:59.165 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:53.870 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:48.572 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:43.303 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:38.062 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:32.758 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:27.451 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:22.084 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:16.809 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:11.540 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:06.250 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:01:00.957 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:00:55.630 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391
2024-09-27 13:00:49.883 Update e4662b2e-d0f4-4cf6-8a94-4c9e60f52718 1 1727442032391

It is still running as I type this.

Since CFN actually marks the deployment as complete, I must assume this is a bug, because we shouldn't be running stuff after the deployment is complete.

pahud commented 3 weeks ago

OK I guess you are right.

When the initial UPDATE happens, a state machine is started that invokes a Lambda function every queryInterval until totalTimeout is reached. CFN is not aware of this, and nothing stops it until the invoked Lambda function returns true. We can think of it as a forked async process that periodically checks an external state until success or timeout.
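To put a number on how long an orphaned waiter can keep polling, here is a rough TypeScript sketch. It assumes the waiter retries at a fixed queryInterval with no backoff, and the 5-second interval is an assumption inferred from the roughly 5-second spacing in the logs above. The function name maxIsCompletePolls is made up for illustration.

```typescript
// Upper bound on isComplete invocations for one waiter execution,
// assuming a fixed polling interval with no backoff (an assumption).
function maxIsCompletePolls(totalTimeoutSeconds: number, queryIntervalSeconds: number): number {
  if (queryIntervalSeconds <= 0) {
    throw new Error('queryIntervalSeconds must be positive');
  }
  return Math.ceil(totalTimeoutSeconds / queryIntervalSeconds);
}

// PoC above: totalTimeout of 1 hour, assumed 5 s interval.
const pocPolls = maxIsCompletePolls(60 * 60, 5); // 720

// 30 min timeout, assumed 5 s interval.
const defaultPolls = maxIsCompletePolls(30 * 60, 5); // 360
```

Under these assumptions, cancelling the PoC update right after it starts can leave up to about 720 further isComplete calls for the abandoned request.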

I have 3 solutions off the top of my head

  1. Define a smaller totalTimeout. The default is 30min, which means you can't stop it until it gives up after 30min. Making it smaller, like 10min, could help, though you still can't stop it immediately.

  2. Stop it yourself when you detect a cancelling update. I think this should be handled by the framework, not user code, and could be a p2 feature request, which the team might not be able to address immediately. You will need to find out which state machine is running and stop it yourself when you detect that.

  3. You cancelled that CloudFormation update through human intervention, so I believe it's a development or testing operation, which means you could also stop the running state machine from the Step Functions console if that really bothers you. Ideally, I agree it would be nice if the provider framework could clean it up, but I am not sure this feature would be prioritized.

I am making it a p2 feat and please help us prioritize with 👍

pahud commented 3 weeks ago

According to this

https://github.com/aws/aws-cdk/blob/8318e7968c441ad565139e5faa72977a95099cd2/packages/aws-cdk-lib/custom-resources/lib/provider-framework/provider.ts#L235C1-L236C1

The onEvent handler should have the state machine ARN via WAITER_STATE_MACHINE_ARN_ENV, so technically your onEvent handler would be able to stop it?
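A rough sketch of what "stop it from the handler" could look like, in TypeScript. Everything here is illustrative: ExecutionControl and stopRunningWaiters are hypothetical names, and the Step Functions calls sit behind a small interface so the sketch does not depend on a particular SDK. With the AWS SDK for JavaScript v3, the two methods would roughly map to ListExecutionsCommand (with statusFilter RUNNING) and StopExecutionCommand from @aws-sdk/client-sfn. It also assumes the handler can actually read the state machine ARN and that its role is allowed to list and stop executions, neither of which is confirmed here.

```typescript
// Hypothetical abstraction over the two Step Functions operations needed.
interface ExecutionControl {
  listRunningExecutionArns(stateMachineArn: string): Promise<string[]>;
  stopExecution(executionArn: string, cause: string): Promise<void>;
}

// Stops every running execution of the waiter state machine, except an
// optional execution that should be kept (for example the rollback's own
// waiter, if it has already been started). Returns the stopped ARNs.
async function stopRunningWaiters(
  control: ExecutionControl,
  stateMachineArn: string,
  keepExecutionArn?: string,
): Promise<string[]> {
  const running = await control.listRunningExecutionArns(stateMachineArn);
  const stopped: string[] = [];
  for (const executionArn of running) {
    if (executionArn === keepExecutionArn) {
      continue; // leave the execution we were asked to keep
    }
    await control.stopExecution(executionArn, 'Stack update cancelled; rollback in progress');
    stopped.push(executionArn);
  }
  return stopped;
}
```

The keepExecutionArn parameter exists because it is not established here whether the rollback's own waiter is already running when the handler is invoked. Without knowing that, stopping everything risks cancelling the wrong execution.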

pahud commented 3 weeks ago

I confirmed that onEvent has the state machine ARN, but it's actually the isComplete handler that needs it.

image

I guess a small PR that uses addEnvironment() to pass the ARN to the isComplete handler might be enough; the user could then stop the state machine from there.

mmoanis commented 1 week ago

With respect to the three options above:

  1. We cannot define a smaller timeout, as we expect the resource to take a long time.
  2. The case is an emergency abort of a deployment. In this case we want to stop the ongoing stack update and have the stack roll back, reverting the resource to its original state. I agree this can be done manually in emergencies, but I would prefer to have emergency actions automated in safe, programmatic ways rather than rely on manual actions and knowledge of the underlying implementation details.

@pahud what is the expected timelines for a P2 feature request?

In the meantime, if I understand you correctly, the State Machine ARN is already available to the user OnEvent function, correct? In that case, when the CFN stack rollback runs and calls the user OnEvent function, we can find the running State Machine execution (from the roll-forward CFN stack update) and cancel it. Does that make sense? And can you confirm that the State Machine is started after the OnEvent user function has run? Otherwise we would also need to know the current State Machine execution, to avoid cancelling the wrong one.

pahud commented 1 week ago

what is the expected timelines for a P2 feature request?

p2 means the team can't get to it immediately but we welcome PRs.

At the same time, we welcome upvotes 👍 to help us prioritize. Any P2 issue with 20 or more +1s will be automatically upgraded from P2 to P1.

Check here for more details.