aws-cloudformation / cloudformation-resource-schema

The CloudFormation Resource Schema defines the shape and semantic for resources provisioned by CloudFormation. It is used by provider developers using the CloudFormation RPDK.
Apache License 2.0
90 stars 38 forks source link

Allow replacementStrategy in the schema #86

Closed wenyhu closed 3 years ago

wenyhu commented 4 years ago

Description of changes: Allow replacement Strategy in the schema. This allows immutable singleton resources to update in the order of [delete, create].

Valid strategies are - [create, delete] - [delete, create]

Right now we always try to create then delete for immutable updates, allowing this replacementStrategy in the schema would allow update behavior be decided by the customer. Most useful singleton update that can have only one instance.

There should be be NO change in behavior for current customers as not specifying a strategy would default to [create, delete], which is how it behaves now.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

benbridts commented 4 years ago

Out of curiosity.

Since this is a setting that's supplied by the provider, are there any plans to allow a consumer to restrict the usage of this strategy? Currently CloudFormation consumers assume that every resource will only be deleted if all creates succeed.

Eg. by having a capability CAPABILITY_DELETE_BEFORE_CREATE.

On the CloudFormation Engine side (independent of the schema), it might make sense to do these resources as late as possible so that if another resource fails the delete hasn't happened yet

wenyhu commented 4 years ago

Out of curiosity.

Since this is a setting that's supplied by the provider, are there any plans to allow a consumer to restrict the usage of this strategy? Currently CloudFormation consumers assume that every resource will only be deleted if all creates succeed.

Eg. by having a capability CAPABILITY_DELETE_BEFORE_CREATE.

On the CloudFormation Engine side (independent of the schema), it might make sense to do these resources as late as possible so that if another resource fails the delete hasn't happened yet

Makes sense. So if a customer do not specifically opt in for delete first by adding a replacementStrategy, we will make the [create,delete] default and not change their current behavior (create then delete). Let me discuss further with the team on restrictions on this and loop back. Thanks for the helpful suggestion!

benbridts commented 4 years ago

So if a customer do not specifically opt in for delete first by adding a replacementStrategy, we will make the [create,delete] default and not change their current behavior (create then delete)

In this case you mean the customer when he's writing the resource, right?

It might makes sense to differentiate between the customer writing the resource (who explicitly changes the spec to override the default) and the customer using the resource (who can add a capability to a action). In a lot of cases this will be different people.

aygold92 commented 4 years ago

here we are talking about the behavior of the resource itself. For singleton resources (resources where only 1 of them can exist), we obviously cannot take the normal strategy of create first, then delete, since we cannot create a second resource.

There are some resources that CFN currently supports that behave this way, such as AWS::OpsWorks::ElasticLoadBalancerAttachment. During an update, CFN will delete and then create.

In order to have consistent behavior, we want to make the contract more explicit for Uluru, so we do not allow Update handlers to create a new physical resource. Introducing a value into the schema would allow us to more predictably and consistently support this use case.

Since this is a setting that's supplied by the provider, are there any plans to allow a consumer to restrict the usage of this strategy? Currently CloudFormation consumers assume that every resource will only be deleted if all creates succeed.

I think this is a valid concern. It is possible that the CFN could start the update -- it first deletes the old resource but then fails to create the new resource (or another resource fails and cancels the update before the creation can happen). The stack would then start rolling back and would eventually create the old resource again, but it might then be while before it gets to that point, depending on the chain of dependent resources. And this could potentially cause an unwanted outage. That being said, I don't see any way around this risk. How else would they do it? What would be the behavior of the resource if they didn't opt in?

benbridts commented 4 years ago

There are some resources that CFN currently supports that behave this way, such as AWS::OpsWorks::ElasticLoadBalancerAttachment. During an update, CFN will delete and then create Interesting! I will need to start adding more footnotes to my "this is how CFN works" explanations.

In order to have consistent behavior, we want to make the contract more explicit for Uluru, so we do not allow Update handlers to create a new physical resource. Introducing a value into the schema would allow us to more predictably and consistently support this use case

To be clear, I think that's a great idea. I asked mostly out of interests of how this would manifest to the end (template) user, who might - like me, wrongly - assume that a delete never happens in the update in progress phase.

The stack would then start rolling back and would eventually create the old resource again, but it might then be while before it gets to that point

Yes, and since third party providers use the same schema you have no guarantee that the old resource is recreatable (it should be, but people have a tendency to aim at their own feet sometimes).

That being said, I don't see any way around this risk. How else would they do it? What would be the behavior of the resource if they didn't opt in? If it's a real singleton, then there is no way around this. Besides not updating / replacing it. If it's not a real singleton they can open a bug with the resource provider (or do the two steps themselves in two updates).

However you can still reduce the risk, eg by doing the update of the singleton (if possible) in its own operation, and everything else in another.T

his is similar to how other capabilities work. I either accept the risk that the template developer can create IAM resources or I don't deploy the resources. And I can similarly read the policies in the template to reduce the risk of there being something dangerous in there.

That being said, I'm not certain that an extra capability is the best solution here (especially if there are already resources with this behaviour), but i do think consumer might want to know when a provider behaves unintuitive.

I was also thinking about what happens if the provider throws an NotUpdatable exception. But since I haven't tested the current behaviour of that, I'm assuming that it will do the right thing in both cases.

Thanks for the thoughtful responses!

rjlohan commented 4 years ago

I like the idea that we add a way for customers to block stack changes that do these non-standard replacement changes. That wasn't originally in scope, but it's a good idea. I think we can move ahead with the declarative changes as a start and then look at how to enable the protection idea.

wenyhu commented 4 years ago

can you add a section to the README that explains this property?

Will add.

rjlohan commented 4 years ago

@vladtsir has a good suggestion. As I commented earlier, I like the naming of this property as it expresses the impact of the setting. However it doesn't need to be a list. A string (scalar) with enum does the same job with less hassle.

"replacementStrategy": {
    "type": "string",
    "description": "The valid replacement strategies are [create_then_delete] and [delete_then_create]. All other inputs are invalid.",
    "default": "create_then_delete",
    "enum": [
        "create_then_delete",
        "delete_then_create"
    ]
},

There's only one other value I think might come up eventually which would be something like 'in place' - where an Update action triggers the replacement. An API that offers only a HTTP PUT on a List might be such a thing.