aws-cloudformation / cloudformation-resource-schema

The CloudFormation Resource Schema defines the shape and semantic for resources provisioned by CloudFormation. It is used by provider developers using the CloudFormation RPDK.
Apache License 2.0
95 stars 38 forks source link

add custom stabilization timeout in handlers schema #61

Closed aygold92 closed 4 years ago

aygold92 commented 4 years ago

Issue #, if available:

Description of changes: This adds a new parameter timeoutInMinutes to the handlerDefinition, in order for providers to specify a custom stabilization timeout, to be used by CloudFormation. This could be used by the handler itself at some point to stop doing work.

I used timeoutInMinutes since it is much simpler for handlers rather than specifying some ISO format, and I don't see any need to have any more granularity than that

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rjlohan commented 4 years ago

Until the platform is capable of actually respecting this information I don't think it should be part of the schema.

aygold92 commented 4 years ago

Until the platform is capable of actually respecting this information I don't think it should be part of the schema.

wanted to get the idea out there, but I will wait to merge this until its actually supported in the CFN platform

benkehoe commented 4 years ago

This shouldn't necessarily be static, right? With the DynamoDB GSI feature request, the length of time is highly dependent on the data and usage. So instead you might want the resource to be able to report an expected completion time dynamically, which could then also be reported to the user.

benbridts commented 4 years ago

I think a static (maybe not required?) "maximum" could be useful, as a dynamic value would need extra data to be returned by the resource (increasing complexity for handler that are fine with a static value, eg. code that never returns InProgress).

For dynamic values it might be interesting to have both a "expected completion" and a "pause time" value that can be returned.

The expected completion could be exposed to the end user (a "completion percentage" might be another option for this). The pause time could indicate that CloudFormation should not invoke the provider for a certain amount of time. While CloudFormation waits the cost for going over the 30s threshold could be lower (ideally 0)

aygold92 commented 4 years ago

right, this is intended to signify the maximum timeout for a resource. A separate protocol could be developed using progress events to improve stabilization and possibly add dynamic timeouts (though dynamic timeouts could only shorten the timeout, and would still be subject to this maximum timeout)

tobywf commented 4 years ago

the maximum is especially useful for the failure case. e.g. if you're developing a resource, right now if you get delete wrong you have to wait 2h until it times out. for normal operation, it doesn't affect the provisioning duration.

you can already "sleep" the handler via the callback (edit: although this doesn't change or pause the timeout). true, it would be nice to have a dynamic system where the first invocation has a timeout of a few minutes, and then can request/guesstimate further timeouts as required via the callback. but this is tricky w.r.t. credentials. i think the approach Andrew's proposing is a decent compromise for now?

PatMyron commented 4 years ago

This shouldn't necessarily be static, right? With the DynamoDB GSI feature request, the length of time is highly dependent on the data and usage. So instead you might want the resource to be able to report an expected completion time dynamically, which could then also be reported to the user.

The completion time also depends on some factors that cannot be known ahead of time. Both of these factors can change drastically at any point during DynamoDB GSI creation:

  • The provisioned write capacity of the index
  • Write activity on the main table during index builds
benkehoe commented 4 years ago

Both of these factors can change drastically at any point during DynamoDB GSI creation

This shows that not only must the timeout be dynamic, but that it also must be modifiable during the operation, not just dynamically reported at the beginning.

rjlohan commented 4 years ago

This PR is only implementing a default maximum time which CloudFormation should respect when terminating non-responsive, or incomplete handlers. This timeout is treated as a FAILED operation. Hence, there is already dynamism built into the platform for any handler - if a handler decides that it has run longer than "it should" (e.g; "well this GSI creation is taking longer than I expected...") it can return a FAILED response at any time during operation with some sort of TimeoutExceeded exception or message.

I don't see any value in trying to introduce dynamic timeouts to either the schema or the interface, over the capabilities we already have.

benkehoe commented 4 years ago

So this timeout isn't a timeout for the operation, but for a specific invocation of the handler? That is, I set the timeout to 30 seconds, it's fine for the operation to take five minutes as long as my handler keeps returning a progress report in less than 30 seconds when it gets invoked (until the operation is complete)?

aygold92 commented 4 years ago

So this timeout isn't a timeout for the operation, but for a specific invocation of the handler? That is, I set the timeout to 30 seconds, it's fine for the operation to take five minutes as long as my handler keeps returning a progress report in less than 30 seconds when it gets invoked (until the operation is complete)?

No, this timeout is the max possible time the entire operation would take. So let's say you set the timeout to 5 minutes: CFN will invoke the handler and then listen to progress events for 5 minutes. If the operation is still IN_PROGRESS after 5 mnutes, CFN will stop listening and will terminate the operation.

benkehoe commented 4 years ago

this timeout is the max possible time the entire operation would take.

So then I don't understand @rjlohan's comment that the purpose is "terminating non-responsive, or incomplete handlers"

Does a successfully-returning handler reporting IN_PROGRESS count as non-responsive? If I don't explicitly set the timeout value in the schema, presumably CloudFormation has some default setting; is there a way to know programmatically that the timeout for an operation is the default and what the default is?

rjlohan commented 4 years ago

Does a successfully-returning handler reporting IN_PROGRESS count as non-responsive? If I don't explicitly set the timeout value in the schema, presumably CloudFormation has some default setting; is there a way to know programmatically that the timeout for an operation is the default and what the default is?

No, that'd fall into the responsive, but incomplete handler bucket. e.g; a type may periodically check that it has stabilized, but perpetually fail to do so. Then the timeout gives CloudFormation a signal that it should stop waiting.

benkehoe commented 4 years ago

responsive, but incomplete handler

I get it now. Thanks! I still am wondering how I can know programmatically what the default timeout is.