[batch] Jobs start super slow if scheduled immediately after EC2 instance termination

Dzhuneyt commented 4 years ago

Use case:

Batch Compute Env with minvCpus=0, maxvCpus=2
Submit a new job to a queue. It takes a relatively short time to go to RUNNING state the first time, because Batch is smart enough to provision a spot EC2 instance immediately and the only delay is from the instance provisioning, which is unavoidable.
Submit a new job to the queue immediately after the first job has completed and Batch destroyed the first EC2 instance.

Expected behavior: A new EC2 instance is provisioned immediately and the delay to start the second job is comparable to the delay for the start of the first one.

Current behavior: There is a "cooldown" period of about 5-10 minutes before a new instance can be provisioned after the previous one was destroyed. I think this should be adjustable, or at least set to a more sensible default.
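To put numbers on the delay described above, the time from submission to RUNNING can be measured for each job. The following is a minimal sketch, not part of the original report. It assumes `batch` is a boto3 Batch client (for example `boto3.client("batch")`), and the function name, polling interval, and timeout are hypothetical choices. The clock and sleep functions are injectable so the logic can be exercised without AWS.

```python
import time


def seconds_until_running(batch, job_name, job_queue, job_definition,
                          poll_interval=10.0, timeout=1800.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Submit one job and return the seconds elapsed until it is RUNNING.

    `batch` is assumed to be a boto3 Batch client; only `submit_job`
    and `describe_jobs` are used.
    """
    started = clock()
    job_id = batch.submit_job(
        jobName=job_name, jobQueue=job_queue, jobDefinition=job_definition
    )["jobId"]
    while True:
        status = batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        # A very short job may already have finished between two polls,
        # in which case the result is only an upper bound.
        if status in ("RUNNING", "SUCCEEDED"):
            return clock() - started
        if status == "FAILED":
            raise RuntimeError(f"job {job_id} failed before reaching RUNNING")
        if clock() - started > timeout:
            raise TimeoutError(f"job {job_id} still {status} after {timeout}s")
        sleep(poll_interval)
```

Running this once on an idle compute environment and again right after the first instance has been terminated should make the two start-up delays directly comparable.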

Other

From a quick look, the EC2 spot instances are launched by an AutoScaling Group that has a cooldown setting of 300 seconds, which may or may not be what causes this.
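The cooldown observation can be checked programmatically. This is a rough sketch under assumptions: `autoscaling` is taken to be a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`), and `name_contains` is a hypothetical filter, since this thread does not say how the group created by Batch is named.

```python
def default_cooldowns(autoscaling, name_contains=""):
    """Map each Auto Scaling group name to its DefaultCooldown in seconds.

    `autoscaling` is assumed to be a boto3 Auto Scaling client.
    `name_contains` is an illustrative substring filter on group names.
    """
    cooldowns = {}
    kwargs = {}
    while True:
        page = autoscaling.describe_auto_scaling_groups(**kwargs)
        for group in page["AutoScalingGroups"]:
            name = group["AutoScalingGroupName"]
            if name_contains in name:
                cooldowns[name] = group["DefaultCooldown"]
        # Follow pagination until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return cooldowns
        kwargs = {"NextToken": token}
```

A value of 300 for the relevant group would match what was seen in the console, although it would not by itself prove that the cooldown causes the delay.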

[X] :wave: I may be able to implement this feature request
[ ] :warning: This feature might incur a breaking change

This is a :rocket: Feature Request

iliapolo commented 4 years ago

@Dzhuneyt Are you aware of a way to configure this via CloudFormation? I'm not sure this is something the CDK can control. It feels more like an issue with the AWS Batch service itself.

github-actions[bot] commented 4 years ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

Dzhuneyt commented 4 years ago

I'm not sure either. I believe it's the "cooldown" configuration of the AutoScalingGroup that CDK creates, which CDK CAN control. However, the L2 construct for ComputeEnvironment does not expose such possibilities.

My idea was to either: 1) expose such input props, or 2) predefine more sensible defaults internally, e.g. a cooldown value lower than 300 seconds

iliapolo commented 3 years ago

@Dzhuneyt you mention:

AutoScalingGroup that CDK creates, which CDK CAN control. However, the L2 construct for ComputeEnvironment does not expose such possibilities.

However, the CDK does not explicitly create an AutoScalingGroup. I imagine this is an implementation detail of the Batch service. The L2 construct for ComputeEnvironment doesn't expose this because it's also not available on the L1:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-batch-computeenvironment-computeresources.html

It seems there is currently no option to configure this because it's not exposed in the Batch service API.

My suggestion is to create a feature request for AWS Batch on the AWS forums.

Or did you have something else in mind?

As a workaround, you could implement a Custom Resource that uses the aws-sdk to reconfigure the AutoScalingGroup created by Batch. But bear in mind it's usually not advisable to alter the state of resources that are managed by AWS services.
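For illustration, the core of such a workaround might look like the sketch below, which could run inside the Lambda handler backing a Custom Resource. It is only a sketch under assumptions: `autoscaling` is taken to be a boto3 Auto Scaling client, `set_default_cooldown` is a hypothetical helper, and the name of the group that Batch created has to be discovered by the caller. Whether lowering the cooldown actually shortens the delay is not confirmed in this thread, and Batch may overwrite the change.

```python
def set_default_cooldown(autoscaling, group_name, cooldown_seconds):
    """Set DefaultCooldown on one Auto Scaling group if it differs.

    `autoscaling` is assumed to be a boto3 Auto Scaling client.
    Returns True when an update was sent, False when nothing changed.
    """
    if cooldown_seconds < 0:
        raise ValueError("cooldown_seconds must be >= 0")
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"]
    if not groups:
        raise LookupError(f"no Auto Scaling group named {group_name!r}")
    if groups[0]["DefaultCooldown"] == cooldown_seconds:
        return False  # already at the requested value, skip the API call
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        DefaultCooldown=cooldown_seconds,
    )
    return True
```

Skipping the update when the value already matches keeps the helper idempotent, which matters because Custom Resource handlers can be invoked more than once.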

Dzhuneyt commented 3 years ago

Thanks for the detailed clarification. That's unfortunate, but I guess we can live with it.

github-actions[bot] commented 3 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

aws / aws-cdk

[batch] Jobs start super slow if scheduled immediately after EC2 instance termination #10943
