atlassian / escalator

Escalator is a batch or job optimized horizontal autoscaler for Kubernetes
Apache License 2.0
662 stars 59 forks

Disable HonorCooldown for AWS desired capacity changes #151

Closed awprice closed 5 years ago

awprice commented 5 years ago

We should set the HonorCooldown option to false when setting the desired capacity for the ASG.

We've seen cases where an instance is stuck in a Pending state and blocks the ASG from being updated. This can last 30-40 minutes, until AWS terminates the instance because it is failing health checks. The instance is usually in this state due to an underlying hardware issue.

Changing this value to false will allow Escalator to continue operating even when there are nodes with issues in the ASG.
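
For illustration, a minimal sketch of what the request could look like in Go. The struct is a local stand-in that mirrors three fields of the AWS SDK for Go's autoscaling.SetDesiredCapacityInput (AutoScalingGroupName, DesiredCapacity, HonorCooldown) so the example compiles without the SDK. The helper name buildSetDesiredCapacityInput and the ASG name are hypothetical, and this is not Escalator's actual code.

```go
package main

import "fmt"

// setDesiredCapacityInput is a local stand-in mirroring three fields of
// autoscaling.SetDesiredCapacityInput from the AWS SDK for Go, so this
// sketch builds with the standard library only.
type setDesiredCapacityInput struct {
	AutoScalingGroupName *string
	DesiredCapacity      *int64
	HonorCooldown        *bool
}

// buildSetDesiredCapacityInput (hypothetical helper) always sets
// HonorCooldown to false, so the capacity change does not wait on the
// ASG's cooldown period.
func buildSetDesiredCapacityInput(asgName string, desired int64) *setDesiredCapacityInput {
	honorCooldown := false
	return &setDesiredCapacityInput{
		AutoScalingGroupName: &asgName,
		DesiredCapacity:      &desired,
		HonorCooldown:        &honorCooldown,
	}
}

func main() {
	in := buildSetDesiredCapacityInput("example-asg", 5)
	fmt.Println(*in.AutoScalingGroupName, *in.DesiredCapacity, *in.HonorCooldown)
}
```

With the real SDK, the same three fields would be set on the input passed to SetDesiredCapacity.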

Jacobious52 commented 5 years ago

Are there any downsides or side effects to setting this to false? If so, would having it as a defaulted option in the node group config make sense?

awprice commented 5 years ago

I don't think there are any downsides to setting this to false, as we already provide a safeguard in Escalator with the scale lock. The scale lock works the same way as the cooldown in Autoscaling groups in that it prevents runaway scaling.
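
To make the comparison concrete, here is a rough sketch of a cooldown-style lock in Go. It is a simplified illustration of the idea, not Escalator's actual scale lock: the type, its fields, the method names, and the 60-second duration are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// scaleLock is a hypothetical, simplified lock: once taken, further
// scaling is refused until minLockDuration has elapsed.
type scaleLock struct {
	isLocked        bool
	lockedAt        time.Time
	minLockDuration time.Duration
}

// locked reports whether the lock is still held at time now, releasing
// it once the minimum duration has passed.
func (l *scaleLock) locked(now time.Time) bool {
	if l.isLocked && now.Sub(l.lockedAt) >= l.minLockDuration {
		l.isLocked = false
	}
	return l.isLocked
}

// tryScale returns true and takes the lock if scaling is allowed at now;
// it returns false while a previous scale is still locked.
func (l *scaleLock) tryScale(now time.Time) bool {
	if l.locked(now) {
		return false
	}
	l.isLocked = true
	l.lockedAt = now
	return true
}

// simulate attempts to scale at 0s, 30s and 90s with a 60s lock.
func simulate() []bool {
	start := time.Unix(0, 0)
	l := &scaleLock{minLockDuration: 60 * time.Second}
	return []bool{
		l.tryScale(start),
		l.tryScale(start.Add(30 * time.Second)),
		l.tryScale(start.Add(90 * time.Second)),
	}
}

func main() {
	fmt.Println(simulate()) // prints "[true false true]"
}
```

The second attempt is refused because it falls inside the lock window, which is the same runaway-scaling protection the ASG cooldown would otherwise provide.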

Cluster-autoscaler also sets this to false and doesn't have an option to change it - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L201

Including it in the node group config would require some extra thought: this is an AWS-specific setting, so we would need a way to store per-cloud-provider settings in the node group config.
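
One possible shape for per-cloud-provider settings, sketched in Go purely as a starting point. Every name here (nodeGroupOptions, awsNodeGroupOptions, the honorCooldown method) is hypothetical and does not reflect Escalator's real config. The option is a pointer so that "not set" can default to false.

```go
package main

import "fmt"

// awsNodeGroupOptions holds hypothetical AWS-only settings.
type awsNodeGroupOptions struct {
	// HonorCooldown is a pointer so an unset value can be told apart
	// from an explicit false.
	HonorCooldown *bool
}

// nodeGroupOptions is a hypothetical node group config with an optional
// provider-specific section.
type nodeGroupOptions struct {
	Name string
	AWS  *awsNodeGroupOptions
}

// honorCooldown returns the configured value, defaulting to false when
// the AWS section or the option itself is missing.
func (o nodeGroupOptions) honorCooldown() bool {
	if o.AWS == nil || o.AWS.HonorCooldown == nil {
		return false
	}
	return *o.AWS.HonorCooldown
}

func main() {
	enabled := true
	defaulted := nodeGroupOptions{Name: "default"}
	explicit := nodeGroupOptions{
		Name: "explicit",
		AWS:  &awsNodeGroupOptions{HonorCooldown: &enabled},
	}
	fmt.Println(defaulted.honorCooldown(), explicit.honorCooldown())
}
```

Keeping provider settings in their own nested section means other cloud providers can ignore them entirely.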