atlassian / escalator

Escalator is a batch or job optimized horizontal autoscaler for Kubernetes
Apache License 2.0

[Feature Request] Force Removal Taint #245

Closed: jackcasey-visier closed this issue 1 month ago

jackcasey-visier commented 2 months ago

Is your feature request related to a problem? Please describe.

Hello! We're an organization with fairly heavy-duty usage of Escalator, and we love it so far!

Something we've recently been wrestling with is an issue with our storage provider (totally unrelated to Escalator) that impacts nodes running new pods; already-running pods are fine. Unfortunately, when nodes get into this state it isn't automatically recoverable and the node needs to be removed. We can add a taint, but Escalator won't scale the node down until the job queue is empty.

Describe the solution you'd like

My proposed solution is to implement a new taint. Something like:

atlassian.com/escalator=force:NoSchedule

This taint would be added by a custom health checking process, and not the responsibility of Escalator.
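
To make it concrete, here's a minimal client-go sketch of how such an external health-check process might apply the proposed taint (the node name, in-cluster config, and `taintNodeForRemoval` helper are placeholders for illustration, not existing code):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// taintNodeForRemoval marks a node with the proposed force taint so that
// Escalator (with the feature described here) could remove it once drained.
func taintNodeForRemoval(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Don't add the taint twice.
	for _, t := range node.Spec.Taints {
		if t.Key == "atlassian.com/escalator" && t.Value == "force" {
			return nil
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "atlassian.com/escalator",
		Value:  "force",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// "ip-10-0-0-1.ec2.internal" is a placeholder node name.
	if err := taintNodeForRemoval(context.Background(), client, "ip-10-0-0-1.ec2.internal"); err != nil {
		fmt.Println("failed to taint node:", err)
	}
}
```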

When this taint is encountered during the Escalator loop, a check would be made for running pods, and if none are found the node is removed, regardless of whether there are still pending jobs. This would not have any impact on the existing usage of taints.
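
To sketch the idea (names like `forceTaintKey` and `shouldForceRemove` are hypothetical, not existing Escalator code):

```go
package nodecheck

import corev1 "k8s.io/api/core/v1"

// forceTaintKey is the proposed taint key; the value "force" marks a node
// for removal regardless of pending jobs.
const forceTaintKey = "atlassian.com/escalator"

// shouldForceRemove reports whether a node carries the proposed force taint
// and has no running pods left, in which case it could be terminated even
// while the job queue still has pending work.
func shouldForceRemove(node *corev1.Node, podsOnNode []*corev1.Pod) bool {
	hasForceTaint := false
	for _, t := range node.Spec.Taints {
		if t.Key == forceTaintKey && t.Value == "force" {
			hasForceTaint = true
			break
		}
	}
	if !hasForceTaint {
		return false
	}
	for _, p := range podsOnNode {
		if p.Status.Phase == corev1.PodRunning {
			return false
		}
	}
	return true
}
```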

This way, the entire system is able to purge bad nodes for whatever reason and scale up as normal.

Very curious if there is any interest in this functionality! I am happy to implement it if so!

Thank you

awprice commented 2 months ago

> Hello! We're an organization with fairly heavy-duty usage of Escalator, and we love it so far!

Awesome to hear!

> When this taint is encountered during the Escalator loop, a check would be made for running pods, and if none are found the node is removed, regardless of whether there are still pending jobs. This would not have any impact on the existing usage of taints.
>
> This way, the entire system is able to purge bad nodes for whatever reason and scale up as normal.

It's funny - we're actually interested in this feature ourselves too. We have cases where bad configuration has been rolled out to some or all nodes, and we'd like a method to quickly remove these bad nodes, regardless of how old they are.

Our thinking has also been some sort of taint that users can apply to the node, such that if Escalator sees it, it will prioritise replacing that node ASAP.

Currently we rely on doubling the desired number of instances in the autoscaling group to force Escalator to replace old, broken nodes; however, this is time-consuming, requires access to modify the autoscaling group, and is sometimes costly.
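
For context, that workaround looks roughly like the following (an assumed sketch using the AWS SDK for Go; the group name is a placeholder and this isn't part of Escalator):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	groupName := "escalator-workers" // placeholder ASG name

	// Look up the current desired capacity.
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(groupName)},
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		panic(fmt.Sprintf("failed to describe ASG: %v", err))
	}
	current := aws.Int64Value(out.AutoScalingGroups[0].DesiredCapacity)

	// Double it so fresh nodes come up and the old, broken ones can be retired.
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(groupName),
		DesiredCapacity:      aws.Int64(current * 2),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("desired capacity raised from %d to %d\n", current, current*2)
}
```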

So having a taint to selectively do this would be great - it allows us to target only the nodes with the bad configuration, and lets our internal cluster users apply the taint themselves without needing AWS access.

> Very curious if there is any interest in this functionality! I am happy to implement it if so!

We'd love it if you're able to implement this. Also happy to spend the time reviewing and testing it once implemented.

jackcasey-visier commented 2 months ago

@awprice This is a match made in heaven then eh!

I'll put something together, ideally within the next week :) To be totally transparent, I've never touched Go before, so I'm making assumptions about how quick it will be to implement (we're a Scala/Java shop).

> Also happy to spend the time reviewing and testing it once implemented.

Really appreciate this! I'll follow up in this thread once things are moving forward :)

At a high level, how do you feel about the naming: atlassian.com/escalator=force?

jackcasey-visier commented 1 month ago

@awprice I've opened up a WIP PR with barebones logic. Are you able to take a peek and let me know your thoughts on overall pattern/direction? Thank you!

awprice commented 1 month ago

@jackcasey-visier 👍 will take a look