Add support for instance cordoning

keithduncan commented 2 years ago

Is your feature request related to a problem? Please describe.

Presently, when an agent is failing builds, the only way to fix it is to stop the agent (which terminates the instance) or terminate the instance directly.

To diagnose a failing instance, it would be useful to be able to "cordon" it: keep the instance running while stopping the agent from accepting any more jobs.

Describe the solution you'd like

Simply not dispatching to a given agent from buildkite.com would not be sufficient. Cordoning only at the agent level would mean that no replacement instance is booted to maintain pool capacity.

Instead, infrastructure-level cordoning would take the instance out of service in the Auto Scaling group (ASG). Using autoscaling:EnterStandby, rather than detaching the instance, would keep an ASG reference to it, and the desired count would be maintained so that a replacement instance is booted.
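As a rough illustration of the EnterStandby call described above, here is a minimal Python sketch. It assumes a boto3-style Auto Scaling client is passed in; the function name `cordon_instance` and the example group and instance names are hypothetical, not part of the Elastic CI Stack.

```python
from typing import Any


def cordon_instance(client: Any, asg_name: str, instance_id: str) -> Any:
    """Move one instance to Standby without shrinking the desired capacity.

    `client` is assumed to expose boto3's Auto Scaling `enter_standby` method.
    """
    return client.enter_standby(
        AutoScalingGroupName=asg_name,
        InstanceIds=[instance_id],
        # Keep the desired count unchanged so the ASG boots a replacement.
        ShouldDecrementDesiredCapacity=False,
    )


# Possible usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   cordon_instance(boto3.client("autoscaling"), "my-agent-asg", "i-0123456789abcdef0")
```

Passing `ShouldDecrementDesiredCapacity=False` is what keeps the desired count intact; with `True`, the group would shrink instead of booting a replacement.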

The way I would expose this infrastructure-level functionality to the buildkite.com API and UI would be to add an agent lifecycle hook called cordon. If the hook is present when the agent registers with the API, set a flag indicating that the agent has a cordon hook that can be invoked.

In the Elastic CI Stack’s cordon hook I would either invoke the AWS CLI directly, or use an AWS SSM Automation to stop the agent systemd job and set the instance to standby.
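A hedged sketch of what the AWS CLI variant of such a hook might run, written in Python for illustration. The systemd unit name `buildkite-agent`, the function names, and the ordering of the two steps are assumptions; the order simply mirrors the sentence above.

```python
import shlex
import subprocess
from typing import List

AGENT_UNIT = "buildkite-agent"  # assumed systemd unit name


def cordon_commands(asg_name: str, instance_id: str,
                    agent_unit: str = AGENT_UNIT) -> List[List[str]]:
    """Return the commands a cordon hook could run, in order."""
    return [
        # Stop the agent so it accepts no further jobs.
        ["systemctl", "stop", agent_unit],
        # Put the instance in Standby while keeping the desired count.
        ["aws", "autoscaling", "enter-standby",
         "--instance-ids", instance_id,
         "--auto-scaling-group-name", asg_name,
         "--no-should-decrement-desired-capacity"],
    ]


def run_cordon(asg_name: str, instance_id: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping on first failure."""
    for cmd in cordon_commands(asg_name, instance_id):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `run_cordon("my-agent-asg", "i-0123456789abcdef0")` with the default `dry_run=True` only prints the two commands, which makes it easy to review them before running anything on an instance.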

Decoupling the agent and instance lifetimes may depend on the work started in #964. The solution may also need to take instances that set disconnect-after-job into consideration.

Describe alternatives you've considered

As above, keeping the agent alive but not dispatching to it is an inferior solution.

keithduncan commented 2 years ago

> Simply not dispatching to a given agent from buildkite.com would not be sufficient.

Some more thoughts on this. I think we could do both agent and instance cordoning: keep the agent around so it shows in the UI, but in a non-dispatchable state. The key part will be to ensure the instance and agent aren’t considered "available" by the buildkite-agent-scaler, so that the pool is sized appropriately without assuming that the instance or agent is available for work.

Another factor to consider when cordoning an "agent" is multiple AGENTS_PER_INSTANCE. We wouldn’t want to pull an instance with multiple agents out of service and keep dispatching to some of the agents on it. Stopping the agent completely does seem like a more reliable way to guarantee that the instance doesn’t do any work.

ptarjan commented 2 years ago

+1 to this feature request for Robinhood. We only use one agent per instance so don't have that edge case. Ideally the feature would be part of the UI next to the "Stop Agent" button.

buildkite / elastic-ci-stack-for-aws

Add support for instance cordoning #972