buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License

Multiple stacks with same queue name #697

Closed xiaket closed 3 years ago

xiaket commented 5 years ago

Hi there,

I think we are not properly handling the case where multiple CI stacks are configured to use the same queue name. I have two stacks running with the same queue name: one ASG with 2 idle instances, the other ASG with 0 instances. Upon a new build, the expected behaviour is that the first ASG picks up the task while the other takes no action. The observed behaviour is that the instances in the first ASG correctly picked up the task, but the other ASG also scaled up to provide capacity that was not needed.

A real-world use case for this is an upgrade path for the CI stack. We could set up another stack using the new template while keeping the existing stack running. These two stacks would share the same queue name, and we could manually change the size of each ASG to gradually migrate all the workload from one ASG to the other.

I'm happy to dig into the code and possibly provide a PR for this, but want to hear your opinions first.

lox commented 5 years ago

This one is a known issue with the Elastic CI stack. I'm not sure I have a good answer for how it could be fixed. Would welcome suggestions though!

xiaket commented 5 years ago

Cheers @lox, I'll take a look around and see what I can do. Will update here. :)

esalter commented 4 years ago

This is a big pain point for us as well.

yob commented 4 years ago

We're big fans of immutable infrastructure, so the upgrade-by-replacing use case is a good one. Is that where the pain is for you as well @esalter?

The complication is that each elastic stack is unaware of what else is monitoring the same queue, so they're unable to automatically change their scaling behaviour to work in tandem with each other (like, only scaling up to 50% of the queue size).
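To illustrate the complication, here is a minimal sketch under a deliberately simplified model: each stack sizes itself from the queue depth alone. It does not reproduce the actual buildkite-agent-scaler logic, and `desired_instances` and its `share` parameter are hypothetical names made up for this example.

```python
import math


def desired_instances(scheduled_jobs, agents_per_instance, share=1.0):
    """Instances one stack would ask for, seeing only the queue depth.

    `share` is a hypothetical knob: the fraction of the queue this stack
    should cover. Nothing in this sketch coordinates it between stacks.
    """
    return math.ceil(scheduled_jobs * share / agents_per_instance)


# Two stacks watch the same queue: 4 jobs waiting, 1 agent per instance.
jobs = 4
uncoordinated = [desired_instances(jobs, 1) for _ in range(2)]
split = [desired_instances(jobs, 1, share=0.5) for _ in range(2)]

print(sum(uncoordinated))  # 8: each stack sizes itself for the whole queue
print(sum(split))          # 4: each stack covers half of the queue
```

The second result only works out because someone told each stack its share up front, which is exactly the knowledge the stacks don't have today.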

The specific case of draining a stack and allowing the work to be picked up by a replacement stack monitoring the same queue is interesting. That wouldn't necessarily require co-ordination between the stacks. I guess a human could manually flip the old stack into "never scale up, only scale down" mode. I think the newer stack would then automatically start scaling up as required.

Such a switch would effectively involve turning off buildkite-agent-scaler in the old stack, or maybe ratcheting down the max size of the ASG in the old stack. Is that the kind of feature you're after?
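As a rough sketch of the "ratcheting down the max size" option, something like the following could be run by hand against the old stack's ASG. The function name `ratchet_down_max_size` is made up for this example, and the client is assumed to be a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`), passed in so the function can also be tried with a stub.

```python
def ratchet_down_max_size(autoscaling, group_name, step=1):
    """Lower an ASG's MaxSize by `step`, never going below zero.

    `autoscaling` is assumed to be a boto3 Auto Scaling client.
    Returns the new MaxSize.
    """
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = response["AutoScalingGroups"][0]
    new_max = max(group["MaxSize"] - step, 0)
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        # MinSize may not exceed MaxSize, so lower it too when needed.
        MinSize=min(group["MinSize"], new_max),
        MaxSize=new_max,
    )
    return new_max
```

Calling it repeatedly until it returns 0 would shrink the old stack step by step. Lowering the max below the current group size triggers scale-in, which could terminate instances that are still running jobs, so this sketch says nothing about graceful shutdown.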

esalter commented 4 years ago

Yes - I basically had to manually scale down the old stack while allowing the new one to scale up. It was actually easiest for us to do it off-hours so as to minimize developer disruption.

Another, similar-ish but different use case is that we actually want to have different instance sizes/types watching the same queue. The main reason is that we want to guard against spot terminations - if some instances are terminated, we'd like to try using a different instance type. We'd also like to fall back to on-demand instances if nothing else is available. This could be handled by modifying the stack to allow multiple instance types in the ASG (https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-purchase-options.html).
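For reference, a minimal sketch of what such a multi-instance-type configuration might look like when expressed as the `MixedInstancesPolicy` structure accepted by the EC2 Auto Scaling API. The helper name `mixed_instances_policy`, the launch template name and the instance types are placeholders chosen for this example, not anything taken from the Elastic CI stack template.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=0, on_demand_percent=0):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Instances that are always on-demand, before any spot.
            "OnDemandBaseCapacity": on_demand_base,
            # Share of on-demand above the base; 0 means all spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "example-launch-template",  # placeholder name
    ["m5.large", "m5a.large", "m4.large"],
    on_demand_base=1,
)
```

Note that this expresses a fixed on-demand share rather than a fallback that kicks in only when spot capacity runs out, so it covers the "try a different instance type" part more directly than the on-demand fallback part.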

yob commented 4 years ago

I think I might transfer this issue to the elastic stack repo. I think it's unlikely we'll expand buildkite-agent-scaler itself to support draining. If we can do anything it's likely to be a layer up in the stack.

For mixed instances there's a proposal at https://github.com/buildkite/elastic-ci-stack-for-aws/pull/651, although I'll admit it's stalled a bit. There's definitely some interest in the model of multiple instance types within a single stack. Would that meet your needs, or would you prefer multiple single-instance-type stacks?

esalter commented 4 years ago

I think that makes sense. Thanks!

At first glance the linked PR looks like it would work. I'll take my comments there. Thanks for the tip!

keithduncan commented 3 years ago

Given that we have merged multiple instance type support, and support draining a stack’s instances by disabling that stack’s scaling lambda, I think all the work needed for this is done 🎉

I’m going to close this issue but if I’ve missed something in my reading of the comments please let me know and I’ll see what we can do.