buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License

Configure the Autoscaling Group to use Spot Fleet to enable CloudWatch Spot metrics #1053

Open RemyDeWolf opened 2 years ago

RemyDeWolf commented 2 years ago

Is your feature request related to a problem? Please describe. We would like to have CloudWatch metrics about our Spot instances, such as FulfilledCapacity and TargetCapacity, so that we can automatically detect Spot outages. (AWS reference for these metrics)
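A minimal sketch of how such detection could work once the metrics exist, assuming the FulfilledCapacity and TargetCapacity datapoints have already been fetched from CloudWatch as two equal-length lists (oldest first). The function name, the 0.8 ratio and the three-sample window are hypothetical choices for illustration, not anything the stack or AWS defines.

```python
from typing import Sequence


def spot_outage_detected(
    fulfilled: Sequence[float],
    target: Sequence[float],
    min_ratio: float = 0.8,      # hypothetical threshold
    sustained_points: int = 3,   # hypothetical window size
) -> bool:
    """Return True when every one of the most recent `sustained_points`
    samples has fulfilled capacity below `min_ratio` of target capacity."""
    if len(fulfilled) != len(target):
        raise ValueError("fulfilled and target must have the same length")
    if len(fulfilled) < sustained_points:
        # Not enough data to call it an outage.
        return False
    recent = list(zip(fulfilled, target))[-sustained_points:]
    # Samples with a zero target are treated as "no outage" to avoid
    # dividing by zero when the fleet is intentionally scaled to nothing.
    return all(t > 0 and f / t < min_ratio for f, t in recent)
```

Requiring several consecutive low samples avoids reacting to a single slow fulfilment while the fleet is still scaling up.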

Currently, the limitation with this stack is that the Autoscaling Group is configured to launch "Spot Instances", and AWS doesn't provide any Spot metrics for these. This is entirely different from creating an Autoscaling Group that leverages a Spot Fleet. If we want the metrics, the Autoscaling Group should be configured to leverage a "Spot Fleet".

Describe the solution you'd like Expose a parameter UseSpotFleet, default false, to configure the Autoscaling Group to use a Spot Fleet. Here is the documentation on how to configure automatic scaling for a Spot Fleet. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-automatic-scaling.html

Describe alternatives you've considered If we had a fallback to use OnDemand instances when Spot is not available, we would not need to implement this. https://github.com/buildkite/elastic-ci-stack-for-aws/issues/851

Additional context We set OnDemandPercentage=0 to save on cost, and it works most of the time. But we experience a Spot outage every few days, and we would like to automatically fall back to OnDemandPercentage=100 when we detect one. We would write a custom Lambda to update OnDemandPercentage when we detect an outage. Having no CloudWatch metrics makes it hard for us to automate this process.
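The fallback described above could be sketched roughly as follows. This only builds the parameter list in the shape that CloudFormation's UpdateStack API accepts for `Parameters` (one entry per stack parameter, with `UsePreviousValue` for everything left unchanged). The actual API call, the Lambda wiring and the outage check are omitted, and the helper names are hypothetical. The 0 and 100 values come from the comment above.

```python
from typing import Dict, List, Sequence, Union

Parameter = Dict[str, Union[str, bool]]


def on_demand_percentage(outage: bool) -> int:
    """Fall back fully to On-Demand during a Spot outage, otherwise use Spot only."""
    return 100 if outage else 0


def build_stack_parameters(parameter_keys: Sequence[str], outage: bool) -> List[Parameter]:
    """Build an UpdateStack-style parameter list that changes only
    OnDemandPercentage and keeps every other parameter at its previous value."""
    if "OnDemandPercentage" not in parameter_keys:
        raise ValueError("stack has no OnDemandPercentage parameter")
    params: List[Parameter] = []
    for key in parameter_keys:
        if key == "OnDemandPercentage":
            # CloudFormation parameter values are passed as strings.
            params.append({"ParameterKey": key,
                           "ParameterValue": str(on_demand_percentage(outage))})
        else:
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params


# "SomeOtherParameter" is a placeholder, not a real stack parameter name.
print(build_stack_parameters(["OnDemandPercentage", "SomeOtherParameter"], outage=True))
```

Listing every existing key with `UsePreviousValue` matters because a stack update that omits a parameter does not keep its current value automatically.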

sj26 commented 2 years ago

It looks like there was some earlier work investigating using spot fleets in #269, but it was never finished.

triarius commented 2 years ago

Hi @RemyDeWolf, we've had a look at what we would need to make this change, and it looks like enabling the use of a spot fleet instead of an ASG is not a straightforward change for us to take on at this time. Furthermore, we would also want to update https://github.com/buildkite/buildkite-agent-scaler/ to work with spot fleets at the same time.

However, this is a feature we would very much support the contribution of PRs for. So please don't hesitate to have a shot at it yourself.