Open RemyDeWolf opened 2 years ago
It looks like there was some earlier work investigating using spot fleets in #269, but it was never finished.
Hi @RemyDeWolf, we've had a look at what we would need to make this change, and enabling the use of a Spot Fleet instead of an ASG is not a straightforward change for us to take on at this time. We would also want to update https://github.com/buildkite/buildkite-agent-scaler/ to work with Spot Fleets at the same time.
However, this is a feature we would very much support the contribution of PRs for. So please don't hesitate to have a shot at it yourself.
Is your feature request related to a problem? Please describe.
We would like to have CloudWatch metrics about our Spot instances, such as `FulfilledCapacity` and `TargetCapacity`, so we can automatically detect Spot outages. (AWS reference for these metrics)
Currently, the limitation with this stack is that the Auto Scaling group is configured to launch Spot Instances directly, and AWS doesn't provide any Spot metrics for these. If we want the metrics, the Auto Scaling group should be configured to leverage a Spot Fleet. Launching Spot Instances directly is entirely different from creating an Auto Scaling group that leverages a Spot Fleet.
Describe the solution you'd like
Expose a parameter `UseSpotFleet`, default `false`, to configure the autoscaler to use a Spot Fleet. Here is the documentation on how to configure Spot Fleet on an autoscaler: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet-automatic-scaling.html
Describe alternatives you've considered
If we had a fallback to use OnDemand instances when Spot is not available, we would not need to implement this. https://github.com/buildkite/elastic-ci-stack-for-aws/issues/851
Additional context
We set `OnDemandPercentage=0` to save on cost, and it works most of the time. But we experience a Spot outage every few days, and we would like to automatically fall back to `OnDemandPercentage=100` when we detect one. We would write a custom Lambda to update `OnDemandPercentage` when we detect an outage. Having no CloudWatch metrics makes it hard for us to automate this process.
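To make the intended automation concrete, here is a rough sketch of the decision logic such a Lambda could run once the `FulfilledCapacity` and `TargetCapacity` datapoints are available. The function names and the 0.5 threshold are hypothetical choices for illustration, not part of this stack or of any AWS API. Fetching the datapoints from CloudWatch and applying the new `OnDemandPercentage` to the stack are deliberately left out.

```python
from typing import Sequence


def spot_fulfillment_ratio(fulfilled: Sequence[float], target: Sequence[float]) -> float:
    """Return average FulfilledCapacity divided by average TargetCapacity.

    Both sequences are assumed to be datapoints from the same time window.
    """
    if not fulfilled or not target:
        raise ValueError("need at least one datapoint for each metric")
    avg_target = sum(target) / len(target)
    if avg_target == 0:
        # Nothing was requested, so nothing can be unfulfilled.
        return 1.0
    avg_fulfilled = sum(fulfilled) / len(fulfilled)
    return avg_fulfilled / avg_target


def choose_on_demand_percentage(
    fulfilled: Sequence[float],
    target: Sequence[float],
    outage_threshold: float = 0.5,  # hypothetical cut-off, tune for your workload
) -> int:
    """Return 100 during a suspected Spot outage, otherwise 0."""
    ratio = spot_fulfillment_ratio(fulfilled, target)
    return 100 if ratio < outage_threshold else 0
```

For example, `choose_on_demand_percentage([2, 1, 0], [10, 10, 10])` returns `100`, while a window where fulfilled capacity matches the target returns `0`. Averaging over a window rather than reacting to a single datapoint is meant to avoid flapping between the two values on short interruptions.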