buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License
417 stars 267 forks source link

Add a lifecycle hook for instance launch #1015

Open Niksko opened 2 years ago

Niksko commented 2 years ago

Is your feature request related to a problem? Please describe. At the moment, the agent startup process uses cfn-signal to signal back to the Cloudformation stack that the instance has been launched successfully. However cfn-signal is only useful when the stack is being updated. If an instance is created (e.g. due to a difference between desired and actual capacity in the ASG), then the cfn-signal is not useful.

We observed a case where the instance failed to retrieve some files from S3 during bootstrap, because the instance metadata service failed to respond. The bootstrap then ends up in this on error handler, where it again failed to contact the instance metadata service.

The result is an instance in the ASG that can't accept any jobs, and the result is a queue that doesn't autoscale (because the right number of instances is present), but also doesn't ever accept any jobs.

Describe the solution you'd like This seems like it could be fixed by adding an Autoscaling Lifecyle Hook on instance launch. The hook would default to terminating the instance if not completed after some amount of time, and the bootstrap script would complete the lifecycle action.

Describe alternatives you've considered We're not sure why the instance metadata endpoint failed in our case, but we'll look in to it. Regardless though, any failure during startup may have the same effect, as long as the instance creation is a result of scale up due to capacity mismatches and not during an update of the Cloudformation stack.

lantrix commented 2 years ago

We're not sure why the instance metadata endpoint failed in our case, but we'll look in to it.

Amazon does note that they recommended consumption of the instance meta data to have an exponential backoff strategy.

We throttle queries to the instance metadata service on a per-instance basis, and we place limits on the number of simultaneous connections from an instance to the instance metadata service ... If you are throttled while accessing the instance metadata service, retry your query with an exponential backoff strategy.

The curl call at this line (and other lines) has a --fail setting which will fail fast with no output at all on server errors. Given AWS can throttle queries to the instance metadata service on a per-instance basis, maybe curl can be set on calling the instance metadata service http://169.254.169.254 to have an exponential backoff strategy.

For example curls --retry-delay <seconds> makes curl sleep this amount of time before each retry when a transfer has failed with a transient error (it changes the default backoff time algorithm between retries).

moskyb commented 2 years ago

hey @Niksko and @lantrix!

this seems like an awesome idea, and it'll solve a bunch of problems that buildkite users have, so go for your life if you're interested in implementing something :)

something that's on the roadmap for the BK Agent at the moment is some sort of configurable healthchecking, which ties into this idea a little bit. productionising this is a fair ways off at this point though, and i think something like what you've proposed would slot in nicely alongside some sort of more robust healthchecking system.

joshross12 commented 2 years ago

+1