buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License

Fix windows agent not restarting #1318

Closed patrobinson closed 2 months ago

patrobinson commented 2 months ago

Description

Steps to reproduce:

Observed behaviour: After the ScaleInIdlePeriod elapses, the EC2 instance is running but the buildkite-agent is in the SERVICE_STOPPED state.

Expected behaviour: The EC2 instance is running and the buildkite-agent is in the SERVICE_RUNNING state.

Analysis

This bug was likely introduced in https://github.com/buildkite/elastic-ci-stack-for-aws/commit/c3ebaa5c45995471e3035838e1ecfb68be493b1a

First, it's important to understand how the service is configured on Windows. We configure the default behaviour on exit to Restart with a 10s delay. We also configure the terminate-instance script to run once the service stops.
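For readers unfamiliar with nssm, the configuration described above can be sketched roughly as the three `nssm set` commands built below. This is an illustration, not a copy of the stack's install script: it assumes nssm's `AppExit`, `AppRestartDelay` and `AppEvents` parameters, assumes the terminate-instance script is attached to the `Exit/Post` hook, and uses a hypothetical script path.

```python
from typing import List

SERVICE = "buildkite-agent"  # service name as referred to in this PR
RESTART_DELAY_MS = 10_000    # the 10s restart delay, in milliseconds
# Hypothetical path: the real location of the terminate-instance script is not stated here.
TERMINATE_SCRIPT = r"C:\buildkite-agent\bin\terminate-instance.ps1"


def nssm_service_config(
    service: str = SERVICE,
    delay_ms: int = RESTART_DELAY_MS,
    script: str = TERMINATE_SCRIPT,
) -> List[List[str]]:
    """Build (but do not run) the nssm commands for the described setup."""
    return [
        # Default action when the agent process exits: restart it.
        ["nssm", "set", service, "AppExit", "Default", "Restart"],
        # Wait this long before the restart is attempted.
        ["nssm", "set", service, "AppRestartDelay", str(delay_ms)],
        # Assumed hook: run the terminate-instance script after the process exits.
        ["nssm", "set", service, "AppEvents", "Exit/Post", f"powershell {script}"],
    ]


if __name__ == "__main__":
    for argv in nssm_service_config():
        print(argv)
```

The commands are only built as argument lists so they can be inspected on any platform; on a Windows instance with nssm installed, each list could be passed to `subprocess.run`.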

Here's the sequence of events I've pieced together based on log outputs, the nssm source code and Windows documentation

Changes

Don't attempt to stop the agent before terminating the instance. The stop is asynchronous, so it doesn't complete before the start command is issued. Because of the restart delay, the agent is unlikely to start and pick up a job before the ASG can scale the instance down, so the stop is not necessary.