First, it's important to understand how the service is configured in Windows. We configure the default behaviour on exit to Restart with a 10s delay.
We also configure the terminate-instance script to run once the service stops.
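As a rough illustration of that configuration, the sketch below builds the nssm invocations as argument lists without running them. It assumes the settings map onto nssm's AppExit, AppRestartDelay and AppEvents parameters; the script path and the exact values are hypothetical and not taken from the stack itself.

```python
SERVICE = "buildkite-agent"  # service name as it appears in the error message
# Hypothetical path: the real script location is not given in this description.
TERMINATE_SCRIPT = r"C:\buildkite-agent\bin\terminate-instance.ps1"


def nssm_commands(service=SERVICE, delay_ms=10_000):
    """Return nssm invocations as argument lists; nothing is executed here."""
    return [
        # Default action when the agent process exits: restart it.
        ["nssm", "set", service, "AppExit", "Default", "Restart"],
        # Wait 10 seconds (value in milliseconds) before restarting.
        ["nssm", "set", service, "AppRestartDelay", str(delay_ms)],
        # Assumed hook: run the terminate-instance script after the process exits.
        ["nssm", "set", service, "AppEvents", "Exit/Post",
         f"powershell -File {TERMINATE_SCRIPT}"],
    ]


for cmd in nssm_commands():
    print(" ".join(cmd))
```

Keeping the commands as data makes it easy to review them before passing each list to something like subprocess.run on a Windows host.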
The sequence of events is then:
1. The agent exits with code 0 once the idle timeout is reached.
2. The terminate-instance script is started asynchronously.
3. The service enters the SERVICE_PENDING state.
4. The terminate-instance script sends a STOP signal to the service, but it's blocked by the throttled restart.
5. After 10 seconds the throttled restart expires.
6. The service enters the SERVICE_START_PENDING state.
7. The STOP signal returns "buildkite-agent: Unexpected status SERVICE_START_PENDING in response to STOP control." but queues the stop command.
8. The terminate-instance-in-auto-scaling-group API call fails because the ASG is already at its MinSize.
9. The terminate-instance script sends a START signal, but this fails with "START: An instance of the service is already running."
10. The service processes the stop command and the agent stops.
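The race in this sequence can be illustrated with a toy state machine. This is a minimal sketch of the behaviour as described here, not nssm's actual logic: the class, its method names and the `simulate` helper are all hypothetical, and SERVICE_RUNNING stands for the healthy running state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Toy model of the service as described above; not nssm's real logic."""

    state: str = "SERVICE_RUNNING"
    stop_queued: bool = False
    log: list = field(default_factory=list)

    def agent_exits(self):
        # Exit code 0 triggers the Restart action, delayed by 10 seconds.
        self.state = "SERVICE_PENDING"  # state name as written in the description

    def restart_delay_expires(self):
        self.state = "SERVICE_START_PENDING"

    def stop(self):
        # While starting, the STOP control is rejected but the stop stays queued.
        if self.state == "SERVICE_START_PENDING":
            self.stop_queued = True
            self.log.append("Unexpected status SERVICE_START_PENDING in response to STOP control.")
        else:
            self.state = "SERVICE_STOPPED"

    def start(self):
        if self.state != "SERVICE_STOPPED":
            self.log.append("START: An instance of the service is already running.")
        else:
            self.state = "SERVICE_START_PENDING"

    def settle(self):
        # Finish starting, then act on any stop that was queued earlier.
        if self.state == "SERVICE_START_PENDING":
            self.state = "SERVICE_RUNNING"
        if self.stop_queued:
            self.state = "SERVICE_STOPPED"
            self.stop_queued = False


def simulate(stop_before_terminate: bool) -> ToyService:
    svc = ToyService()
    svc.agent_exits()            # the agent hits the idle timeout
    svc.restart_delay_expires()  # the blocked STOP only lands after this point
    if stop_before_terminate:
        svc.stop()               # rejected, but queued
    terminate_ok = False         # the ASG is already at MinSize, so the call fails
    if stop_before_terminate and not terminate_ok:
        svc.start()              # fails: the service is already starting
    svc.settle()
    return svc


print(simulate(stop_before_terminate=True).state)
print(simulate(stop_before_terminate=False).state)
```

In this model, stopping first leaves the service in SERVICE_STOPPED, while skipping the stop leaves it running after the restart delay.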
Changes
Don't attempt to stop the agent before terminating the instance. The stop is asynchronous, so it doesn't complete before the start command is issued.
Because of the restart delay, the agent is unlikely to start and pick up a job before the ASG can scale the instance down, so the stop is not necessary.
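The revised flow might look roughly like the sketch below. It is an illustration under assumptions rather than the actual script: `handle_agent_exit` and `terminate_call` are hypothetical names, with `terminate_call` standing in for the terminate-instance-in-auto-scaling-group request.

```python
from typing import Callable


def handle_agent_exit(
    terminate_call: Callable[[], None],
    log: Callable[[str], None] = print,
) -> bool:
    """Ask the ASG to terminate this instance without stopping or starting the service.

    Returns True when the terminate request was accepted, False otherwise.
    """
    try:
        terminate_call()
    except Exception as err:  # e.g. the ASG is already at its MinSize
        # No stop/start here: the service's own restart action brings the agent back.
        log(f"terminate request failed: {err}")
        return False
    return True
```

Because the handler never sends STOP or START, a failed terminate request leaves the service's configured restart as the only thing acting on the agent.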
Description
Steps to reproduce:
1. Set ScaleInIdlePeriod to some number greater than 0 (I used 6 to speed up the feedback)
2. Set TerminateInstanceAfterJob to false
3. Leave the ASG at its MinSize (so that terminate-instance-in-auto-scaling-group fails)
Observed behaviour: After the ScaleInIdlePeriod the EC2 instance is running but the buildkite-agent is in the SERVICE_STOPPED state.
Expected behaviour: the EC2 instance is running and the buildkite-agent is in the SERVICE_STARTED state.
Analysis
This bug was likely introduced in https://github.com/buildkite/elastic-ci-stack-for-aws/commit/c3ebaa5c45995471e3035838e1ecfb68be493b1a
The sequence of events listed at the top of this description is pieced together from log outputs, the nssm source code and Windows documentation.