Instance should be marked as unhealthy if buildkite-agent is stopped

buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC

https://buildkite.com/docs/quickstart/elastic-ci-stack-aws

MIT License


Instance should be marked as unhealthy if buildkite-agent is stopped #63

Closed by toolmantim 3 years ago

toolmantim commented 8 years ago

I think you were right on this one @lox — seeing as there's no monit/upstart, if you remotely stop the buildkite-agent process then you're going to get the instance just sitting there, but without any agent running on it.

I imagine what should happen is that the instance is marked as unhealthy and then replaced as soon as that happens.

lox commented 8 years ago

A few options:

1. add a generic shutdown hook to the agent
2. set up a cron job or some sort of process monitor baked into the AMI
3. add a process monitor feature to lifecycled that calls the shutdown API when the process dies
4. add an AWS-aware feature to the agent that marks an instance as unhealthy on shutdown, perhaps distinguishing between remote and local shutdown

toolmantim commented 8 years ago

Related: buildkite/agent#111 buildkite/agent#224

lox commented 8 years ago

Option 1 would pretty neatly solve a lot of those.

toolmantim commented 8 years ago

What if we just changed the command that starts the agent to buildkite-agent start; report-instance-as-unhealthy?
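A minimal sketch of that "start the agent, then report" idea in Python. It assumes the AWS CLI is installed on the instance, that the instance role allows `autoscaling:SetInstanceHealth`, and that the instance ID is supplied by the caller. The function names are hypothetical and not part of the stack.

```python
import subprocess


def unhealthy_command(instance_id):
    """Build the AWS CLI call that marks an instance unhealthy in its Auto Scaling group."""
    return [
        "aws", "autoscaling", "set-instance-health",
        "--instance-id", instance_id,
        "--health-status", "Unhealthy",
    ]


def run_agent_then_report(instance_id, run=subprocess.run):
    """Run the agent in the foreground; once it exits for any reason, report the instance.

    `run` is injectable so the flow can be exercised without a real agent or AWS access.
    """
    agent = run(["buildkite-agent", "start"])  # blocks until the agent process stops
    run(unhealthy_command(instance_id), check=True)  # the ASG should then replace the instance
    return agent.returncode
```

One limitation of this approach: if the wrapper itself is killed, nothing reports the instance, which is where an external supervisor would still be needed.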

gugahoi commented 7 years ago

I know this is not directly related, but I have also seen agents spin up when scaling and fail to register with Buildkite. This results in the instance being up and running but not actually doing anything. Would it be sensible for the metrics stack to periodically check that an agent is registered with Buildkite? If the agent is not registered, the instance should be shut down.
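A rough sketch of what such a registration check could decide, in Python. It assumes `agents` is the decoded JSON list from Buildkite's REST API agent listing, and that each entry carries `hostname` and `connection_state` fields; both field names should be verified against the API documentation. The grace period and function names are hypothetical.

```python
def is_registered(agents, hostname):
    """Return True if any connected agent reports the given hostname."""
    return any(
        agent.get("hostname") == hostname
        and agent.get("connection_state") == "connected"
        for agent in agents
    )


def should_shut_down(agents, hostname, uptime_seconds, grace_seconds=300):
    """Shut down only when the instance has had time to register and still has not."""
    if uptime_seconds < grace_seconds:
        return False  # still booting; registration may simply not have happened yet
    return not is_registered(agents, hostname)
```

The grace period is there so that a freshly launched instance is not terminated before its agent has had a chance to connect.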

deppy commented 7 years ago

I'd +1 this, but also take it a step further. We've been seeing a number of cases where containerd restarts and then the job just continues to run and run (we've also seen cases where the job just stops producing output and fails, but I'm still trying to work out whether that's the same issue). So I'd have it check the health of the common services a node needs in order to be healthy, and terminate the instance when it doesn't meet those requirements.
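A minimal sketch of such a service check in Python, assuming the host runs systemd so that `systemctl is-active --quiet <unit>` exits non-zero for a unit that is not active. The unit names in the usage note are placeholders for whatever a given AMI actually requires.

```python
import subprocess


def failed_units(units, run=subprocess.run):
    """Return the units that `systemctl is-active` reports as not active."""
    return [
        unit
        for unit in units
        if run(["systemctl", "is-active", "--quiet", unit]).returncode != 0
    ]


def node_is_healthy(units, run=subprocess.run):
    """A node counts as healthy only when every required unit is active."""
    return not failed_units(units, run=run)
```

A periodic job could call `node_is_healthy(["docker", "containerd", "buildkite-agent"])` and mark the instance unhealthy when it returns `False`. Note that this only detects a service that is currently down; it would not catch a service that restarted and came back up while a job was running.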

keithduncan commented 3 years ago

We now supervise buildkite-agent with systemd and have an ExecStopPost which terminates the instance. Hopefully this isn’t happening any more! 😄