toolmantim closed 3 years ago
A few options:
Related: buildkite/agent#111 buildkite/agent#224
Option 1 would pretty neatly solve a lot of those.
What if we just changed the command that starts the agent to buildkite-agent start; report-instance-as-unhealthy?
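A rough sketch of that idea, written as a small Python wrapper rather than a shell one-liner. It assumes the instance sits in an AWS Auto Scaling group and that boto3 is installed on it. `report_unhealthy_via_asg` and `run_then_report` are made-up names, and report-instance-as-unhealthy from the comment above is not an existing command as far as this thread shows.

```python
import subprocess
from typing import Callable, Sequence


def report_unhealthy_via_asg(instance_id: str) -> None:
    """Mark the instance unhealthy so its Auto Scaling group replaces it."""
    import boto3  # assumption: boto3 is available on the instance

    boto3.client("autoscaling").set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
    )


def run_then_report(
    command: Sequence[str],
    instance_id: str,
    report: Callable[[str], None] = report_unhealthy_via_asg,
) -> int:
    """Run the agent in the foreground, then report once it exits for any reason."""
    try:
        exit_code = subprocess.run(list(command)).returncode
    finally:
        # Like `;` in the shell version, this runs whatever the exit status was.
        report(instance_id)
    return exit_code
```

On an instance this would be called as something like `run_then_report(["buildkite-agent", "start"], instance_id)`, with the instance ID looked up from instance metadata (not shown here).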
I know this is not directly related, but I have also seen agents spin up when scaling and fail to register with Buildkite. The result is an agent that is up and running but not actually doing anything. Would it be sensible for the metrics stack to periodically check that an agent is registered with Buildkite? If the agent is not registered, it should be shut down.
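Something like the following could implement that periodic check. It is only a sketch: `list_agents` stands in for whatever returns the organization's agent records (for example a call to the Buildkite REST API), the `hostname` key on each record is an assumption about that payload, and `shutdown` is a hypothetical callback. Requiring several consecutive misses is also an assumption, added so that an agent still in the middle of registering is not killed.

```python
from typing import Callable, Iterable, Mapping


def is_registered(hostname: str, agents: Iterable[Mapping[str, str]]) -> bool:
    """True if any agent record reports this host's hostname."""
    return any(agent.get("hostname") == hostname for agent in agents)


class RegistrationWatchdog:
    """Calls `shutdown` once `max_misses` consecutive checks find no agent."""

    def __init__(
        self,
        hostname: str,
        list_agents: Callable[[], Iterable[Mapping[str, str]]],
        shutdown: Callable[[], None],
        max_misses: int = 3,
    ) -> None:
        self.hostname = hostname
        self.list_agents = list_agents
        self.shutdown = shutdown
        self.max_misses = max_misses
        self.misses = 0

    def check(self) -> bool:
        """Run one check; return whether the agent is currently registered."""
        if is_registered(self.hostname, self.list_agents()):
            self.misses = 0  # a successful check resets the counter
            return True
        self.misses += 1
        if self.misses >= self.max_misses:
            self.shutdown()
        return False
```

A timer or cron job would call `check()` at whatever interval the stack already uses for metrics.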
I'd +1 this, but also take it a step further. We've been seeing a number of cases where containerd restarts and then the job just continues to run and run (we've also seen cases where the job just stops producing output and fails, but I'm still trying to see if that's the same issue). So I'd have it check the health of the common services a node needs in order to be healthy, and terminate the instance when it doesn't meet those requirements.
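A minimal sketch of such a service check, assuming the services are systemd units. The unit names in `REQUIRED_SERVICES` are guesses for illustration, and `failed_services` is a made-up helper; the only real command used is `systemctl is-active --quiet`, which exits 0 when the unit is active.

```python
import subprocess
from typing import Callable, Iterable, List

# Assumption: these are the units a healthy node needs; adjust per stack.
REQUIRED_SERVICES = ("containerd", "buildkite-agent")


def systemd_unit_is_active(unit: str) -> bool:
    """Ask systemd whether the unit is active (exit status 0 means active)."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def failed_services(
    services: Iterable[str],
    is_active: Callable[[str], bool] = systemd_unit_is_active,
) -> List[str]:
    """Return the services that are not currently active, in input order."""
    return [name for name in services if not is_active(name)]
```

If `failed_services(REQUIRED_SERVICES)` comes back non-empty, the instance could be terminated or marked unhealthy. Note that this only catches a service that is down. A containerd that restarted and came back would look active again, so catching that case would need a different signal, which is not covered here.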
We now supervise buildkite-agent with systemd and have an ExecStopPost which terminates the instance. Hopefully this isn’t happening any more! 😄
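For anyone curious what an ExecStopPost handler along those lines might look like, here is a sketch. It is not the script the stack actually uses. It assumes the instance is in an AWS Auto Scaling group and that boto3 is installed; `terminate_via_asg` and `on_agent_stopped` are made-up names. systemd does set `SERVICE_RESULT` in the environment of ExecStopPost commands, which is read here only so the reason can be logged.

```python
import os
from typing import Callable, Mapping


def terminate_via_asg(instance_id: str) -> None:
    """Terminate the instance and let the Auto Scaling group replace it."""
    import boto3  # assumption: boto3 is available on the instance

    boto3.client("autoscaling").terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        # Assumption: keep desired capacity so a replacement is launched.
        ShouldDecrementDesiredCapacity=False,
    )


def on_agent_stopped(
    instance_id: str,
    terminate: Callable[[str], None] = terminate_via_asg,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Terminate the instance after the agent stops; return systemd's result."""
    result = env.get("SERVICE_RESULT", "unknown")
    terminate(instance_id)
    return result
```

The unit's ExecStopPost line would point at a small script that calls `on_agent_stopped` with the instance ID. Because ExecStopPost runs however the service stopped, this also covers the case of the agent process being stopped remotely.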
I think you were right on this one @lox — seeing as there's no monit/upstart, if you remotely stop the buildkite-agent process then you're going to get the instance just sitting there, but without any agent running on it.
I imagine what should happen is that the instance is marked as unhealthy and then replaced as soon as that happens.