Open robinjam opened 4 years ago
This is a common situation with a VM-based build farm. If the VM host is shut down for a while, all the guests are suspended. When they're woken up, all the Buildkite agents in the guests are stuck in this permanently failed state until they're manually restarted.
Having the agent reauthenticate in response to a "401 Invalid access token" response, either in any context or just in response to a ping or heartbeat request, would be a nice improvement.
Having it exit, so an external process watcher can restart it, wouldn't be quite as good - particularly for Windows and macOS users, who are more likely to be running it ad-hoc rather than under a process watcher - but would be significantly better than the current situation.
The sleeping VM use case is an interesting one. I haven't heard of many customers using that model, but it's totally a valid way to run the agent and it'd be great to handle it better.
Both of the suggested approaches sound workable. I think we might lean slightly towards the "exit on invalid access token" approach, mainly because it's simpler. It does mean we'll rely on an init system to restart, but I think the vast majority of running agents will have something managing them.
We're pretty snowed under at the moment, but if you do find time to work on a PR we'll be happy to review it!
The relevant code is here: https://github.com/buildkite/agent/blob/750b8063a5f65629dfb2b7c3232be08f554f7e9f/agent/agent_worker.go#L348-L354
I haven't tested this at all, but I note further down that function we stop the agent with a.Stop(false). It'd be interesting to see what happens if that's called after an invalid access token error.
One of my Buildkite agents spends a lot of time asleep. Often when I wake it after several days of sleeping, the agent process fails to reconnect and the log fills up with entries like this:
Restarting the agent process fixes the issue, but I think the agent should be able to recover from this condition instead of simply retrying the invalid access token indefinitely.
Buildkite agent v3.22.1, Windows 10 build 19041.329