Agent fails to reconnect with 401 Invalid access token after being offline for long periods of time

buildkite / agent

The Buildkite Agent is an open-source toolkit written in Go for securely running build jobs on any device or network

https://buildkite.com/

MIT License


Agent fails to reconnect with 401 Invalid access token after being offline for long periods of time #1245

Open robinjam opened 4 years ago

robinjam commented 4 years ago

One of my Buildkite agents spends a lot of time asleep. Often when I wake it after several days of sleeping, the agent process fails to reconnect and the log fills up with entries like this:

...
2020-07-14 21:53:04 WARN Desktop PC Failed to ping: GET https://agent.buildkite.com/v3/ping: 401 Invalid access token (Last successful was 116h25m52.6913959s ago)
2020-07-14 21:53:05 WARN Desktop PC POST https://agent.buildkite.com/v3/heartbeat: 401 Invalid access token (Attempt 2/5 Retrying in 5s)
...

Restarting the agent process fixes the issue, but I think the agent should be able to recover from this condition instead of retrying with the invalid access token indefinitely.

Buildkite agent v3.22.1, Windows 10 build 19041.329

wttw commented 4 years ago

This is a common situation with a VM-based build farm. If the VM host is shut down for a while, all the guests are suspended. When they're woken up, all the Buildkite agents in the guests are stuck in this permanently failed state until they're manually restarted.

Having the agent reauthenticate in response to a "401 Invalid access token" response, either in any context or just for ping and heartbeat requests, would be a nice improvement.
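As a rough illustration of the reauthenticate-on-401 idea, here is a minimal sketch. It is not the agent's actual code: `apiError`, `session`, `register`, `ping` and `pingWithReauth` are all hypothetical names, and the register and ping calls are injected as plain functions so the example runs without touching the Buildkite API.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical error type carrying the HTTP status of a failed API call.
type apiError struct {
	Status int
	Msg    string
}

func (e *apiError) Error() string { return fmt.Sprintf("%d %s", e.Status, e.Msg) }

// isInvalidToken reports whether err wraps an apiError with status 401.
func isInvalidToken(err error) bool {
	var ae *apiError
	return errors.As(err, &ae) && ae.Status == http.StatusUnauthorized
}

// session holds the current access token. register and ping are
// hypothetical stand-ins for the real API calls.
type session struct {
	token    string
	register func() (string, error)
	ping     func(token string) error
}

// pingWithReauth pings once. On a 401 it registers again to obtain a
// fresh token and retries the ping a single time.
func (s *session) pingWithReauth() error {
	err := s.ping(s.token)
	if !isInvalidToken(err) {
		return err // success, or an error that re-registering would not fix
	}
	token, regErr := s.register()
	if regErr != nil {
		return fmt.Errorf("re-register after 401: %w", regErr)
	}
	s.token = token
	return s.ping(s.token)
}

func main() {
	s := &session{
		token:    "stale",
		register: func() (string, error) { return "fresh", nil },
		ping: func(token string) error {
			if token != "fresh" {
				return &apiError{Status: http.StatusUnauthorized, Msg: "Invalid access token"}
			}
			return nil
		},
	}
	fmt.Println(s.pingWithReauth(), s.token)
}
```

Retrying only once after re-registering keeps a permanently rejected token from turning into a tight registration loop.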

Having it exit, so an external process watcher can restart it, wouldn't be quite as good - particularly for Windows and macOS users, who are more likely to be running it ad-hoc rather than under a process watcher - but would be significantly better than the current situation.

yob commented 4 years ago

The sleeping VM use case is an interesting one. I haven't heard of many customers using that model, but it's a totally valid way to run the agent and it'd be great to handle it better.

Both of the suggested approaches sound workable. I think we might lean slightly towards the "exit on invalid access token" approach, mainly because it's simpler. It does mean we'll rely on an init system to restart, but I think the vast majority of running agents will have something managing them.
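The "exit on invalid access token" approach could look roughly like the sketch below. It assumes a hypothetical sentinel error `errInvalidToken` and hypothetical helpers `pingLoop` and `exitCode`; none of these are taken from the agent's source. Transient failures keep the loop going, while an invalid token ends it so the process can exit with a non-zero status and be restarted by an init system.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidToken is a hypothetical sentinel for a 401 "Invalid access token" response.
var errInvalidToken = errors.New("401 Invalid access token")

// pingLoop calls ping up to n times. Transient errors are reported and the
// loop carries on, but an invalid token ends the loop immediately.
func pingLoop(ping func() error, n int) error {
	for i := 0; i < n; i++ {
		err := ping()
		switch {
		case err == nil:
			continue
		case errors.Is(err, errInvalidToken):
			return fmt.Errorf("giving up after ping %d: %w", i+1, err)
		default:
			fmt.Println("ping failed, will retry:", err)
		}
	}
	return nil
}

// exitCode maps the loop result to a process exit status.
func exitCode(err error) int {
	if errors.Is(err, errInvalidToken) {
		return 1
	}
	return 0
}

func main() {
	calls := 0
	ping := func() error {
		calls++
		if calls == 3 {
			return errInvalidToken // simulate the token being rejected
		}
		return nil
	}
	err := pingLoop(ping, 10)
	// A real agent would call os.Exit(exitCode(err)) here; the sketch only prints it.
	fmt.Println("exit code:", exitCode(err), "after", calls, "pings")
}
```

The trade-off matches the one discussed above: this is simpler than re-registering in-process, but it only helps when something is supervising the agent and restarts it.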

We're pretty snowed under at the moment, but if you do find time to work on a PR we'll be happy to review it!

The relevant code is here: https://github.com/buildkite/agent/blob/750b8063a5f65629dfb2b7c3232be08f554f7e9f/agent/agent_worker.go#L348-L354

I haven't tested this at all, but I note that further down in that function we stop the agent with a.Stop(false). It'd be interesting to see what happens if that's called after an invalid access token error.