MessageQueueToken error - failed to start message queue listener #3591

Controller Version


Deployment Method



To Reproduce

After runner pods have been running for a long time, the listener seems to fail and then either terminates or stops bringing up new runner pods. The listener logs the error below, which seems to point to a problem refreshing the message queue token. The only way we've found to get things working again is to delete the full Helm release and re-install it.

2024-06-11T18:39:56Z    ERROR   Error encountered   {"error": "failed to start message queue listener: could not get and process message. get message failed from refreshing client. get message failed. Get \"\": GET giving up after 1 attempt(s): context canceled"}
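
For anyone trying to confirm they are hitting the same failure, here is a rough sketch of a log check in Python. It assumes the listener log lines have the same shape as the one above (timestamp, level, message, then a JSON object with an `error` field). The function names `parse_listener_log` and `is_queue_refresh_failure` are made up for illustration and are not part of the controller or listener.

```python
import json


def parse_listener_log(line):
    """Split a log line into (timestamp, level, message, fields).

    Assumes the layout seen in this issue: whitespace-separated timestamp
    and level, a free-text message, then an optional trailing JSON object.
    Returns None if the line has fewer than three parts.
    """
    parts = line.split(None, 2)  # timestamp, level, remainder
    if len(parts) < 3:
        return None
    timestamp, level, rest = parts
    brace = rest.find("{")
    if brace == -1:
        # No structured fields on this line.
        return timestamp, level, rest.strip(), {}
    message = rest[:brace].strip()
    try:
        fields = json.loads(rest[brace:])
    except json.JSONDecodeError:
        fields = {}
    return timestamp, level, message, fields


def is_queue_refresh_failure(line):
    """True if the line looks like the message queue refresh error above."""
    parsed = parse_listener_log(line)
    if parsed is None:
        return False
    _, level, _, fields = parsed
    error = str(fields.get("error", ""))
    return (
        level == "ERROR"
        and "failed to start message queue listener" in error
        and "refreshing client" in error
    )


# The log line from this issue, copied verbatim (raw string keeps the \" escapes).
sample = r'2024-06-11T18:39:56Z    ERROR   Error encountered   {"error": "failed to start message queue listener: could not get and process message. get message failed from refreshing client. get message failed. Get \"\": GET giving up after 1 attempt(s): context canceled"}'

if __name__ == "__main__":
    print(is_queue_refresh_failure(sample))
```

Feeding saved listener logs through `is_queue_refresh_failure` line by line would show whether the failures cluster around a particular time, such as an outage window.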

Describe the bug

The runner scale set listener fails to refresh its message queue token and gets stuck.

Describe the expected behavior

The runner listener should be able to refresh its client.

Additional Context

Controller Logs

Runner Pod Logs
Looks like this may have been a consequence of the GitHub-wide issue with the API and GitHub Actions.

I'm not sure if the issues I saw today are related to this, but after today's outage all of my runners got stuck in a bad state and have been completely unrecoverable since. I had to create an entirely new runner set to get CI back up and running properly.

Hey everyone, I'm fairly certain this one was due to an incident, so I will close it now. Please let us know if the issue persists :relaxed: