Here is another example:
Runner is stuck (screenshot).
It is "Active" in the dashboard (screenshot).
It is running something according to GitHub (screenshot).
And it got a job after 1.5 minutes of waiting. I consider this an almost-success:
[2021-09-27 19:54:24Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.credentials_rsaparams
[2021-09-27 19:54:24Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.credentials_rsaparams
[2021-09-27 19:54:24Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2021-09-27 19:54:25Z INFO MessageListener] Session created.
[2021-09-27 19:54:25Z INFO Terminal] WRITE LINE: 2021-09-27 19:54:25Z: Listening for Jobs
[2021-09-27 19:54:25Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
[2021-09-27 19:55:52Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.credentials_rsaparams
[2021-09-27 19:55:52Z INFO MessageListener] Message '3117' received from session '22f1d42d-b3d1-4a80-9039-78f9f46b1d4a'.
[2021-09-27 19:55:52Z INFO JobDispatcher] Job request 37509 for plan 183aece2-14f8-468c-ae4b-f416655c7c91 job f67d49c6-4e6a-5df7-ab4d-b4b619afa22a received.
[2021-09-27 19:55:52Z INFO JobDispatcher] Pull OrchestrationId 183aece2-14f8-468c-ae4b-f416655c7c91.clang-format.__default from JWT claims
[2021-09-27 19:55:52Z INFO Terminal] WRITE LINE: 2021-09-27 19:55:52Z: Running job: clang-format
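The 1.5-minute gap above is visible between the "Listening for Jobs" and "Running job" lines. A quick way to pull those out of the diagnostic logs, as a sketch assuming the default _diag location under the runner directory shown above:

# Extract listen/dispatch timestamps from the runner's diagnostic logs
grep -h -E "Listening for Jobs|Running job" /home/gh-runner/actions-runner/01/_diag/Runner_*.log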
Also, on the Windows runner, pressing Ctrl+C sometimes gets the runner immediately unstuck instead of making it exit. But maybe I had just accidentally mouse-selected part of the terminal and frozen its output; I'm not sure.
I suspect it may be related to a recent GitHub Actions incident, which started about 30 minutes after I filed this issue. I had been having issues for a few hours before that, although it may still be related.
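For what it's worth, incidents can be checked programmatically: githubstatus.com is a standard Statuspage site, so (assuming the usual Statuspage v2 endpoints) a minimal check looks like this:

# Overall status of GitHub's services
curl -s https://www.githubstatus.com/api/v2/status.json
# Any ongoing, unresolved incidents
curl -s https://www.githubstatus.com/api/v2/incidents/unresolved.json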
Hi 👋 @yeputons, thanks for this great and detailed write-up of the issue you experienced. Please let us know if the issue happens again, independently of an incident.
Two more Mondays with a similar load profile have passed with no issues, so I think it was related to the incident.
Closing for now; I will reopen with new details if it happens again.
Describe the bug
I have an organization with ~100 private repositories for a programming course (one repository per student). There are some organization-wide self-hosted runners (both on Windows and Linux). Each student may push to their repository, which triggers 14 GitHub Actions-based jobs that check their assignment in various scenarios.
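For reference, the runners are registered at the organization level so all repositories can share them. A minimal registration sketch on Linux, where ORG_NAME and REGISTRATION_TOKEN are placeholders:

# Register an organization-wide self-hosted runner (use config.cmd on Windows);
# the token comes from the org's Settings -> Actions -> Runners -> "New runner" page
./config.sh --url https://github.com/ORG_NAME --token REGISTRATION_TOKEN --unattended
# Start listening for jobs
./run.sh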
Today I've discovered that runners are getting stuck on a regular basis. Trying to remove a stuck runner via config.cmd and via "Force remove" produces the same error. In all cases, canceling either the job or the workflow does nothing: cancellation is "requested", but nothing happens, even if there are exactly 0 runners online, both according to the GitHub organization's "Actions / Runners" section and according to the list of processes on each machine. Cancellation does happen, randomly, after about 5 minutes.
A runner may easily be in such a state for five minutes. Some get unstuck automatically; some have to be restarted once or even multiple times. A runner may work fine for half an hour and then get stuck, or it may get stuck after running only a few jobs. Typically all runners get stuck at the same time or very close to each other, but sometimes some are stuck while others continue running.
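By "restarted" I mean bouncing the runner process or service. On Linux, assuming the runner was installed as a service with svc.sh, a sketch:

# From the runner's installation directory
sudo ./svc.sh status   # check whether the service is running
sudo ./svc.sh stop
sudo ./svc.sh start
# When running interactively instead, stop run.sh and launch it again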
Unfortunately, I was unable to find a pattern: all machines have plenty of disk space, and even a runner with freshly removed _work and _diag directories may get stuck immediately.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The runner either picks up jobs regularly or reports that something is wrong with disk space, connection, long polling, or whatever else.
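Until something like that exists, a crude self-check is to poll the runners API and compare GitHub's view with reality. A sketch, where ORG_NAME and GHP_TOKEN are placeholders and jq is required:

# Print each runner's name with the status GitHub believes it has
curl -s -H "Authorization: token GHP_TOKEN" \
  https://api.github.com/orgs/ORG_NAME/actions/runners |
  jq -r '.runners[] | "\(.name)  status=\(.status)  busy=\(.busy)"'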
Runner Version and Platform
Version of your runner? 283.1
OS of the machine running the runner? Windows and Linux; the problem occurs on both.
What's not working?
Runners get stuck and stop picking up queued jobs even though they appear online; see the description above.
Job Log Output
Not a problem with a specific job.
Runner and Worker's Diagnostic Logs
Worker logs are on a per-job basis, and they look exactly the same for the last jobs before getting stuck as for earlier ones.
Runner's logs between jobs:
Runner's last lines when stuck:
Logs on Windows are identical.
Restarting the runner got it unstuck in that particular case; here are the relevant logs around receiving a new job (note that the queue is full!):