GitHub Actions self-hosted runner keeps stuck between jobs

yeputons commented 2 years ago

Describe the bug I have an organization with ~100 private repositories for a programming course (one repository per student). There are some organization-wide self-hosted runners (both on Windows and Linux). Each student may push to their repository and trigger 14 GitHub Actions-based jobs which check their assignment in various scenarios.

Today I've discovered that runners are getting stuck on a regular basis:

The runner is up and running.
The last message in runner's log is either "Job ..... completed with result: ..." or "Listening for Jobs". No extra processes are running and no jobs are picked even though there are lots of things to do.
Organization-wide settings show that the runner is running some job, i.e. it's "Active".
If I stop the runner, it goes "Offline" on the dashboard, but the job is still "running" and I'm unable to remove the runner because the job is still running. Both config.cmd and "Force remove" produce the same error.

In all cases, canceling either the job or the workflow does nothing: cancellation is "requested", but nothing happens, even if there are exactly 0 runners online, both according to the GitHub organization's "Actions / Runners" section and according to list of processes on each machine. Cancellation does happen, randomly, after about 5 minutes.

A runner may be in such state for five minutes easily. Some get unstuck automatically, some have to be restarted once or even multiple times. A runner may work fine for half-hour and then get stuck. It may get stuck after running a few jobs only. Typically all runners get stuck at the same time or very close to each other, but sometimes some are stuck and some continue running.

Unfortunately, I was unable to find a tendency: all machines have plenty of disk space, even a runner with freshly removed _work/_diag directories may get immediately stuck.

To Reproduce Steps to reproduce the behavior:

Start a bunch of self-hosted runners version 283.1: 1 on one Linux machine, 4 on another Linux machine, 1 on a Windows machine.
Wait for about 15 minutes.
One of runners gets randomly stuck, although the "Runners" section under organization's settings shows the a job is assigned and running.

Expected behavior Runner either picks up jobs regularly or says that something is wrong with disk space/connection/long polling/whatever.

Runner Version and Platform

Version of your runner? 283.1

OS of the machine running the runner? Windows and Linux, problem occur on both.

What's not working?

Please include error messages and screenshots.

Job Log Output

Not a problem with a specific job.

Runner and Worker's Diagnostic Logs

Worker logs are on per-job basis and they look exactly the same for last jobs before getting stuck and previous ones.

Runner's logs between jobs:

[2021-09-27 19:41:57Z INFO Terminal] WRITE LINE: 2021-09-27 19:41:57Z: Job clang-tidy completed with result: Succeeded
[2021-09-27 19:41:57Z INFO JobDispatcher] Stop renew job request for job 2d12411e-81da-5e1f-da81-838d5e7090c2.
[2021-09-27 19:41:57Z INFO JobDispatcher] job renew has been canceled, stop renew job request 37312.
[2021-09-27 19:41:57Z INFO JobNotification] Entering JobCompleted Notification
[2021-09-27 19:41:57Z INFO JobNotification] Entering EndMonitor
[2021-09-27 19:42:01Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.cr
edentials_rsaparams
[2021-09-27 19:42:01Z INFO MessageListener] Message '3103' received from session '54269edf-e632-4c8a-b7a3-59f776ed2d0b'.
[2021-09-27 19:42:01Z INFO JobDispatcher] Job request 37308 for plan 5ebed12e-9534-4dd9-a025-9b548a6f7655 job d8746438-a
da5-5be2-4171-bc62b3cab2b7 received.
[2021-09-27 19:42:01Z INFO JobDispatcher] Pull OrchestrationId 5ebed12e-9534-4dd9-a025-9b548a6f7655.compile-run.g__-10_-
-version_-std_c__17_-wall_-wextra_-werror_-o_tests_hse-cpp-lin-prod_-g_-fsanitize_undefine_2 from JWT claims
[2021-09-27 19:42:01Z INFO JobDispatcher] Job request 2d12411e-81da-5e1f-da81-838d5e7090c2 processed succeed.
[2021-09-27 19:42:01Z INFO Terminal] WRITE LINE: 2021-09-27 19:42:01Z: Running job: compile-run (g++-10, --version, -std
=c++17 -Wall -Wextra -Werror -o tests, hse-cpp-lin-prod, -g -...
[2021-09-27 19:42:01Z INFO JobDispatcher] Start renew job request 37308 for job d8746438-ada5-5be2-4171-bc62b3cab2b7.
[2021-09-27 19:42:01Z INFO JobDispatcher] Successfully renew job request 37308, job is valid till 9/27/2021 7:52:01 PM
[2021-09-27 19:42:01Z INFO HostContext] Well known directory 'Bin': '/home/gh-runner/actions-runner/01/bin.2.283.1'
[2021-09-27 19:42:01Z INFO ProcessInvokerWrapper] Starting process:
[2021-09-27 19:42:01Z INFO ProcessInvokerWrapper]   File name: '/home/gh-runner/actions-runner/01/bin.2.283.1/Runner.Wor
ker'

Runner's last lines when stuck:

[2021-09-27 19:43:59Z INFO Terminal] WRITE LINE: 2021-09-27 19:43:59Z: Job compile-run (clang++-12, --version, -stdlib=libstdc++ -std=c++17 -Wall -Wextra -Werror -o tests, ... completed with result: Succeeded
[2021-09-27 19:43:59Z INFO JobDispatcher] Stop renew job request for job b0569718-4f62-5db2-a057-45bbc09af1c4.
[2021-09-27 19:43:59Z INFO JobDispatcher] job renew has been canceled, stop renew job request 37332.
[2021-09-27 19:43:59Z INFO JobNotification] Entering JobCompleted Notification
[2021-09-27 19:43:59Z INFO JobNotification] Entering EndMonitor

Logs on Windows are identical.

Restaring the runner helped unstuck it in that particular case, relevant logs around receiving new job (mind that the queue is full!):

[2021-09-27 19:49:18Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.cr
edentials_rsaparams
[2021-09-27 19:49:18Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.cr
edentials_rsaparams
[2021-09-27 19:49:19Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2021-09-27 19:49:19Z INFO MessageListener] Session created.
[2021-09-27 19:49:19Z INFO Terminal] WRITE LINE: 2021-09-27 19:49:19Z: Listening for Jobs
[2021-09-27 19:49:19Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
[2021-09-27 19:49:23Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.cr
edentials_rsaparams
[2021-09-27 19:49:23Z INFO MessageListener] Message '3106' received from session '717d30bd-0fc7-4c7f-a719-27467a9b408d'.
[2021-09-27 19:49:23Z INFO JobDispatcher] Job request 37423 for plan 255dd6bd-5720-4a4d-a23c-3be3ad40e1e5 job 71490d96-a
242-5f1c-bb38-546ccf5abc39 received.
[2021-09-27 19:49:23Z INFO JobDispatcher] Pull OrchestrationId 255dd6bd-5720-4a4d-a23c-3be3ad40e1e5.cppcheck.__default f
rom JWT claims
[2021-09-27 19:49:23Z INFO Terminal] WRITE LINE: 2021-09-27 19:49:23Z: Running job: cppcheck
[2021-09-27 19:49:23Z INFO JobDispatcher] Start renew job request 37423 for job 71490d96-a242-5f1c-bb38-546ccf5abc39.
[2021-09-27 19:49:23Z INFO JobDispatcher] Successfully renew job request 37423, job is valid till 9/27/2021 7:59:23 PM
[2021-09-27 19:49:23Z INFO HostContext] Well known directory 'Bin': '/home/gh-runner/actions-runner/01/bin.2.283.1'
[2021-09-27 19:49:23Z INFO ProcessInvokerWrapper] Starting process:
[2021-09-27 19:49:23Z INFO ProcessInvokerWrapper]   File name: '/home/gh-runner/actions-runner/01/bin.2.283.1/Runner.Worker'

yeputons commented 2 years ago

Here is another example:

Runner is stuck:

It is "Active" in dashboard:

It is running something according to GitHub:

And it got a job after 1.5m of waiting. I consider this an almost success

[2021-09-27 19:54:24Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.cr
edentials_rsaparams
[2021-09-27 19:54:24Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.cr
edentials_rsaparams
[2021-09-27 19:54:24Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2021-09-27 19:54:25Z INFO MessageListener] Session created.
[2021-09-27 19:54:25Z INFO Terminal] WRITE LINE: 2021-09-27 19:54:25Z: Listening for Jobs
[2021-09-27 19:54:25Z INFO JobDispatcher] Set runner/worker IPC timeout to 30 seconds.
[2021-09-27 19:55:52Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/gh-runner/actions-runner/01/.credentials_rsaparams
[2021-09-27 19:55:52Z INFO MessageListener] Message '3117' received from session '22f1d42d-b3d1-4a80-9039-78f9f46b1d4a'.
[2021-09-27 19:55:52Z INFO JobDispatcher] Job request 37509 for plan 183aece2-14f8-468c-ae4b-f416655c7c91 job f67d49c6-4e6a-5df7-ab4d-b4b619afa22a received.
[2021-09-27 19:55:52Z INFO JobDispatcher] Pull OrchestrationId 183aece2-14f8-468c-ae4b-f416655c7c91.clang-format.__default from JWT claims
[2021-09-27 19:55:52Z INFO Terminal] WRITE LINE: 2021-09-27 19:55:52Z: Running job: clang-format

yeputons commented 2 years ago

Also, on Windows runner sometimes pressing Ctrl+C helps the runner get immediately unstuck instead of exiting. But maybe it's just I mouse-selected a part of the terminal accidentally and froze it, not sure.

yeputons commented 2 years ago

I suspect it may be related to a recent GitHub Actions incident, which started about 30 minute after I filed the issue. I was having issues for few hours before that, although it still may be related.

fhammerl commented 2 years ago

Hi 👋 @yeputons, Thanks for this great and detailed write-up on the issue you experienced. Please let us know if the issue happens again, independent from an incident.

yeputons commented 2 years ago

Two more Mondays with a similar load profile passed, no issues. So I think it was related to the incident.

Closing for now, I will reopen with new details if it happens again.

actions / runner