self-hosted runner on odroid xu4 doesn't always start the job

igagis commented 3 years ago

Describe the bug I have a self-hosted runner on odroid xu4 ARM single board computer and it quite often does not start the job. I also have self-hosted runner on raspberry pi 3 and there is no such problem with it.

To Reproduce Reproducibility ~50%

Steps to reproduce the behavior:

Set up the self-hosted runner on odroid xu4
Make a commit to the repo to start the workflow
The job which is supposed to be run by odroid xu4 remains in "Starting workflow run..." state for about 10 minutes, then it fails.

Note: The runner itself in that time is marked as "Active".

Looks like the runner picks the job, but does not start it, since when raspberry pi 3 runner gets freed it does not pick that stalled job, so it looks like it is considered already picked by odroid xu4.

Expected behavior The runner on odroid xu4 board should pick up the job and execute it.

Runner Version and Platform

runner version 2.278.0

OS: Armbian Buster (Linux)

What's not working?

The job is not started.

Job Log Output

Runner and Worker's Diagnostic Logs

Worker_20210622-121039-utc.log

igagis commented 3 years ago

Today it happened with my raspberry_pi_3 runner as well. Also, I have seen it once with public macos-latest runner.

igagis commented 3 years ago

from today it started giving the following message in the job, though the job is not picked by the runner:

as one can see it's been 5 minutes and the job is not picked, though the message clearly says that there is an online idle self-hosted runner.

igagis commented 3 years ago

in case you don't have odroid xu4 I can offer mine for analysis. I can give ssh access to my odroid xu4.

igagis commented 3 years ago

it looks like the runner fails the job immediately. in svc.sh status output there are these lines:

Jul 04 21:31:29 xu4r1 runsvc.sh[1083]: 2021-07-04 18:31:29Z: Running job: linux - arm32v7/debian:buster
Jul 04 21:31:35 xu4r1 runsvc.sh[1083]: 2021-07-04 18:31:35Z: Job linux - arm32v7/debian:buster completed with result: Failed

but the job in github actions web ui still remains in "waiting for runner to pick up the job" state.

Question is why does the runner fail the job and why isn't it reflected in web ui?

Corresponding worker log: Worker_20210704-183130-utc.log

ChristopherHX commented 3 years ago

I suggest to upload the Runner Log(s) (same folder as worker logs in _diag) not only the log(s) of the Worker. Most likly the worker has been cancelled or killed by the runner. AFAIK The runner (not worker) is responsible for sending the jobcompleted event to github, in your case it fails or is not even tried.

igagis commented 3 years ago

Thanks for the suggestion!

Attaching Runner log: Runner_20210701-171501-utc.log

The relevant log lines for 20210704-183130 job seem to be the following:

[2021-07-04 18:31:29Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/ivan/actions-runner/.credentials_rsaparams
[2021-07-04 18:31:29Z INFO MessageListener] Message '292' received from session 'b24b2646-f5cc-4292-84a8-1ebca712eac2'.
[2021-07-04 18:31:29Z INFO JobDispatcher] Job request 28358 for plan eb7fcd95-47cc-4bb3-a597-dba4c0450ee6 job 5c6ded40-5d0a-567d-b3cc-38faeb055c98 received.
[2021-07-04 18:31:29Z INFO JobDispatcher] Pull OrchestrationId eb7fcd95-47cc-4bb3-a597-dba4c0450ee6.linux.debian_buster_arm32v7__deb_arm_docker from JWT claims
[2021-07-04 18:31:29Z INFO JobDispatcher] Job request 557db168-971f-5d68-f8c4-618a1704386d processed succeed.
[2021-07-04 18:31:29Z INFO Terminal] WRITE LINE: 2021-07-04 18:31:29Z: Running job: linux - arm32v7/debian:buster
[2021-07-04 18:31:29Z INFO JobDispatcher] Start renew job request 28358 for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-04 18:31:29Z INFO JobDispatcher] Successfully renew job request 28358, job is valid till 7/4/2021 6:41:29 PM
[2021-07-04 18:31:29Z INFO HostContext] Well known directory 'Bin': '/home/ivan/actions-runner/bin'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Starting process:
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   File name: '/home/ivan/actions-runner/bin/Runner.Worker'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Arguments: 'spawnclient 104 108'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Working directory: '/home/ivan/actions-runner/bin'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Require exit code zero: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'True'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper]   High priority process: 'True'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Process started with process id 28111, waiting for process exit.
[2021-07-04 18:31:29Z INFO JobDispatcher] Send job request message to worker for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-04 18:31:29Z INFO JobNotification] Entering JobStarted Notification
[2021-07-04 18:31:29Z INFO JobNotification] Entering StartMonitor
[2021-07-04 18:31:34Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-04 18:31:34Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-04 18:31:34Z INFO ProcessInvokerWrapper] Finished process 28111 with exit code 139, and elapsed time 00:00:04.6575405.
[2021-07-04 18:31:34Z INFO JobDispatcher] Worker finished for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98. Code: 139
[2021-07-04 18:31:34Z INFO JobDispatcher] Return code 139 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
[2021-07-04 18:31:34Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-07-04 18:31:34Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-07-04 18:31:35Z INFO JobDispatcher] finish job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98 with result: Failed
[2021-07-04 18:31:35Z INFO Terminal] WRITE LINE: 2021-07-04 18:31:35Z: Job linux - arm32v7/debian:buster completed with result: Failed
[2021-07-04 18:31:35Z INFO JobDispatcher] Stop renew job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-04 18:31:35Z INFO JobDispatcher] job renew has been canceled, stop renew job request 28358.
[2021-07-04 18:31:35Z INFO JobNotification] Entering JobCompleted Notification
[2021-07-04 18:31:35Z INFO JobNotification] Entering EndMonitor

From these it is visible that worker exited with code 139 in about 4 seconds. Which means:

[2021-07-04 18:31:34Z INFO JobDispatcher] Return code 139 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.

So, it looks like worker crashes for some reason. And the job failure is not reported to github for some reason...

igagis commented 3 years ago

Fresh Runner logs:

[2021-07-14 19:47:17Z INFO ProcessInvokerWrapper] Process started with process id 2338, waiting for process exit.
[2021-07-14 19:47:17Z INFO JobDispatcher] Send job request message to worker for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-14 19:47:17Z INFO JobNotification] Entering JobStarted Notification
[2021-07-14 19:47:17Z INFO JobNotification] Entering StartMonitor
[2021-07-14 19:47:21Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-14 19:47:21Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-14 19:47:21Z INFO ProcessInvokerWrapper] Finished process 2338 with exit code 134, and elapsed time 00:00:04.6299561.
[2021-07-14 19:47:21Z INFO JobDispatcher] Worker finished for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98. Code: 134
[2021-07-14 19:47:21Z INFO JobDispatcher] Return code 134 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
[2021-07-14 19:47:21Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-07-14 19:47:22Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-07-14 19:47:22Z INFO JobDispatcher] finish job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98 with result: Failed
[2021-07-14 19:47:22Z INFO Terminal] WRITE LINE: 2021-07-14 19:47:22Z: Job linux - arm32v7/debian:buster completed with result: Failed
[2021-07-14 19:47:22Z INFO JobDispatcher] Stop renew job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-14 19:47:22Z INFO JobDispatcher] job renew has been canceled, stop renew job request 31914.
[2021-07-14 19:47:22Z ERR  JobDispatcher] Unhandled exception happened in worker:
[2021-07-14 19:47:22Z ERR  JobDispatcher] Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at GitHub.Runner.Sdk.IOUtil+<>c__DisplayClass9_0.<DeleteDirectory>b__0(System.IO.FileSystemInfo)
[2021-07-14 19:47:22Z INFO JobNotification] Entering JobCompleted Notification
[2021-07-14 19:47:22Z INFO JobNotification] Entering EndMonitor

Here it says:

[2021-07-14 19:47:22Z ERR  JobDispatcher] Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at GitHub.Runner.Sdk.IOUtil+<>c__DisplayClass9_0.<DeleteDirectory>b__0(System.IO.FileSystemInfo)

Looks like it tried to delete some directory and System.AccessViolationException happened... hm...

igagis commented 3 years ago

sometimes it fails with code 132:

[2021-07-14 20:00:53Z INFO JobDispatcher] Worker finished for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98. Code: 132
[2021-07-14 20:00:53Z INFO JobDispatcher] Return code 132 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.

And in this cases there is no information about System.AccessViolationException.

Log from the post above: https://github.com/actions/runner/issues/1168#issuecomment-874038315 mentions exit code 139.

So, at the moment I have observed exit codes 132, 134, 139. Or is it a random number at all?

ChristopherHX commented 3 years ago

For me it looks like dotnet has a device specfic bug. I used this runner on my raspberry Pi4 arm32 / arm64 just fine without loosing any jobs.

Or is it a random number at all

I think it's the unix signal number + 128,

132 is signal 4 invalid instruction
134 is signal 6 abort was called
139 is signal 11 segmentation fault

Not directly related to this issue... In the meantime, if you want to try my reverse engineered third party runner github-actions-act-runner, it most likly doesn't have this System.AccessViolationException errors.

Make shure to download node12 and add it as node in your PATH variable (It is based on act), if you aren't using a job container
It has limited functionality and might still be a bit unstable, see issues of act
You can only use docker actions if you are using a job container
You can find prebuilds in the release v0.0.2

igagis commented 3 years ago

I'd be happy to use alternative runner implementation if that allowed me to run my jobs. I'm using job containers and I'm using composite actions, I'm not using docker actions. So, looks like, unfortunately, your runner does not suite my needs at its current level of feature support :(.

ChristopherHX commented 3 years ago

I'm using job containers and I'm using composite actions, I'm not using docker actions.

I'm not quite shure if you understand the feature level correctly. Job container and composite actions both are working for me.

igagis commented 3 years ago

You can only use docker actions if you are using a job container

I understood this as if I use job container then I can use only docker actions

Hehe, but now I see that you meant that docker actions are only working with job container.

I'll give it a try then!

igagis commented 3 years ago

I have quickly checked the https://github.com/ChristopherHX/github-actions-act-runner and it definitely doesn't have such problem. Evaluating it to start using on a daily basis.

igagis commented 3 years ago

Today I get similar behavior from hosted linux runners:

As for the self-hosted runner problem... I moved to using https://github.com/ChristopherHX/github-act-runner and after some work together with author it is now pretty usable, can be installed from deb-repository for variety of architectures and configured using friendly command line tool.

actions / runner