Open igagis opened 3 years ago
Today it happened with my raspberry_pi_3 runner as well. Also, I have seen it once with the public macos-latest runner.
From today it started giving the following message in the job, though the job is not picked up by the runner:
As one can see, it's been 5 minutes and the job has not been picked up, though the message clearly says that there is an online idle self-hosted runner.
In case you don't have an odroid xu4, I can offer mine for analysis. I can give SSH access to my odroid xu4.
It looks like the runner fails the job immediately. In the svc.sh status output there are these lines:
Jul 04 21:31:29 xu4r1 runsvc.sh[1083]: 2021-07-04 18:31:29Z: Running job: linux - arm32v7/debian:buster
Jul 04 21:31:35 xu4r1 runsvc.sh[1083]: 2021-07-04 18:31:35Z: Job linux - arm32v7/debian:buster completed with result: Failed
But the job in the GitHub Actions web UI still remains in the "waiting for runner to pick up the job" state.
The question is: why does the runner fail the job, and why isn't it reflected in the web UI?
Corresponding worker log: Worker_20210704-183130-utc.log
I suggest uploading the Runner log(s) (same folder as the worker logs in _diag), not only the log(s) of the Worker. Most likely the worker has been cancelled or killed by the runner. AFAIK the runner (not the worker) is responsible for sending the jobcompleted event to GitHub; in your case it fails or is not even attempted.
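To make that split of responsibilities concrete, here is a minimal Go sketch of the pattern (an illustration only, not the actual C# runner code; reportJobCompleted is a hypothetical stand-in for the call back to the GitHub Actions service, and the worker path and job id are just taken from the logs below): the runner spawns the worker as a child process, waits for it, and is the one that reports the result, even if the worker crashes.

package main

import (
	"log"
	"os/exec"
)

// reportJobCompleted is a hypothetical placeholder for the call the runner
// makes back to the GitHub Actions service; the real API differs.
func reportJobCompleted(jobID, result string) {
	log.Printf("job %s completed with result: %s", jobID, result)
}

func main() {
	// The runner (parent) spawns the worker as a child process and waits for it.
	worker := exec.Command("./bin/Runner.Worker", "spawnclient", "104", "108")
	result := "Succeeded"
	if err := worker.Run(); err != nil {
		// Any crash of the worker (e.g. exit code 128+signal) is mapped to Failed.
		result = "Failed"
	}
	// Crucially, the completion event is sent by the runner, not by the worker,
	// so even a crashed worker should still be reported back as Failed.
	reportJobCompleted("5c6ded40-5d0a-567d-b3cc-38faeb055c98", result)
}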
Thanks for the suggestion!
Attaching Runner log: Runner_20210701-171501-utc.log
The relevant log lines for the 20210704-183130 job seem to be the following:
[2021-07-04 18:31:29Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/ivan/actions-runner/.credentials_rsaparams
[2021-07-04 18:31:29Z INFO MessageListener] Message '292' received from session 'b24b2646-f5cc-4292-84a8-1ebca712eac2'.
[2021-07-04 18:31:29Z INFO JobDispatcher] Job request 28358 for plan eb7fcd95-47cc-4bb3-a597-dba4c0450ee6 job 5c6ded40-5d0a-567d-b3cc-38faeb055c98 received.
[2021-07-04 18:31:29Z INFO JobDispatcher] Pull OrchestrationId eb7fcd95-47cc-4bb3-a597-dba4c0450ee6.linux.debian_buster_arm32v7__deb_arm_docker from JWT claims
[2021-07-04 18:31:29Z INFO JobDispatcher] Job request 557db168-971f-5d68-f8c4-618a1704386d processed succeed.
[2021-07-04 18:31:29Z INFO Terminal] WRITE LINE: 2021-07-04 18:31:29Z: Running job: linux - arm32v7/debian:buster
[2021-07-04 18:31:29Z INFO JobDispatcher] Start renew job request 28358 for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-04 18:31:29Z INFO JobDispatcher] Successfully renew job request 28358, job is valid till 7/4/2021 6:41:29 PM
[2021-07-04 18:31:29Z INFO HostContext] Well known directory 'Bin': '/home/ivan/actions-runner/bin'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Starting process:
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] File name: '/home/ivan/actions-runner/bin/Runner.Worker'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Arguments: 'spawnclient 104 108'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Working directory: '/home/ivan/actions-runner/bin'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Require exit code zero: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: ''
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'True'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Persist current code page: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] High priority process: 'True'
[2021-07-04 18:31:29Z INFO ProcessInvokerWrapper] Process started with process id 28111, waiting for process exit.
[2021-07-04 18:31:29Z INFO JobDispatcher] Send job request message to worker for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-04 18:31:29Z INFO JobNotification] Entering JobStarted Notification
[2021-07-04 18:31:29Z INFO JobNotification] Entering StartMonitor
[2021-07-04 18:31:34Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-04 18:31:34Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-04 18:31:34Z INFO ProcessInvokerWrapper] Finished process 28111 with exit code 139, and elapsed time 00:00:04.6575405.
[2021-07-04 18:31:34Z INFO JobDispatcher] Worker finished for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98. Code: 139
[2021-07-04 18:31:34Z INFO JobDispatcher] Return code 139 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
[2021-07-04 18:31:34Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-07-04 18:31:34Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-07-04 18:31:35Z INFO JobDispatcher] finish job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98 with result: Failed
[2021-07-04 18:31:35Z INFO Terminal] WRITE LINE: 2021-07-04 18:31:35Z: Job linux - arm32v7/debian:buster completed with result: Failed
[2021-07-04 18:31:35Z INFO JobDispatcher] Stop renew job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-04 18:31:35Z INFO JobDispatcher] job renew has been canceled, stop renew job request 28358.
[2021-07-04 18:31:35Z INFO JobNotification] Entering JobCompleted Notification
[2021-07-04 18:31:35Z INFO JobNotification] Entering EndMonitor
From these it is visible that the worker exited with code 139 after about 4 seconds. Which means:
[2021-07-04 18:31:34Z INFO JobDispatcher] Return code 139 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
So, it looks like the worker crashes for some reason, and the job failure is not reported to GitHub for some reason...
Fresh Runner logs:
[2021-07-14 19:47:17Z INFO ProcessInvokerWrapper] Process started with process id 2338, waiting for process exit.
[2021-07-14 19:47:17Z INFO JobDispatcher] Send job request message to worker for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-14 19:47:17Z INFO JobNotification] Entering JobStarted Notification
[2021-07-14 19:47:17Z INFO JobNotification] Entering StartMonitor
[2021-07-14 19:47:21Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-14 19:47:21Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-07-14 19:47:21Z INFO ProcessInvokerWrapper] Finished process 2338 with exit code 134, and elapsed time 00:00:04.6299561.
[2021-07-14 19:47:21Z INFO JobDispatcher] Worker finished for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98. Code: 134
[2021-07-14 19:47:21Z INFO JobDispatcher] Return code 134 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
[2021-07-14 19:47:21Z INFO GitHubActionsService] Starting operation Location.GetConnectionData
[2021-07-14 19:47:22Z INFO GitHubActionsService] Finished operation Location.GetConnectionData
[2021-07-14 19:47:22Z INFO JobDispatcher] finish job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98 with result: Failed
[2021-07-14 19:47:22Z INFO Terminal] WRITE LINE: 2021-07-14 19:47:22Z: Job linux - arm32v7/debian:buster completed with result: Failed
[2021-07-14 19:47:22Z INFO JobDispatcher] Stop renew job request for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98.
[2021-07-14 19:47:22Z INFO JobDispatcher] job renew has been canceled, stop renew job request 31914.
[2021-07-14 19:47:22Z ERR JobDispatcher] Unhandled exception happened in worker:
[2021-07-14 19:47:22Z ERR JobDispatcher] Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at GitHub.Runner.Sdk.IOUtil+<>c__DisplayClass9_0.<DeleteDirectory>b__0(System.IO.FileSystemInfo)
[2021-07-14 19:47:22Z INFO JobNotification] Entering JobCompleted Notification
[2021-07-14 19:47:22Z INFO JobNotification] Entering EndMonitor
Here it says:
[2021-07-14 19:47:22Z ERR JobDispatcher] Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at GitHub.Runner.Sdk.IOUtil+<>c__DisplayClass9_0.<DeleteDirectory>b__0(System.IO.FileSystemInfo)
Looks like it tried to delete some directory and a System.AccessViolationException happened... hm...
Sometimes it fails with code 132:
[2021-07-14 20:00:53Z INFO JobDispatcher] Worker finished for job 5c6ded40-5d0a-567d-b3cc-38faeb055c98. Code: 132
[2021-07-14 20:00:53Z INFO JobDispatcher] Return code 132 indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.
And in these cases there is no information about the System.AccessViolationException.
The log from the post above (https://github.com/actions/runner/issues/1168#issuecomment-874038315) mentions exit code 139.
So, at the moment I have observed exit codes 132, 134 and 139. Or is it just a random number?
For me it looks like dotnet has a device-specific bug. I used this runner on my Raspberry Pi 4 arm32 / arm64 just fine without losing any jobs.
Or is it just a random number?
I think it's the Unix signal number + 128.
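That convention (exit status = 128 + signal number) matches the codes seen here. A small Go sketch of the decoding, purely as an illustration and not taken from the runner code:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// On Linux, shell-style exit codes above 128 mean "terminated by signal (code - 128)".
	// Decoding the exit codes observed in this issue:
	for _, code := range []int{132, 134, 139} {
		sig := syscall.Signal(code - 128)
		fmt.Printf("exit code %d -> signal %d (%s)\n", code, int(sig), sig)
	}
}

This prints SIGILL (4) for 132, SIGABRT (6) for 134 and SIGSEGV (11) for 139, so all three point at the worker process crashing at the native level rather than returning a normal error, which fits a memory-access fault like the one above.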
Not directly related to this issue...
In the meantime, if you want to try my reverse-engineered third-party runner github-actions-act-runner, it most likely doesn't have these System.AccessViolationException errors. It needs node in your PATH variable (it is based on act) if you aren't using a job container.
I'd be happy to use an alternative runner implementation if that allowed me to run my jobs.
I'm using job containers and I'm using composite actions; I'm not using docker actions.
So it looks like, unfortunately, your runner does not suit my needs at its current level of feature support :(.
I'm using job containers and I'm using composite actions; I'm not using docker actions.
I'm not quite sure if you understood the feature level correctly. Job containers and composite actions both work for me.
You can only use docker actions if you are using a job container
I understood this as: if I use a job container then I can use only docker actions.
Hehe, but now I see that you meant that docker actions only work with a job container.
I'll give it a try then!
I have quickly checked https://github.com/ChristopherHX/github-actions-act-runner and it definitely doesn't have this problem. Evaluating it to start using it on a daily basis.
Today I got similar behavior from hosted Linux runners:
As for the self-hosted runner problem... I moved to using https://github.com/ChristopherHX/github-act-runner and after some work together with the author it is now pretty usable; it can be installed from a deb repository for a variety of architectures and configured using a friendly command-line tool.
Describe the bug
I have a self-hosted runner on an odroid xu4 ARM single-board computer and it quite often does not start the job. I also have a self-hosted runner on a raspberry pi 3 and there is no such problem with it.
To Reproduce
Reproducibility: ~50%
Steps to reproduce the behavior:
odroid xu4
odroid xu4
The job remains in "Starting workflow run..." state for about 10 minutes, then it fails.
Note: The runner itself in that time is marked as "Active".
Looks like the runner picks up the job but does not start it, since when the raspberry pi 3 runner gets freed it does not pick up that stalled job, so it looks like the job is considered already picked up by odroid xu4.
Expected behavior
The runner on the odroid xu4 board should pick up the job and execute it.
Runner Version and Platform
Runner version: 2.278.0
OS: Armbian Buster (Linux)
What's not working?
The job is not started.
Job Log Output
Runner and Worker's Diagnostic Logs
Worker_20210622-121039-utc.log