Open kabamawutschnik opened 3 years ago
@kabamawutschnik have you check syslog
to confirm the runner get killed by OOM Killer?
grep -i kill /var/log/syslog
You can also manually modify the oom_score_adj
for the runner process before it takes any jobs.
@TingluoHuang I had to install rsyslog and start it (with a default conf), to start getting syslog. After doing so, I ran the workflow again and ran your grep command. I did not find any results.
I set the oom_score_adj for the github runner service, to -999. Re running the workflow again, resulted in the same job cancellation.
I forgot to note: The exact same workflow runs to completion in a Github hosted runner.
@kabamawutschnik do you mind share the full runner diag logs for both runner.listener and runner.worker?
The runner cancel jobs only when
I believe these are the ones.
Worker_20210119-002009-utc.log Runner_20210119-002002-utc.log
[2021-01-19 00:21:54Z INFO Runner] Received Ctrl-C signal, stop Runner.Listener and Runner.Worker.
[2021-01-19 00:21:54Z INFO HostContext] Runner will be shutdown for UserCancelled
@kabamawutschnik something sends a Ctrl-C to the runner while it's running.
@TingluoHuang Strange.. The exact same workflow runs fully on a github hosted runner. Are github hosted runners some how more resilient to kill commands?
Searching through setup.sh I found a location where we run killall mongod && killall mongos && killall -9 mongos
. The job appears to end just after this portion of code.
I did some testing.
I commented out a bunch of ./bin/mongo
calls, and made it past the same portion of code that I was exiting on. Job is still exiting the same way, but seems to progress further.
Could this be a result of child process management?
@kabamawutschnik try to figure out what process sends the SIGINT?
Attached to process 655 (worker process), & tracing all the child processes of 655. Looks like the worker sends SIGINT then SIGTERM.
Note: I notice that our github runner service dies after the exited job. I need to restart it with sudo systemctl start 'actions.servicename'
2671 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2676, si_uid=1000, si_status=0, si_utime=0, si_stime=0} --- 2679 +++ exited with 0 +++ 2666 --- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=655, si_uid=1000} --- 2680 +++ exited with 0 +++ 2678 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2679, si_uid=1000, si_status=0, si_utime=0, si_stime=0} --- 2678 +++ exited with 0 +++
.........
2716 +++ exited with 0 +++ 2712 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2716, si_uid=0, si_status=0, si_utime=2, si_stime=70} --- 2666 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=655, si_uid=1000} --- 2666 +++ killed by SIGTERM +++ 2646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=2666, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} --- 2720 +++ exited with 0 +++ 2719 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2720, si_uid=0, si_status=0, si_utime=358, si_stime=20} ---
Starting the runner as a foreground process, with run.sh, allowed the job to get past the issue. Not exactly sure why the service version is dying so easily. Perhaps a setting could be changed in the unit file?
What's definitely confusing is that the job ends up marked as cancelled in GitHub Actions, even though runsvc is notified.
Apr 16 13:04:04 gh-runner-03 systemd[1]: gh-actions-runner.service: A process of this unit has been killed by the OOM killer.
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Shutting down runner listener
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Sending SIGINT to runner listener to stop
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Exiting...
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: 2021-04-16 11:04:08Z: Job run completed with result: Canceled
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: Runner listener exited with error code 0
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: Runner listener exit with 0 return code, stop the service, no retry needed.
Apr 16 13:04:08 gh-runner-03 systemd[1]: gh-actions-runner.service: Failed with result 'oom-kill'.
Apr 16 13:04:08 gh-runner-03 systemd[1]: gh-actions-runner.service: Consumed 4h 54min 39.841s CPU time.
Apr 16 13:04:13 gh-runner-03 systemd[1]: gh-actions-runner.service: Scheduled restart job, restart counter is at 5.
Apr 16 13:04:13 gh-runner-03 systemd[1]: Stopped GitHub Actions Runner.
Job run completed with result: Canceled
Should've been failure when a job is not canceled from GitHub Actions, but just killed/terminated on the server, wouldn't you agree? At lest to me job status is reported from the perspective of GH Actions, not from the underlying runners.
Describe the bug We are attempting to test our nodejs application on a self hosted runner in a linux container. Job is exiting (github runner is cancelling the job) prematurely, during a memory intensive portion. The step that we are exiting consistently in, is the setup bash script to install node, node modules, and mongodb.
To Reproduce Steps to reproduce the behavior:
Expected behavior To complete without exiting early.
Runner Version and Platform
2.275.1 - Ubuntu Container
What's not working?
Processes are killed, because oom_score_adj is not able to be edited by github runner.
Job Log Output
Setup Step
Error: The operation was canceled. [debug]System.OperationCanceledException: The operation was canceled.
Runner and Worker's Diagnostic Logs
Worker Logs
Exiting Exception
Caught cancellation exception from step: System.OperationCanceledException: The operation was canceled. at System.Threading.CancellationToken.ThrowOperationCanceledException() at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary
2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken) at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary
2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken) at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary
2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, CancellationToken cancellationToken) at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage stage) at GitHub.Runner.Worker.ActionRunner.RunAsync() at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)`A bunch of these errors, but this is the last one (during the bash setup.sh script)
Which: 'bash' [2021-01-19 00:21:54Z INFO ScriptHandler] Location: '/usr/bin/bash' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Starting process: [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] File name: '/usr/bin/bash' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Arguments: '-e /app/project/_work/_temp/4a28742a-a48e-4b22-9af0-403af0e37076.sh' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Working directory: '/app/project/_work/project/project' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Require exit code zero: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: '' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Persist current code page: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] High priority process: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 2506. [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] System.UnauthorizedAccessException: Access to the path '/proc/2506/oom_score_adj' is denied. ---> System.IO.IOException: Permission denied --- End of inner exception stack trace --- at System.IO.FileStream.WriteNative(ReadOnlySpan
1 source) at System.IO.FileStream.FlushWriteBuffer() at System.IO.FileStream.Dispose(Boolean disposing) at System.IO.Stream.Close() at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing) at System.IO.StreamWriter.Dispose(Boolean disposing) at System.IO.TextWriter.Dispose() at System.IO.File.WriteAllText(String path, String contents) at GitHub.Runner.Sdk.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)`Example Container Dockerfile `FROM ubuntu:latest
ARG DEBIAN_FRONTEND=noninteractive
ENV GITHUB_RUNNER_VERSION="2.275.1" ENV RUNNER_NAME "runner" ENV GITHUB_OWNER "owner" ENV RUNNER_WORKDIR "_work" ENV GITHUB_REPOSITORY "" ENV GITHUB_PAT ""
RUN apt-get update -y \ && apt-get upgrade -y \ && apt-get install -y \ curl \ sudo \ git \ jq \ systemctl \ tzdata \ mysql-client \ python3-pip \ && apt-get clean \ && rm -rf /var/lib/apt/lists/* \ && useradd -m github \ && usermod -aG sudo github \ && echo "%sudo ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers \ && touch /etc/sudoers.d/github \ && echo "github ALL = (ALL) NOPASSWD: ALL" >> /etc/sudoers.d/github
USER github
WORKDIR /app COPY --chown=github:github entrypoint.sh ./entrypoint.sh RUN sudo chown -R github /app RUN sudo chmod u+x ./entrypoint.sh
ENTRYPOINT [ "/app/entrypoint.sh" ]`
Example entrypoint.sh `#!/bin/sh
mkdir -p /app/${GITHUB_REPOSITORY} cd /app/${GITHUB_REPOSITORY} curl -Ls https://github.com/actions/runner/releases/download/v${GITHUB_RUNNER_VERSION}/actions-runner-linux-x64-${GITHUB_RUNNER_VERSION}.tar.gz | tar xz \ && sudo ./bin/installdependencies.sh
token_url="https://api.github.com/repos/${GITHUB_OWNER}/${GITHUB_REPOSITORY}/actions/runners/registration-token" registration_url="https://github.com/${GITHUB_OWNER}/${GITHUB_REPOSITORY}" echo "Requesting token at '${token_url}'"
payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url}) export RUNNER_TOKEN=$(echo "$payload" | jq .token --raw-output)
./config.sh \ --name "${GITHUB_REPOSITORY}" \ --token "${RUNNER_TOKEN}" \ --url "${registration_url}" \ --works "${RUNNER_WORKDIR}" \ --labels "${RUNNER_LABELS}" \ --unattended \ --replace
echo "adding actions service" sudo ./svc.sh install
echo "starting all services with actions." sudo systemctl start 'actions.'
tail -f /dev/null `
Example workflow.yml `name: nodejs test
on: push: branches:
jobs: Analysis: runs-on: self-hosted
`
Notes: