actions / runner

The Runner for GitHub Actions :rocket:
https://github.com/features/actions
MIT License
4.86k stars 955 forks source link

Self hosted runner in Linux container exits job prematurely #921

Open kabamawutschnik opened 3 years ago

kabamawutschnik commented 3 years ago

Describe the bug We are attempting to test our nodejs application on a self hosted runner in a linux container. Job is exiting (github runner is cancelling the job) prematurely, during a memory intensive portion. The step that we are exiting consistently in, is the setup bash script to install node, node modules, and mongodb.

To Reproduce Steps to reproduce the behavior:

  1. Setup up container using example Dockerfile
  2. Start workflow using example workflow (memory intensive work)
  3. Workflow Exits

Expected behavior To complete without exiting early.

Runner Version and Platform

2.275.1 - Ubuntu Container

What's not working?

Processes are killed, because oom_score_adj is not able to be edited by github runner.

Job Log Output

Setup Step Error: The operation was canceled. [debug]System.OperationCanceledException: The operation was canceled.

Runner and Worker's Diagnostic Logs

Worker Logs

Exiting Exception Caught cancellation exception from step: System.OperationCanceledException: The operation was canceled. at System.Threading.CancellationToken.ThrowOperationCanceledException() at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken) at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken) at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, CancellationToken cancellationToken) at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage stage) at GitHub.Runner.Worker.ActionRunner.RunAsync() at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)`

A bunch of these errors, but this is the last one (during the bash setup.sh script) Which: 'bash' [2021-01-19 00:21:54Z INFO ScriptHandler] Location: '/usr/bin/bash' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Starting process: [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] File name: '/usr/bin/bash' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Arguments: '-e /app/project/_work/_temp/4a28742a-a48e-4b22-9af0-403af0e37076.sh' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Working directory: '/app/project/_work/project/project' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Require exit code zero: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: '' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Persist current code page: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] High priority process: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 2506. [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] System.UnauthorizedAccessException: Access to the path '/proc/2506/oom_score_adj' is denied. ---> System.IO.IOException: Permission denied --- End of inner exception stack trace --- at System.IO.FileStream.WriteNative(ReadOnlySpan1 source) at System.IO.FileStream.FlushWriteBuffer() at System.IO.FileStream.Dispose(Boolean disposing) at System.IO.Stream.Close() at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing) at System.IO.StreamWriter.Dispose(Boolean disposing) at System.IO.TextWriter.Dispose() at System.IO.File.WriteAllText(String path, String contents) at GitHub.Runner.Sdk.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)`

Example Container Dockerfile `FROM ubuntu:latest

ARG DEBIAN_FRONTEND=noninteractive

ENV GITHUB_RUNNER_VERSION="2.275.1" ENV RUNNER_NAME "runner" ENV GITHUB_OWNER "owner" ENV RUNNER_WORKDIR "_work" ENV GITHUB_REPOSITORY "" ENV GITHUB_PAT ""

RUN apt-get update -y \ && apt-get upgrade -y \ && apt-get install -y \ curl \ sudo \ git \ jq \ systemctl \ tzdata \ mysql-client \ python3-pip \ && apt-get clean \ && rm -rf /var/lib/apt/lists/* \ && useradd -m github \ && usermod -aG sudo github \ && echo "%sudo ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers \ && touch /etc/sudoers.d/github \ && echo "github ALL = (ALL) NOPASSWD: ALL" >> /etc/sudoers.d/github

USER github

WORKDIR /app COPY --chown=github:github entrypoint.sh ./entrypoint.sh RUN sudo chown -R github /app RUN sudo chmod u+x ./entrypoint.sh

ENTRYPOINT [ "/app/entrypoint.sh" ]`

Example entrypoint.sh `#!/bin/sh

mkdir -p /app/${GITHUB_REPOSITORY} cd /app/${GITHUB_REPOSITORY} curl -Ls https://github.com/actions/runner/releases/download/v${GITHUB_RUNNER_VERSION}/actions-runner-linux-x64-${GITHUB_RUNNER_VERSION}.tar.gz | tar xz \ && sudo ./bin/installdependencies.sh

token_url="https://api.github.com/repos/${GITHUB_OWNER}/${GITHUB_REPOSITORY}/actions/runners/registration-token" registration_url="https://github.com/${GITHUB_OWNER}/${GITHUB_REPOSITORY}" echo "Requesting token at '${token_url}'"

payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url}) export RUNNER_TOKEN=$(echo "$payload" | jq .token --raw-output)

./config.sh \ --name "${GITHUB_REPOSITORY}" \ --token "${RUNNER_TOKEN}" \ --url "${registration_url}" \ --works "${RUNNER_WORKDIR}" \ --labels "${RUNNER_LABELS}" \ --unattended \ --replace

echo "adding actions service" sudo ./svc.sh install

echo "starting all services with actions." sudo systemctl start 'actions.'

tail -f /dev/null `

Example workflow.yml `name: nodejs test

on: push: branches:

jobs: Analysis: runs-on: self-hosted

env:
  CI: true
  USER: github

strategy:
  matrix:
    node-version: [ 10.19.0 ]
steps:
  - name: Ensure Dependencies
    run: |
      sudo apt update
      sudo apt install -y gcc
      sudo apt install -y make
      sudo apt install -y lsof
      sudo apt install -y curl
      sudo apt install -y psmisc
      sudo apt install -y node-gyp
  - name: Ensure Known Hosts
    run: |
      mkdir -p ~/.ssh/
      touch ~/.ssh/known_hosts
  - name: Pull Repo
    uses: actions/checkout@v2
  - name: Use Node
    uses: actions/setup-node@v1
    with:
      node-version: ${{ matrix.node-version }}
  - name: Run Setup.sh
    run: |
      export CI=true
      ./scripts/setup.sh
  - name: Run Tests
    run: sudo ./scripts/run_all_tests.sh
  - name: Extra Logs if Failed
    if: ${{ failure() }}
    run: |
      npm -v
      node -v
      npm config ls -l
      cat ~/.npm/_logs/*

`

Notes:

TingluoHuang commented 3 years ago

@kabamawutschnik have you check syslog to confirm the runner get killed by OOM Killer? grep -i kill /var/log/syslog

You can also manually modify the oom_score_adj for the runner process before it takes any jobs.

kabamawutschnik commented 3 years ago

@TingluoHuang I had to install rsyslog and start it (with a default conf), to start getting syslog. After doing so, I ran the workflow again and ran your grep command. I did not find any results.

I set the oom_score_adj for the github runner service, to -999. Re running the workflow again, resulted in the same job cancellation.

I forgot to note: The exact same workflow runs to completion in a Github hosted runner.

TingluoHuang commented 3 years ago

@kabamawutschnik do you mind share the full runner diag logs for both runner.listener and runner.worker?

The runner cancel jobs only when

kabamawutschnik commented 3 years ago

I believe these are the ones.

Worker_20210119-002009-utc.log Runner_20210119-002002-utc.log

TingluoHuang commented 3 years ago
[2021-01-19 00:21:54Z INFO Runner] Received Ctrl-C signal, stop Runner.Listener and Runner.Worker.
[2021-01-19 00:21:54Z INFO HostContext] Runner will be shutdown for UserCancelled

@kabamawutschnik something sends a Ctrl-C to the runner while it's running.

kabamawutschnik commented 3 years ago

@TingluoHuang Strange.. The exact same workflow runs fully on a github hosted runner. Are github hosted runners some how more resilient to kill commands?

Searching through setup.sh I found a location where we run killall mongod && killall mongos && killall -9 mongos. The job appears to end just after this portion of code.

kabamawutschnik commented 3 years ago

I did some testing.

I commented out a bunch of ./bin/mongo calls, and made it past the same portion of code that I was exiting on. Job is still exiting the same way, but seems to progress further.

Could this be a result of child process management?

TingluoHuang commented 3 years ago

@kabamawutschnik try to figure out what process sends the SIGINT?

https://unix.stackexchange.com/a/372581

kabamawutschnik commented 3 years ago

Attached to process 655 (worker process), & tracing all the child processes of 655. Looks like the worker sends SIGINT then SIGTERM.

Note: I notice that our github runner service dies after the exited job. I need to restart it with sudo systemctl start 'actions.servicename'

2671 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2676, si_uid=1000, si_status=0, si_utime=0, si_stime=0} --- 2679 +++ exited with 0 +++ 2666 --- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=655, si_uid=1000} --- 2680 +++ exited with 0 +++ 2678 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2679, si_uid=1000, si_status=0, si_utime=0, si_stime=0} --- 2678 +++ exited with 0 +++

.........

2716 +++ exited with 0 +++ 2712 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2716, si_uid=0, si_status=0, si_utime=2, si_stime=70} --- 2666 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=655, si_uid=1000} --- 2666 +++ killed by SIGTERM +++ 2646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=2666, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} --- 2720 +++ exited with 0 +++ 2719 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2720, si_uid=0, si_status=0, si_utime=358, si_stime=20} ---

strace.log

kabamawutschnik commented 3 years ago

Starting the runner as a foreground process, with run.sh, allowed the job to get past the issue. Not exactly sure why the service version is dying so easily. Perhaps a setting could be changed in the unit file?

MartinNowak commented 3 years ago

What's definitely confusing is that the job ends up marked as cancelled in GitHub Actions, even though runsvc is notified.

Apr 16 13:04:04 gh-runner-03 systemd[1]: gh-actions-runner.service: A process of this unit has been killed by the OOM killer.
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Shutting down runner listener
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Sending SIGINT to runner listener to stop
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Exiting...
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: 2021-04-16 11:04:08Z: Job run completed with result: Canceled
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: Runner listener exited with error code 0
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: Runner listener exit with 0 return code, stop the service, no retry needed.
Apr 16 13:04:08 gh-runner-03 systemd[1]: gh-actions-runner.service: Failed with result 'oom-kill'.
Apr 16 13:04:08 gh-runner-03 systemd[1]: gh-actions-runner.service: Consumed 4h 54min 39.841s CPU time.
Apr 16 13:04:13 gh-runner-03 systemd[1]: gh-actions-runner.service: Scheduled restart job, restart counter is at 5.
Apr 16 13:04:13 gh-runner-03 systemd[1]: Stopped GitHub Actions Runner.

Job run completed with result: Canceled

Should've been failure when a job is not canceled from GitHub Actions, but just killed/terminated on the server, wouldn't you agree? At lest to me job status is reported from the perspective of GH Actions, not from the underlying runners.