actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Pods created by `RunnerSet` are terminated while still running jobs on a rolling update #1535

Open danxmoran opened 2 years ago

danxmoran commented 2 years ago

Controller Version

0.24.0

Helm Chart Version

0.19.0

CertManager Version

1.4.1

Deployment Method

Helm

cert-manager installation

Deploying cert-manager Helm chart from https://charts.jetstack.io/

Checks

Resource Definitions

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: clr-runner
  namespace: actions-runner-groups
spec:
  dockerdWithinRunnerContainer: false
  ephemeral: true
  labels:
  - clr-runner
  organization: color
  replicas: 1
  selector:
    matchLabels:
      app: clr-runner
  serviceName: clr-runner
  template:
    metadata:
      labels:
        app: clr-runner
    spec:
      containers:
      - image: 301643779712.dkr.ecr.us-east-1.amazonaws.com/color-actions-runner:master_c704c14c
        name: runner
        resources:
          limits:
            cpu: 8
            memory: 32Gi
          requests:
            cpu: 8
            memory: 32Gi
      - image: public.ecr.aws/docker/library/docker:dind
        name: docker
      securityContext:
        fsGroup: 1000
      serviceAccountName: actions-runner
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: clr-runner
  namespace: actions-runner-groups
spec:
  maxReplicas: 30
  metrics:
  - scaleDownFactor: "0.7"
    scaleDownThreshold: "0.2"
    scaleUpFactor: "2.5"
    scaleUpThreshold: "0.5"
    type: PercentageRunnersBusy
  minReplicas: 1
  scaleDownDelaySecondsAfterScaleOut: 3600
  scaleTargetRef:
    kind: RunnerSet
    name: clr-runner

To Reproduce

1. Change the `image` for the `runner` container in the manifest above
2. Launch a workflow that starts running some jobs on the runners
3. `kubectl apply` it to update the runners

Describe the bug

All runner pods are restarted within a minute of the `kubectl apply`, even those that are still running jobs. The in-flight jobs are dropped and appear hung in the GitHub UI (they eventually time out).

Describe the expected behavior

Pods aren't terminated until they've finished running their in-flight jobs.

Controller Logs

Will post in separate comment to avoid #1533

Runner Pod Logs

N/A (pods are deleted)

Additional Context

No response

danxmoran commented 2 years ago

Controller logs: arc.log

danxmoran commented 2 years ago

Dockerfile for the custom runner image we use:

# NOTE: This file originally based on the instructions from
# https://github.com/actions-runner-controller/actions-runner-controller/blob/a62882d243fd8a3cd7901f72b99beeb92cc4e629/runner/actions-runner.dockerfile

FROM 301643779712.dkr.ecr.us-east-1.amazonaws.com/base-python:latest

ARG DUMB_INIT_VERSION=1.2.5
RUN URL="https://github.com/Yelp/dumb-init/releases/download/v${DUMB_INIT_VERSION}/dumb-init_${DUMB_INIT_VERSION}_x86_64" \
    && curl -fLo /usr/local/bin/dumb-init "${URL}" \
    && chmod +x /usr/local/bin/dumb-init

RUN apt-install-packages.sh sudo

ARG DOCKER_CHANNEL=stable
ARG DOCKER_VERSION=20.10.12
RUN URL="https://download.docker.com/linux/static/${DOCKER_CHANNEL}/x86_64/docker-${DOCKER_VERSION}.tgz" \
    && curl -fLo docker.tgz "${URL}" \
    && tar zxvf docker.tgz \
    && install -o root -g root -m 755 docker/docker /usr/local/bin/docker \
    && rm -rf docker docker.tgz \
    && adduser --disabled-password --gecos "" --uid 1000 runner \
    && groupadd docker \
    && usermod -aG docker runner \
    && echo "%sudo   ALL=(ALL:ALL) NOPASSWD:ALL" > /etc/sudoers

ENV HOME=/home/runner
ARG RUNNER_VERSION=2.292.0
ENV RUNNER_ASSETS_DIR=/runnertmp
# hadolint ignore=DL3003
RUN URL="https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
    && mkdir -p "${RUNNER_ASSETS_DIR}" \
    && cd "${RUNNER_ASSETS_DIR}" \
    && curl -fLo runner.tar.gz "${URL}" \
    && tar xzf ./runner.tar.gz \
    && ./bin/installdependencies.sh \
    && mv ./externals ./externalstmp

ENV RUNNER_TOOL_CACHE=/opt/hostedtoolcache
RUN mkdir "${RUNNER_TOOL_CACHE}" \
    && chgrp docker "${RUNNER_TOOL_CACHE}" \
    && chmod g+rwx "${RUNNER_TOOL_CACHE}"

RUN apt-install-packages.sh amazon-ecr-credential-helper \
    && mkdir -p "${HOME}/.docker" \
    && echo '{"credsStore": "ecr-login"}' > "${HOME}/.docker/config.json" \
    && chown -R runner "${HOME}/.docker"

COPY src/infra/docker/snippets/install-awscli.sh /usr/local/bin/
RUN install-awscli.sh

RUN apt-install-packages.sh software-properties-common \
    && add-apt-repository -yn ppa:git-core/ppa \
    && apt-install-packages.sh git-core

ARG ACTIONS_RUNNER_SHA=ac27df83017e4cca65c5a712a6745428c3aec32a
RUN URL_PREFIX="https://raw.githubusercontent.com/actions-runner-controller/actions-runner-controller/${ACTIONS_RUNNER_SHA}/runner" \
    && curl -fLo /usr/local/bin/entrypoint.sh "${URL_PREFIX}/entrypoint.sh" \
    && curl -fLo /usr/local/bin/logger.bash "${URL_PREFIX}/logger.bash" \
    && chmod 755 /usr/local/bin/entrypoint.sh

ENV PATH="${PATH}:${HOME}/.local/bin"
ENV ImageOS=ubuntu20
RUN echo "PATH=${PATH}" > /etc/environment \
    && echo "ImageOS=${ImageOS}" >> /etc/environment

RUN sed -iE 'p; s/^deb /deb-src /' /etc/apt/sources.list \
    && apt-get update \
    && apt-get -y build-dep bedtools \
    && apt-install-packages.sh \
        bedtools \
        build-essential \
        jq \
        libcairo2 \
        libcurl4-openssl-dev \
        libpangocairo-1.0-0 \
        libpq-dev \
        libssl-dev \
        libzbar0 \
        libz-dev \
        postgresql-client

RUN git config --system --add safe.directory /color

USER runner

ENV PANTS_PYTHON_BOOTSTRAP_SEARCH_PATH="['<PATH>']"

ENTRYPOINT ["/usr/local/bin/dumb-init", "--", "entrypoint.sh"]

The base image it builds on top of is essentially `python:3.8`, but built on Ubuntu instead of Debian.

danxmoran commented 2 years ago

The entrypoint.sh used by the image above is pulled from actions-runner-controller/actions-runner-controller, specifically this file.

mumoshu commented 2 years ago

The `A bunch of Runner has just been unregistered.` logs indicate that runner pods are unregistered before being deleted, so there might be an issue somewhere else 🤔 Could you share the logs of one of the runner pods from before it gets terminated?

danxmoran commented 2 years ago

This is all that's logged:

2022-06-15 11:56:26.209  DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2022-06-15 11:56:26.211  DEBUG --- Waiting until Docker is available or the timeout is reached
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
2022-06-15 11:56:28.297  DEBUG --- Github endpoint URL https://github.com/
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
2022-06-15 11:56:28.928  DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-06-15 11:56:28.930  DEBUG --- Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication

√ Connected to GitHub

# Runner Registration

√ Runner successfully added
√ Runner connection is good

# Runner settings

√ Settings Saved.

2022-06-15 11:56:32.112  DEBUG --- Runner successfully configured.
{
  "agentId": 319901,
  "agentName": "clr-runner-xcn9q-0",
  "poolId": 1,
  "poolName": "Default",
  "ephemeral": true,
  "serverUrl": "https://pipelines.actions.githubusercontent.com/BUIbrOiNvgDy4jIRueFVeJ8VNWJlFlXldijUpKvGCmIkAZfTyc",
  "gitHubUrl": "https://github.com/color",
  "workFolder": "/runner/_work"
}
√ Connected to GitHub

Current runner version: '2.293.0'
2022-06-15 11:56:33Z: Listening for Jobs
2022-06-15 12:01:55Z: Running job: pants-lint-main

Controller logs I see mentioning the Pod I was watching:

2022-06-15T12:03:35Z    DEBUG   actions-runner-controller.runnerpod     Failed to unregister runner before deleting the pod.    {"runnerpod": "actions-runner-groups/clr-runner-cdjkm-0", "error": "failed to remove runner: DELETE https://api.github.com/orgs/color/actions/runners/319900: 422 Bad request - Runner \"clr-runner-cdjkm-0\" is still running a job\" []"}
2022-06-15T12:03:37Z    INFO    actions-runner-controller.runnerpod     Runner pod has been stopped with a successful status.   {"runnerpod": "actions-runner-groups/clr-runner-cdjkm-0"}

danxmoran commented 2 years ago

@mumoshu is there anything else I can do to help get to the bottom of this?

mumoshu commented 2 years ago

@danxmoran I think we have all the info in place, thanks for that. I'm still wondering how this can happen, as actions/runner is supposed to block the SIGTERM sent by Kubernetes until it finishes running the ongoing workflow job. Clearly it isn't working like that, hence this issue, right? 🤔

genisd commented 2 years ago

I'd like to refer to my comment in the other thread https://github.com/actions-runner-controller/actions-runner-controller/issues/1581#issuecomment-1181618360

Basically, the pod receiving the termination signal is not the problem (I think). I suspect `terminationGracePeriodSeconds` is the setting we need here. We could raise the default, or set it to match the duration configured for the runner.

genisd commented 2 years ago

The default is `terminationGracePeriodSeconds: 30` (in seconds) :trollface: Not a lot of time to finish a job :confused:

That's the Kubernetes side of things; mumoshu is the expert on the ARC side :)

mumoshu commented 2 years ago

@genisd Ah, good catch!

I reread the codebase and I do see terminationGracePeriodSeconds is available under the runner spec. So you might be able to do this:

kind: RunnerDeployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 3600

But apparently, we never documented it 😢 At least I can't find anything in the README that refers to `terminationGracePeriodSeconds`.

Would you mind giving it a shot, and submitting a pull request to enhance the documentation if it works?

In the simplest form, adding the setting to https://github.com/actions-runner-controller/actions-runner-controller#additional-tweaks would work.

Or perhaps we might better add another section in README for use-cases like using cluster-autoscaler or karpenter and mention terminationGracePeriodSeconds in it? 🤔

genisd commented 2 years ago

I would suggest having some folks who ran into this issue test it out. I don't know how long ARC would wait for a termination, or whether there's any kind of timeout on the ARC side. But thinking about it, ARC probably doesn't notice or care, since in this context it wasn't the party that initiated the termination.

I'll see if I can add it to the readme. It's my last 2 working days before prolonged leave so I'm trying to wrap some things up :)

danxmoran commented 2 years ago

There was some previous discussion of terminationGracePeriodSeconds in https://github.com/actions-runner-controller/actions-runner-controller/issues/1534#issuecomment-1155858500, where @mumoshu said the problem is that the runner's entrypoint / the actual runner binary don't have any support for graceful handling of SIGTERM.

genisd commented 2 years ago

I just tried it myself, and at least in my setup the docker container immediately started shutting down (which is an issue, meaning we have to prevent k8s from sending unwanted terminations). I use a pod disruption budget of 0 to tell k8s not to mess with the pods.

But in the context of a rolling update, that does not solve it. I think for now there is no solution, right?

runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.551446221Z" level=info msg="Processing signal 'terminated'"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.645944749Z" level=info msg="shim disconnected" id=aebbe107d2acd61d7a922911476645fe1f9f8e9747843dd05357b0ba68d0ef98
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.646014019Z" level=warning msg="cleaning up after shim disconnected" id=aebbe107d2acd61d7a922911476645fe1f9f8e9747843dd05357b0ba68d0ef98 namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.646036299Z" level=info msg="cleaning up dead shim"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.646243979Z" level=info msg="ignoring event" container=aebbe107d2acd61d7a922911476645fe1f9f8e9747843dd05357b0ba68d0ef98 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.662603025Z" level=warning msg="cleanup warnings time=\"2022-07-12T12:50:30Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1597 runtime=io.containerd.runc.v2\n"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.296036247Z" level=info msg="Layer sha256:8a9132be168b5ef9b61214603e61860d03f476a6bb242484fbf24fd8bb5f3d62 cleaned up"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.473740118Z" level=info msg="Layer sha256:9ce9931e1799013b3619cba5edbbf38bfe8cd729d0d360f8292dd81f5e365087 cleaned up"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.507492281Z" level=error msg="attach failed with error: error attaching stdout stream: write tcp 127.0.0.1:2376->127.0.0.1:33116: write: broken pipe"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575819948Z" level=info msg="shim disconnected" id=fa7b151af51da8b3f01cd862b1e3743e25a479872cb7ffb217ba7f68781617b4
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575878988Z" level=warning msg="cleaning up after shim disconnected" id=fa7b151af51da8b3f01cd862b1e3743e25a479872cb7ffb217ba7f68781617b4 namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575889748Z" level=info msg="cleaning up dead shim"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575931408Z" level=info msg="ignoring event" container=fa7b151af51da8b3f01cd862b1e3743e25a479872cb7ffb217ba7f68781617b4 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.583939261Z" level=warning msg="cleanup warnings time=\"2022-07-12T12:50:32Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1675 runtime=io.containerd.runc.v2\n"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.610351911Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655765069Z" level=info msg="shim disconnected" id=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655779980Z" level=info msg="ignoring event" container=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655823060Z" level=warning msg="cleaning up after shim disconnected" id=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655836660Z" level=info msg="cleaning up dead shim"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.661579962Z" level=warning msg="cleanup warnings time=\"2022-07-12T12:50:40Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1700 runtime=io.containerd.runc.v2\n"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809223291Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809482791Z" level=info msg="Daemon shutdown complete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809503241Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809527951Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:41.810103420Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
- runner-2-9qnjx-2bhr6
- runner-2-9qnjx-2bhr6

So that would already be an obstacle, and it doesn't say anything about the runner itself.
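For reference, the "pod disruption budget of 0" approach mentioned above could look like this minimal sketch (the name and the `app: clr-runner` selector are borrowed from this issue's RunnerSet; adjust them to your own labels):

```yaml
# Sketch only: blocks voluntary evictions (e.g. node drains) of runner pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clr-runner
  namespace: actions-runner-groups
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: clr-runner
```

Note that a PodDisruptionBudget only guards against voluntary evictions through the eviction API, such as `kubectl drain`; it does not block a StatefulSet rolling update, which is consistent with the observation above that it doesn't solve this issue.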

mumoshu commented 2 years ago

@danxmoran Hey!

where @mumoshu said the problem is that the runner's entrypoint / the actual runner binary don't have any support for graceful handling of SIGTERM

That's my theory based on our observations, not yet established fact. Did you actually see that actions/runner doesn't block SIGTERM while it's still running a job? It should only be testable once you have configured a quite large `terminationGracePeriodSeconds` value.

mumoshu commented 2 years ago

And at least in my setup the docker container immediately started shutting down (which is an issue, meaning we have to prevent k8s from sending unwanted terminations)

@genisd Good find! We'd definitely need to prevent the docker container from terminating while the runner container is still running a workflow job. A possible way would be to add a preStop hook to the docker sidecar container, so that it runs `while sleep 1; do ...; done` until the runner container stops.

I'm still not sure actions/runner itself handles SIGTERM correctly. But if it does, the whole runner pod will stop gracefully once we add a preStop hook to the docker container.
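The preStop idea could be sketched roughly like this (illustrative only, not existing ARC behavior; the `/runner/.done` sentinel file, and the assumption that the runner container would create it on exit, are hypothetical):

```yaml
# Sketch of a preStop hook on the docker sidecar from this issue's RunnerSet.
spec:
  template:
    spec:
      containers:
      - name: docker
        image: public.ecr.aws/docker/library/docker:dind
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              # Keep dockerd alive until the runner signals it is finished
              # (the /runner/.done marker is an invented example, shared via
              # a volume mounted into both containers).
              - while [ ! -f /runner/.done ]; do sleep 1; done
```

The effect would be that the kubelet delays SIGTERM to dockerd until the hook returns (or the grace period expires), so in-flight job containers keep their Docker daemon.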

dan0dbfe commented 2 years ago

@danxmoran Hey!

where @mumoshu said the problem is that the runner's entrypoint / the actual runner binary don't have any support for graceful handling of SIGTERM

That's my theory based on our observations, not yet established fact. Did you actually see that actions/runner doesn't block SIGTERM while it's still running a job? It should only be testable once you have configured a quite large `terminationGracePeriodSeconds` value.

It seems like SIGTERM is either treated as a kill signal or converted to SIGINT: https://github.com/actions/runner/search?q=SIGTERM

mumoshu commented 2 years ago

Thanks! And probably this is where actions/runner handles SIGINT? 🤔 https://github.com/actions/runner/blob/1d1998aabb0b27746c126f0bc3fe3f815248632f/src/Runner.Listener/Runner.cs#L316 https://github.com/actions/runner/blob/1d1998aabb0b27746c126f0bc3fe3f815248632f/src/Runner.Common/HostContext.cs#L453 https://github.com/actions/runner/blob/408d6c579c36f0eb318acfdafdcbafc872696501/src/Runner.Worker/JobRunner.cs#L71-L88 https://github.com/actions/runner/blob/7e74f8c9d5bd60f83e2b69479de7a2fb468990fa/src/Runner.Worker/ExecutionContext.cs#L548

It does seem to "enqueue" the user-cancellation error onto the job context that is still running job steps, but it doesn't seem to say anything about how the running job steps are treated.
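For illustration only (this is a toy sketch, not the actual actions/runner code), the graceful pattern under discussion, where the process remembers the SIGTERM instead of dying and lets the in-flight job steps run to completion, looks like this in shell:

```shell
#!/bin/sh
# Toy sketch: absorb SIGTERM and finish the in-flight "job" before exiting.
stop_requested=0
trap 'stop_requested=1' TERM   # remember the stop request instead of dying

kill -TERM $$                  # simulate Kubernetes sending SIGTERM

for step in checkout build test; do
  echo "step: $step"           # every step still runs to completion
done
echo "stop_requested=$stop_requested"
```

A runner following this pattern would exit (and unregister) only after the current job finishes, which is what the large `terminationGracePeriodSeconds` is meant to leave room for.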

mumoshu commented 2 years ago

Hey everyone! I have an update: #1759 should fix this.

In contrast to RunnerDeployment, RunnerSet-managed runner pods don't have the same controller-side graceful termination logic. That doesn't change in #1759.

However, you can now let the vanilla Kubernetes pod termination process gracefully stop runners. Configure `RUNNER_GRACEFUL_STOP_TIMEOUT` and `terminationGracePeriodSeconds` appropriately. More information is in the updated README.
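Applied to the RunnerSet from this issue, the two settings could be sketched as follows (the 3600/3500 values are illustrative, not recommendations; the graceful-stop timeout is kept slightly shorter than the grace period so the runner can finish unregistering before the kubelet's deadline):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: clr-runner
  namespace: actions-runner-groups
spec:
  template:
    spec:
      # Give the kubelet up to 1 hour before force-killing the pod
      terminationGracePeriodSeconds: 3600
      containers:
      - name: runner
        env:
        # Let the runner wait for its in-flight job, staying under
        # terminationGracePeriodSeconds
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "3500"
```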

If you're interested in how it's supposed to work, please read the new section in the updated README, and also https://github.com/actions-runner-controller/actions-runner-controller/issues/1581#issuecomment-1229616193.