danxmoran opened 2 years ago
Dockerfile for the custom runner image we use:
```dockerfile
# NOTE: This file originally based on the instructions from
# https://github.com/actions-runner-controller/actions-runner-controller/blob/a62882d243fd8a3cd7901f72b99beeb92cc4e629/runner/actions-runner.dockerfile
FROM 301643779712.dkr.ecr.us-east-1.amazonaws.com/base-python:latest

ARG DUMB_INIT_VERSION=1.2.5
RUN URL="https://github.com/Yelp/dumb-init/releases/download/v${DUMB_INIT_VERSION}/dumb-init_${DUMB_INIT_VERSION}_x86_64" \
  && curl -fLo /usr/local/bin/dumb-init "${URL}" \
  && chmod +x /usr/local/bin/dumb-init

RUN apt-install-packages.sh sudo

ARG DOCKER_CHANNEL=stable
ARG DOCKER_VERSION=20.10.12
RUN URL="https://download.docker.com/linux/static/${DOCKER_CHANNEL}/x86_64/docker-${DOCKER_VERSION}.tgz" \
  && curl -fLo docker.tgz "${URL}" \
  && tar zxvf docker.tgz \
  && install -o root -g root -m 755 docker/docker /usr/local/bin/docker \
  && rm -rf docker docker.tgz \
  && adduser --disabled-password --gecos "" --uid 1000 runner \
  && groupadd docker \
  && usermod -aG docker runner \
  && echo "%sudo ALL=(ALL:ALL) NOPASSWD:ALL" > /etc/sudoers

ENV HOME=/home/runner

ARG RUNNER_VERSION=2.292.0
ENV RUNNER_ASSETS_DIR=/runnertmp
# hadolint ignore=DL3003
RUN URL="https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
  && mkdir -p "${RUNNER_ASSETS_DIR}" \
  && cd "${RUNNER_ASSETS_DIR}" \
  && curl -fLo runner.tar.gz "${URL}" \
  && tar xzf ./runner.tar.gz \
  && ./bin/installdependencies.sh \
  && mv ./externals ./externalstmp

ENV RUNNER_TOOL_CACHE=/opt/hostedtoolcache
RUN mkdir "${RUNNER_TOOL_CACHE}" \
  && chgrp docker "${RUNNER_TOOL_CACHE}" \
  && chmod g+rwx "${RUNNER_TOOL_CACHE}"

RUN apt-install-packages.sh amazon-ecr-credential-helper \
  && mkdir -p "${HOME}/.docker" \
  && echo '{"credsStore": "ecr-login"}' > "${HOME}/.docker/config.json" \
  && chown -R runner "${HOME}/.docker"

COPY src/infra/docker/snippets/install-awscli.sh /usr/local/bin/
RUN install-awscli.sh

RUN apt-install-packages.sh software-properties-common \
  && add-apt-repository -yn ppa:git-core/ppa \
  && apt-install-packages.sh git-core

ARG ACTIONS_RUNNER_SHA=ac27df83017e4cca65c5a712a6745428c3aec32a
RUN URL_PREFIX="https://raw.githubusercontent.com/actions-runner-controller/actions-runner-controller/${ACTIONS_RUNNER_SHA}/runner" \
  && curl -fLo /usr/local/bin/entrypoint.sh "${URL_PREFIX}/entrypoint.sh" \
  && curl -fLo /usr/local/bin/logger.bash "${URL_PREFIX}/logger.bash" \
  && chmod 755 /usr/local/bin/entrypoint.sh

ENV PATH="${PATH}:${HOME}/.local/bin"
ENV ImageOS=ubuntu20
RUN echo "PATH=${PATH}" > /etc/environment \
  && echo "ImageOS=${ImageOS}" >> /etc/environment

RUN sed -iE 'p; s/^deb /deb-src /' /etc/apt/sources.list \
  && apt-get update \
  && apt-get -y build-dep bedtools \
  && apt-install-packages.sh \
    bedtools \
    build-essential \
    jq \
    libcairo2 \
    libcurl4-openssl-dev \
    libpangocairo-1.0-0 \
    libpq-dev \
    libssl-dev \
    libzbar0 \
    libz-dev \
    postgresql-client

RUN git config --system --add safe.directory /color

USER runner
ENV PANTS_PYTHON_BOOTSTRAP_SEARCH_PATH="['<PATH>']"
ENTRYPOINT ["/usr/local/bin/dumb-init", "--", "entrypoint.sh"]
```
The base image it builds on top of is essentially python:3.8, but built on Ubuntu instead of Debian. The entrypoint.sh used by the image above is pulled from actions-runner-controller/actions-runner-controller, specifically this file.
A bunch of `Runner has just been unregistered.` logs indicate that runner pods are unregistered before being deleted. So there might be issues in other places 🤔
Could you share one of the runner pod logs from before it gets terminated?
This is all that's logged:
```
2022-06-15 11:56:26.209 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2022-06-15 11:56:26.211 DEBUG --- Waiting until Docker is available or the timeout is reached
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
2022-06-15 11:56:28.297 DEBUG --- Github endpoint URL https://github.com/
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
2022-06-15 11:56:28.928 DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-06-15 11:56:28.930 DEBUG --- Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication
√ Connected to GitHub

# Runner Registration
√ Runner successfully added
√ Runner connection is good

# Runner settings
√ Settings Saved.

2022-06-15 11:56:32.112 DEBUG --- Runner successfully configured.
{
  "agentId": 319901,
  "agentName": "clr-runner-xcn9q-0",
  "poolId": 1,
  "poolName": "Default",
  "ephemeral": true,
  "serverUrl": "https://pipelines.actions.githubusercontent.com/BUIbrOiNvgDy4jIRueFVeJ8VNWJlFlXldijUpKvGCmIkAZfTyc",
  "gitHubUrl": "https://github.com/color",
  "workFolder": "/runner/_work"
}
√ Connected to GitHub

Current runner version: '2.293.0'
2022-06-15 11:56:33Z: Listening for Jobs
2022-06-15 12:01:55Z: Running job: pants-lint-main
```
Controller logs I see mentioning the Pod I was watching:
```
2022-06-15T12:03:35Z DEBUG actions-runner-controller.runnerpod Failed to unregister runner before deleting the pod. {"runnerpod": "actions-runner-groups/clr-runner-cdjkm-0", "error": "failed to remove runner: DELETE https://api.github.com/orgs/color/actions/runners/319900: 422 Bad request - Runner \"clr-runner-cdjkm-0\" is still running a job\" []"}
2022-06-15T12:03:37Z INFO actions-runner-controller.runnerpod Runner pod has been stopped with a successful status. {"runnerpod": "actions-runner-groups/clr-runner-cdjkm-0"}
```
@mumoshu is there anything else I can do to help get to the bottom of this?
@danxmoran I think we have all the info in place. Thanks for that. I'm still wondering how this can happen, as actions/runner is supposed to block SIGTERM sent by Kubernetes until it finishes running the ongoing workflow job. Clearly it isn't working like that, hence the issue, right? 🤔
I'd like to refer to my comment in the other thread, https://github.com/actions-runner-controller/actions-runner-controller/issues/1581#issuecomment-1181618360 — basically, the fact that the pod gets the termination signal is not the problem (I think). I would suspect that terminationGracePeriodSeconds is the setting we need here. We could raise it by default, or set it to the equivalent of the configured duration for the runner.
The default is terminationGracePeriodSeconds: 30 (in seconds) :trollface: Not a lot of time to finish a job :confused: That is the Kubernetes side of things; mumoshu is the expert on the ARC side of things :)
@genisd Ah, good catch!
I reread the codebase and I do see that terminationGracePeriodSeconds is available under the runner spec. So you might be able to do this:
```yaml
kind: RunnerDeployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 3600
```
But apparently, we never documented it 😢 At least I can't see anything in the README that refers to terminationGracePeriodSeconds.
Would you mind giving it a shot, and submitting a pull request to enhance the documentation if it works? In the simplest form, adding the setting to https://github.com/actions-runner-controller/actions-runner-controller#additional-tweaks would work. Or perhaps we might better add another section to the README for use-cases like using cluster-autoscaler or Karpenter, and mention terminationGracePeriodSeconds in it? 🤔
I would suggest we have some folks who run into this issue test it out. I don't know how long ARC would wait for a termination, or whether there's some kind of timeout on the ARC side of things. But thinking about it, ARC probably doesn't realize or care, since in this context it was not the party that initiated the termination.
I'll see if I can add it to the README. These are my last 2 working days before a prolonged leave, so I'm trying to wrap some things up :)
There was some previous discussion of terminationGracePeriodSeconds in https://github.com/actions-runner-controller/actions-runner-controller/issues/1534#issuecomment-1155858500, where @mumoshu said the problem is that the runner's entrypoint / the actual runner binary don't have any support for graceful handling of SIGTERM.
I just tried it myself, and at least in my setup the docker container immediately started shutting down (which is an issue, meaning we have to prevent k8s from sending unwanted terminations). I use a pod disruption budget of 0 to tell k8s not to mess with the pods. But in this context, the context of a rolling update, that does not solve it. I think for now there is no solution, right?
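For reference, the "pod disruption budget of 0" mentioned above could be sketched like this; the resource name and the pod label selector are assumptions, to be matched to your own RunnerDeployment's pod labels:

```yaml
# Sketch: forbid voluntary evictions (node drains, cluster-autoscaler
# scale-downs) of runner pods. The name and labels are hypothetical.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: runner-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: runner
```

Note that a PDB only guards against voluntary disruptions; as observed above, it does not stop a rolling update of the RunnerDeployment itself from deleting pods.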
```
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.551446221Z" level=info msg="Processing signal 'terminated'"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.645944749Z" level=info msg="shim disconnected" id=aebbe107d2acd61d7a922911476645fe1f9f8e9747843dd05357b0ba68d0ef98
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.646014019Z" level=warning msg="cleaning up after shim disconnected" id=aebbe107d2acd61d7a922911476645fe1f9f8e9747843dd05357b0ba68d0ef98 namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.646036299Z" level=info msg="cleaning up dead shim"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.646243979Z" level=info msg="ignoring event" container=aebbe107d2acd61d7a922911476645fe1f9f8e9747843dd05357b0ba68d0ef98 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:30.662603025Z" level=warning msg="cleanup warnings time=\"2022-07-12T12:50:30Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1597 runtime=io.containerd.runc.v2\n"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.296036247Z" level=info msg="Layer sha256:8a9132be168b5ef9b61214603e61860d03f476a6bb242484fbf24fd8bb5f3d62 cleaned up"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.473740118Z" level=info msg="Layer sha256:9ce9931e1799013b3619cba5edbbf38bfe8cd729d0d360f8292dd81f5e365087 cleaned up"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.507492281Z" level=error msg="attach failed with error: error attaching stdout stream: write tcp 127.0.0.1:2376->127.0.0.1:33116: write: broken pipe"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575819948Z" level=info msg="shim disconnected" id=fa7b151af51da8b3f01cd862b1e3743e25a479872cb7ffb217ba7f68781617b4
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575878988Z" level=warning msg="cleaning up after shim disconnected" id=fa7b151af51da8b3f01cd862b1e3743e25a479872cb7ffb217ba7f68781617b4 namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575889748Z" level=info msg="cleaning up dead shim"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.575931408Z" level=info msg="ignoring event" container=fa7b151af51da8b3f01cd862b1e3743e25a479872cb7ffb217ba7f68781617b4 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:32.583939261Z" level=warning msg="cleanup warnings time=\"2022-07-12T12:50:32Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1675 runtime=io.containerd.runc.v2\n"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.610351911Z" level=info msg="Container failed to exit within 10s of signal 15 - using the force" container=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655765069Z" level=info msg="shim disconnected" id=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655779980Z" level=info msg="ignoring event" container=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655823060Z" level=warning msg="cleaning up after shim disconnected" id=ef91d888bf2b26db23d3189da22762f4819bda8151205f5310ac58c01fa26d8a namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.655836660Z" level=info msg="cleaning up dead shim"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.661579962Z" level=warning msg="cleanup warnings time=\"2022-07-12T12:50:40Z\" level=info msg=\"starting signal loop\" namespace=moby pid=1700 runtime=io.containerd.runc.v2\n"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809223291Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809482791Z" level=info msg="Daemon shutdown complete"
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809503241Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:40.809527951Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
runner-2-9qnjx-2bhr6 docker time="2022-07-12T12:50:41.810103420Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///var/run/docker/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
```
So that would be an obstacle already. This doesn't say anything about the runner itself, though.
@danxmoran Hey!

> where @mumoshu said the problem is that the runner's entrypoint / the actual runner binary don't have any support for graceful handling of SIGTERM

That's my theory from our observation and not yet the fact. Did you actually see actions/runner doesn't block SIGTERM when it's still running a job? It should be testable only when you have configured a quite large terminationGracePeriodSeconds value.
> And at least in my setup the docker container immediately started shutting down (which is an issue, meaning we have to prevent k8s from sending unwanted terminations)

@genisd Good finding! We'd definitely need to prevent the docker container from terminating while the runner container is still running a workflow job. A possible way would be to add a preStop hook to the docker sidecar container, so that it runs `while sleep 1; do ...` until the runner container stops.

I'm still not sure actions/runner itself is handling SIGTERM correctly. But if it does handle it, the whole runner pod will gracefully stop after we add a preStop hook to the docker container.
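That preStop idea could be sketched roughly like this on the docker sidecar. The marker file `/runner/.finished` is an invented convention for illustration only; it is not something ARC or actions/runner actually writes, so a real implementation would need a reliable signal that the runner container is done:

```yaml
# Hypothetical sketch of a preStop hook on the "docker" sidecar:
# keep the Docker daemon alive until the runner signals it is done.
# /runner/.finished is an invented marker, not part of ARC.
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - while [ ! -f /runner/.finished ]; do sleep 1; done
```

The hook must complete within terminationGracePeriodSeconds, or the kubelet force-kills the container anyway.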
> @danxmoran Hey!
>
> > where @mumoshu said the problem is that the runner's entrypoint / the actual runner binary don't have any support for graceful handling of SIGTERM
>
> That's my theory from our observation and not yet the fact. Did you actually see actions/runner doesn't block SIGTERM when it's still running a job? It should be testable only when you have configured a quite large terminationGracePeriodSeconds value.
Seems like SIGTERM is either treated as a kill signal or converted to SIGINT: https://github.com/actions/runner/search?q=SIGTERM
Thanks! And probably this is where actions/runner handles SIGINT? 🤔
https://github.com/actions/runner/blob/1d1998aabb0b27746c126f0bc3fe3f815248632f/src/Runner.Listener/Runner.cs#L316
https://github.com/actions/runner/blob/1d1998aabb0b27746c126f0bc3fe3f815248632f/src/Runner.Common/HostContext.cs#L453
https://github.com/actions/runner/blob/408d6c579c36f0eb318acfdafdcbafc872696501/src/Runner.Worker/JobRunner.cs#L71-L88
https://github.com/actions/runner/blob/7e74f8c9d5bd60f83e2b69479de7a2fb468990fa/src/Runner.Worker/ExecutionContext.cs#L548
It does seem to "enqueue" the user-cancellation error onto the job context while job steps are still running, but it doesn't say anything about how those running steps are treated.
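For clarity, here is a minimal Python sketch (not actions/runner's actual code) of the SIGTERM-deferral behavior this thread expects from the runner: the handler only records the request, the in-flight job runs to completion, and a check between jobs decides whether to exit. `run_job` and `should_exit` are illustrative names, not real runner APIs:

```python
import signal

# Sketch only: defer SIGTERM until the current job finishes,
# instead of letting the default disposition kill the process mid-job.
stop_requested = False

def _handle_sigterm(signum, frame):
    # Record that termination was requested; don't exit yet.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def run_job(steps):
    # Run every step to completion, even if SIGTERM arrived mid-job.
    return [step() for step in steps]

def should_exit():
    # Checked between jobs; an ephemeral runner would deregister here.
    return stop_requested
```

This only works if terminationGracePeriodSeconds is long enough for the job to finish; otherwise the kubelet follows up with an unblockable SIGKILL.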
Hey everyone! I have an update: #1759 should fix this.
In contrast to RunnerDeployment, RunnerSet-managed runner pods don't have the same controller-side graceful termination logic; that doesn't change in #1759. However, you can now let the vanilla Kubernetes pod termination process correctly graceful-stop runners, by configuring RUNNER_GRACEFUL_STOP_TIMEOUT and terminationGracePeriodSeconds appropriately. There is more information in the updated README.
If you're interested in how it's supposed to work, please read the new section in the updated README, and also https://github.com/actions-runner-controller/actions-runner-controller/issues/1581#issuecomment-1229616193.
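Based on that description, the combination might look like this in a RunnerDeployment. The values are illustrative assumptions (the README referenced above is authoritative); the key point is that RUNNER_GRACEFUL_STOP_TIMEOUT should be comfortably smaller than terminationGracePeriodSeconds so the entrypoint can finish draining before the kubelet force-kills the pod:

```yaml
kind: RunnerDeployment
spec:
  template:
    spec:
      env:
        # Seconds the runner entrypoint waits for the in-flight job
        # before giving up. Value is illustrative.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "3500"
      # Must exceed RUNNER_GRACEFUL_STOP_TIMEOUT so Kubernetes doesn't
      # SIGKILL the pod while it is still draining.
      terminationGracePeriodSeconds: 3600
```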
Controller Version: 0.24.0
Helm Chart Version: 0.19.0
CertManager Version: 1.4.1
Deployment Method: Helm
cert-manager installation: Deploying the cert-manager Helm chart from https://charts.jetstack.io/

Checks
Resource Definitions
To Reproduce

Describe the bug
All runner pods are restarted within less than a minute of the kubectl apply, even those that are still running jobs. The in-flight jobs are dropped and appear hung in the GitHub UI (they eventually time out).

Describe the expected behavior
Pods aren't terminated until they've finished running in-flight jobs.

Controller Logs
Runner Pod Logs
Additional Context
No response