@tbomberg Would you mind sharing the output of kubectl get po $POD for any runner pod that is managed by a RunnerSet in your cluster? I think it's missing the said envvar and that's causing it. We need to figure out why the runner pod ends up in such a state, because any StatefulSet created by RunnerSet should have a pod template that sets a RUNNER_SET envvar, hence any runner pod managed by it should have the RUNNER_SET envvar too, thanks to ARC's mutating webhook.
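For anyone following along, a minimal sketch of gathering that output (namespace and pod name are placeholders, not values from this issue):

# Full pod spec, or just the env section of the runner container.
kubectl get po $POD -n <runnerset-namespace> -o yaml
kubectl get po $POD -n <runnerset-namespace> \
  -o jsonpath='{range .spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}'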
@mumoshu Yes, that is the actual difference. While the StatefulSets created from the 0.22.0 controller look the same as the ones from the 0.22.1 controller (the env vars RUNNER_NAME and RUNNER_TOKEN are in neither of the STS pod templates), when I compare the Pods I can see that the needed env vars are injected into the Pods only when the 0.22.0 controller is active.
This is the complete environment section of the Pod from the 0.22.1 controller:
containers:
- env:
  - name: RUNNER_ORG
    value: our-organization
  - name: RUNNER_REPO
  - name: RUNNER_ENTERPRISE
  - name: RUNNER_LABELS
    value: vpc-devservices-runnerset
  - name: RUNNER_GROUP
  - name: DOCKER_ENABLED
    value: "true"
  - name: DOCKERD_IN_RUNNER
    value: "false"
  - name: GITHUB_URL
    value: https://github.com/
  - name: RUNNER_WORKDIR
    value: /runner/_work
  - name: RUNNER_EPHEMERAL
    value: "false"
  - name: DOCKER_HOST
    value: tcp://localhost:2376
  - name: DOCKER_TLS_VERIFY
    value: "1"
  - name: DOCKER_CERT_PATH
    value: /certs/client
  - name: RUNNER_FEATURE_FLAG_EPHEMERAL
    value: "true"
It looks like the controller used to inject these vars directly into the pods, not via the StatefulSet template, but now this injection is not happening or is incomplete.
I will attach the complete manifests for both StatefulSet and Pod as files: runner-sts-pod-0.22.0.yaml.txt, runner-sts-pod-0.22.1.yaml.txt, runner-sts-0.22.0.yaml.txt, runner-sts-0.22.1.yaml.txt
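A minimal sketch of the comparison described above, with hypothetical resource names (substitute your own StatefulSet, Pod, and namespace):

# Env vars defined in the StatefulSet pod template (names are hypothetical).
kubectl -n asys-vpc-github-runner get sts my-runnerset-abcde \
  -o jsonpath='{.spec.template.spec.containers[0].env[*].name}'
# Env vars actually present on the running pod; with 0.22.0 the webhook-injected
# vars such as RUNNER_NAME show up here even though they are absent above.
kubectl -n asys-vpc-github-runner get po my-runnerset-abcde-0 \
  -o jsonpath='{.spec.containers[0].env[*].name}'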
@tbomberg Hey! Thanks a lot for your detailed report. It did help investigate it fully.
So, this seems to be happening due to a "fix" in the chart. You'd need to update your values.yaml to accommodate it.
Here's the excerpt from your helm chart values:
scope:
  singleNamespace: true
Until chart v0.17.0, there was a bug in the chart where multiple instances of ARC interfered with each other on mutating and validating admission webhooks, even if you specified scope.singleNamespace. In v0.17.1, we fixed it by including watchNamespace in the namespaceSelectors of the admission webhook configs.
Do you actually have multiple instances of ARC on your cluster? Do you really need to restrict ARC's watch namespace? If not, you can just omit scope.singleNamespace from your values.yaml so that everything starts working again.
Otherwise, add scope.watchNamespace to your values.yaml. Assuming the only namespace that contains RunnerSets (hence StatefulSets and runner pods) is asys-vpc-github-runner, it should look like:
scope:
  singleNamespace: true
  watchNamespace: asys-vpc-github-runner
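As a sketch, the equivalent change applied with --set flags, assuming the release and chart names used elsewhere in this thread:

helm upgrade --install actions-runner-controller \
  actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system \
  --set scope.singleNamespace=true \
  --set scope.watchNamespace=asys-vpc-github-runner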
Thanks for the investigation and the pointer to the area of the problem.
I tested this in my environment: setting singleNamespace: true and adding the namespace to watchNamespace as suggested has no effect. The RunnerPods still do not get the RUNNER_NAME injected. I tested this with [0.17.1, 0.17.2 and 0.17.3].
I can live with singleNamespace: false but it looks like this is not working for RunnerSets as expected.
The RunnerPods still do not get the RUNNER_NAME injected
Did you recreate the runner pod after you upgraded your helm release? The envvar is injected by the mutating webhook, which basically means you need to recreate any "broken" runner pods (by kubectl delete-ing the broken pods) to let it inject the runner name.
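For example, a sketch of forcing recreation (namespace and label selector are placeholders matching the examples in this thread):

# Deleting the broken pods lets the StatefulSet recreate them, giving the
# mutating webhook another chance to inject RUNNER_NAME.
kubectl delete pod -n asys-vpc-github-runner -l app=<your-runnerset-label>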
Did you recreate the runner pod after you upgraded your helm release? The envvar is injected by the mutating webhook, which basically means you need to recreate any "broken" runner pods (by kubectl delete-ing the broken pods) to let it inject the runner name.
Yes, I deleted the RunnerSet, uninstalled ARC, reinstalled it with the configuration to test, and let the newly configured controller create the StatefulSets and Pods on its own.
@tbomberg Thanks for the info! Hmm, well, it seems impossible. Would you also mind sharing your mutatingwebhookconfiguration that is installed via helm? I'm especially interested in whether it has a correct namespace selector (for asys-vpc-github-runner, as you seem to have RunnerSets and runner pods in that namespace). If it's there, the only cause would be that your k8s cluster is somehow broken and not respecting the mutating webhook. If it isn't there, almost certainly you've missed something while upgrading the helm chart.
For me the generated namespaceSelectors look just fine:
mutatingwebhookconfig.yaml.txt
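A sketch for pulling the selector straight from the cluster for comparison (the webhook configuration name below is an assumption; list them first to find the real one):

kubectl get mutatingwebhookconfigurations
# Name below is hypothetical; use whichever entry belongs to the ARC release.
kubectl get mutatingwebhookconfiguration actions-runner-controller -o yaml \
  | grep -B 2 -A 4 namespaceSelector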
I updated the way to reproduce the problem to use a kind cluster so it is easy to reproduce everywhere.
Steps to reproduce
kind create cluster

helm repo add jetstack https://charts.jetstack.io
helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
helm repo update

# Install Cert-Manager
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.8.0 \
  --set installCRDs=true

kubectl create namespace actions-runner-system

# Install the ARC App: https://github.com/settings/apps/new?url=http://github.com/actions-runner-controller/actions-runner-controller&webhook_active=false&public=false&administration=write&actions=read
# Obtain APP_ID INSTALLATION_ID and PRIVATE_KEY_FILE_PATH

# Create Secret
kubectl create secret generic controller-manager \
  -n actions-runner-system \
  --from-literal=github_app_id=${APP_ID} \
  --from-literal=github_app_installation_id=${INSTALLATION_ID} \
  --from-file=github_app_private_key=${PRIVATE_KEY_FILE_PATH}

helm upgrade --install --namespace actions-runner-system --create-namespace \
  --wait actions-runner-controller actions-runner-controller/actions-runner-controller \
  --set scope.singleNamespace=true --set scope.watchNamespace=actions-runner-system
export ARC_RUNNER_REPOSITORY=tbomberg/test-arc

cat <<EOF >kind-runnerset.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: kind-runnerset
  namespace: actions-runner-system
spec:
  ephemeral: false
  repository: ${ARC_RUNNER_REPOSITORY}
  labels:
    - kind-runnerset
  replicas: 1
  selector:
    matchLabels:
      app: kind-runnerset
  serviceName: kind-runnerset
  template:
    metadata:
      labels:
        app: kind-runnerset
EOF

kubectl apply -f kind-runnerset.yaml
# Catch the logs from the runner pod:
kubectl -n actions-runner-system logs -l app=kind-runnerset -c runner -f
You have to be quick to get the logs from the pod. Since 0.17.3 the StatefulSet and Pod are immediately removed and replaced upon any failure, which makes debugging not easy.
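One way to cope with the quick restarts is a plain retry loop, sketched here with the same selector as above:

# Re-attach to whichever runner pod currently exists, so short-lived pods
# still get their logs streamed before they are replaced.
while true; do
  kubectl -n actions-runner-system logs -l app=kind-runnerset -c runner -f
  sleep 1
done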
Also in this setup I get:
$ (kind-kind:default) kubectl -n actions-runner-system logs -l app=kind-runnerset -c runner -f
2022-04-13 10:34:03.322 DEBUG --- Docker enabled runner detected and Docker daemon wait is enabled
2022-04-13 10:34:03.327 DEBUG --- Waiting until Docker is available or the timeout is reached
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
unable to resolve docker endpoint: open /certs/client/ca.pem: no such file or directory
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
2022-04-13 10:34:11.117 DEBUG --- Github endpoint URL https://github.com/
2022-04-13 10:34:11.121 ERROR --- RUNNER_NAME must be set
--set scope.watchNamespace=actions-runner-system seems definitely wrong. Assuming your RunnerDeployment had a namespace of asys-vpc-github-runner, it should be --set scope.watchNamespace=asys-vpc-github-runner
As I wrote, I changed my setup to a kind cluster, staying as close to the defaults as possible to reproduce. I just want to keep my working setup running without having to tear it down for every question asked.
Thanks for that! But the fix here would be to change the helm chart value as I said.
The rationale is that since 0.22.0 RunnerDeployment relies on the mutating webhook and the same runner pod management logic that backs RunnerSet, which requires the RUNNER_NAME envvar to be set. watchNamespace needs to be configured properly to make the mutating webhook work, hence your issue.
In 0.21.x and below, RunnerDeployment didn't depend on the mutating webhook, so that's why it worked with your wrong watchNamespace setting.
To what should I set the watchNamespace? The controller and runners are located in the very same namespace, here now: actions-runner-system. Is this not supported? I can check if separate namespaces make any difference.
It does not make a difference. After reconfiguring the controller to watch a separate namespace and trying to set up a RunnerSet there, the problem still shows.
@tbomberg Ah sorry, I missed that you changed the namespace of the example RunnerSet to actions-runner-system. It should align with scope.watchNamespace and the namespaceSelector of the mutatingwebhookconfig, so almost everything looks good now.
The last missing piece, which I just noticed, might be that we missed labeling the actions-runner-system namespace with name=actions-runner-system.
Rereading your mutatingwebhookconfig:
namespaceSelector:
  matchLabels:
    name: actions-runner-system
This says that the namespace must be labeled with the name key, and a helm chart is unable to modify an existing namespace to have that label, so it must be done by you. Try kubectl label ns actions-runner-system name=actions-runner-system.
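A minimal sketch of applying and verifying that label, using the namespace from the example above:

kubectl label ns actions-runner-system name=actions-runner-system
# The new label should now appear alongside kubernetes.io/metadata.name.
kubectl get ns actions-runner-system --show-labels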
I wish there was any way to let a mutating webhook match the target namespace by name, but apparently there's no way 😢
@mumoshu Yes, i can confirm that all this was caused by the missing name label on the namespace
@mumoshu I just wanted to report that this was not fixed in 0.24.1 or 0.25.2 -- in both cases I tried to upgrade from 0.21.1 to 0.25.2 after updating all CRDs, and even wiped my entire cluster of CRDs, uninstalled everything, and installed from scratch to make sure.
I finally was able to stop my stateful set from constantly restarting after running:
kubectl label ns actions-runner-system name=actions-runner-system
@rxa313 Thanks for reporting! Yes, neither ARC nor the chart labels the namespace automatically, so I believe this is where we need to update the documentation (perhaps of our chart, next to the descriptions for the watchNamespace and singleNamespace values).
Hitting this issue as well.
I have multiple controllers in the same cluster, in the namespaces cicd--ci and cicd--cd, with the RunnerSets in the same namespaces. I set watchNamespace and labeled the namespaces properly:
k get namespace cicd--cd --show-labels
NAME STATUS AGE LABELS
cicd--cd Active 38d kubernetes.io/metadata.name=cicd--cd,name=cicd--cd
For RunnerDeployment it works fine; however, not for RunnerSet. I'm hitting the same issue: missing RUNNER_NAME and RUNNER_TOKEN. Any suggestions for running multiple controllers in one cluster?
I wish there was any way to let a mutating webhook match the target namespace by name, but apparently there's no way
@mumoshu I think we could use the well-known label kubernetes.io/metadata.name:
namespaceSelector:
  matchLabels:
    kubernetes.io/metadata.name: my-github-runners
According to the docs this is always set to the namespace name and is immutable. So by using this rather than name: ..., things would hook up correctly, I believe, unless the name: label is being used somewhere else.
If this is the only place name is used, I'd think this would be the following change in webhook_configs.yaml:
namespaceSelector:
  matchLabels:
    kubernetes.io/metadata.name: {{ default .Release.Namespace .Values.scope.watchNamespace }}
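For reference, a quick check that the well-known label is present on a namespace (namespace name taken from the example above; on Kubernetes 1.21+ it is set automatically and is immutable):

kubectl get ns my-github-runners \
  -o jsonpath='{.metadata.labels.kubernetes\.io/metadata\.name}'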
Describe the bug
After upgrading ARC to 0.22.1 (reproduced with 0.22.2) (chart version 0.17.1/0.17.2), we noticed that newly created pods from our RunnerSet fail to start and show this error message in the logs:
RUNNER_NAME must be set
I rolled back the controller to 0.22.0 (0.17.0) and the RunnerSets start like normal.
RunnerDeployments are working in all versions.
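A sketch of such a rollback, assuming the release and chart names used in the reproduction steps elsewhere in this issue:

# Pin the release back to chart 0.17.0 (ARC 0.22.0).
helm upgrade --install actions-runner-controller \
  actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system \
  --version 0.17.0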
Checks
To Reproduce
Install actions-runner-controller chart 0.17.1 with the following values. The pre-created secret holds the credentials from the GitHub App that is registered in the organisation.
Create a RunnerSet with this manifest:
The runner container exits with RC=1 and the following log:
Log of the manager container in ARC:
Delete the RunnerSet
Rollback ARC to chart 0.17.0
Recreate the RunnerSet => Runner Pods start normally and register at the GitHub organization
Expected behavior
Runner Pods from the RunnerSet StatefulSets start up and register successfully in the GitHub organization
Environment (please complete the following information):