iamcaleberic opened 7 months ago
Updating and pinning image.dindSidecarRepositoryAndTag to docker:24.0.7-dind-alpine3.18 appears to resolve it.
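For anyone applying this via Helm, a minimal sketch of the override follows. The release name `arc`, the namespace, and the values filename are placeholders, not taken from the thread; only the `image.dindSidecarRepositoryAndTag` key and tag come from the comments above.

```shell
# Write a small values override pinning the dind sidecar image.
# "arc-dind-pin.yaml" is just an illustrative filename.
cat > arc-dind-pin.yaml <<'EOF'
image:
  dindSidecarRepositoryAndTag: "docker:24.0.7-dind-alpine3.18"
EOF

# Re-deploy the chart with the override (release/namespace are placeholders):
# helm upgrade arc actions-runner-controller/actions-runner-controller \
#   -n actions-runner-system -f arc-dind-pin.yaml

grep 'dindSidecarRepositoryAndTag' arc-dind-pin.yaml
```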
@iamcaleberic yes, we had the same issue and the workaround works 👍
Seeing the same issue here running in GKE. We're also dealing with a problem where this morning we ended up with 10,000 runners (triggering secondary rate limiting), the vast majority of them 'offline'.
Is there any chance that there is a relationship between this and runners being left in an offline state, as they fail to come online cleanly and the ARC controller (v0.26.0) does not properly de-register them from GitHub?
EDIT/UPDATE: After we implemented the fix to pin the docker sidecar to docker:24.0.7-dind-alpine3.18, we no longer saw the issue with 'offline' runners building up, and we believe the two are related.
Sorry for the naive question, but where are you specifying image.dindSidecarRepositoryAndTag? I'm not seeing any mention of it in actions-runner-controller.yaml. Is this perhaps a Helm thing? Surely it has a kubectl/yaml-only representation too? Thank you for the great tips.
We are experiencing the same issue.
@verult was able to patch the command directly, like this, at line 34342 in version 0.26 of actions-runner-controller.yaml:
containers:
- args:
  - --metrics-addr=127.0.0.1:8080
  - --enable-leader-election
  # Temporary workaround for https://github.com/actions/actions-runner-controller/issues/3159
  - --docker-image=docker:24.0.7-dind-alpine3.18
  command:
  - /manager
  env:
Thanks for the suggestions, everyone. We got our runners working again, but the pods won't terminate. We use ephemeral runners, and the docker issue impacted us today: the runners couldn't start, and we reached the Runner Group 10k limit. Once the runners started again, they were not cleaned up and stayed in a Terminating phase. We're still trying to figure out why. We tested various combinations of chart and app versions, but at least running v0.26.0 with --docker-image=docker:24.0.7-dind-alpine3.18 resulted in pods lingering. We're also using a custom runner image, which could be a factor. We'll keep investigating.
@LaloLoop we ran into the issue of pods getting stuck in the Terminating phase after we deleted the runner controller, because there were finalizers left on these pods. Is your controller running when your pods are stuck?
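If the controller is gone and pods are stuck in Terminating because of leftover finalizers, one common workaround is to clear the finalizers with a merge patch. This is a sketch, not from the thread: the pod name and namespace are placeholders, and clearing finalizers skips whatever cleanup they guarded, so only do it when the controller is no longer going to reconcile those pods.

```shell
# Merge patch that empties the finalizer list on a stuck pod.
MERGE='{"metadata":{"finalizers":null}}'

# Placeholder pod name/namespace; run against each stuck pod:
# kubectl patch pod <stuck-runner-pod> -n actions-runner-system \
#   --type=merge -p "$MERGE"

echo "$MERGE"
```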
Thanks for pointing that out @verult . We reached the rate limit as described by @billimek . That caused the controller to panic continuously and fail to reconcile. We're using 0.26.0, not sure if newer versions have better error handling/retries. I guess we're going to have to wait for it to reset before trying anything. Changing anything in our runners at the moment triggers the rate limit and everything fails, even though we don't use polling for the autoscalers.
@joshgc you can find in here https://github.com/actions/actions-runner-controller/blob/master/charts/actions-runner-controller/values.yaml#L55
Thanks @iamcaleberic it did magic and worked for us as well.
Same error for me.
Fixed with : dindSidecarRepositoryAndTag: "docker:24.0.7-dind-alpine3.18"
For those running an autoscaling runner set: I tried to update template.spec.containers.dind to 24.0.7-dind-alpine3.18 and it didn't work; it retained the value docker:dind. I know my syntax is correct because I also pin our custom image in containers.
I manually updated the autoscalingrunnerset CRD to docker:24.0.7-dind-alpine3.18 and this seems to work as well.
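A sketch of what that manual CRD update can look like as a JSON patch. The resource name `my-runners`, the namespace `arc-runners`, and the container index `1` for the dind sidecar are all assumptions — check your own spec before applying, since the dind container may sit at a different index.

```shell
# JSON patch replacing the dind container image on an AutoscalingRunnerSet.
# Container index 1 is an assumption about where dind sits in the spec.
PATCH='[{"op": "replace", "path": "/spec/template/spec/containers/1/image", "value": "docker:24.0.7-dind-alpine3.18"}]'

# Placeholder resource name and namespace:
# kubectl patch autoscalingrunnerset my-runners -n arc-runners \
#   --type=json -p "$PATCH"

echo "$PATCH"
```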
My question is: why is this not pinned to a stable version instead of "latest"? It exposes us to unstable updates that can lead to downtime or interruption.
If this is still an issue for some folks and you are still dealing with ~10,000 offline runners triggering the secondary rate limiting, the following script snippet may be useful for removing the offline runners:
#!/bin/bash
while true; do
  echo "Fetching more runners"
  RESPONSE=$(gh api \
    -H "Accept: application/vnd.github+json" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    /orgs/<YOUR ORG>/actions/runners)
  echo "Total runners: $(echo "$RESPONSE" | jq '.total_count')"
  OFFLINE_RUNNERS="$(echo "$RESPONSE" | jq '.runners | map(select(.status == "offline"))')"
  # No trailing whitespace here, or the emptiness check below never fires
  RUNNERS="$(echo "$OFFLINE_RUNNERS" | jq '.[].id')"
  # Delete each offline runner by id
  for RUNNER in $RUNNERS; do
    echo "Removing runner: $RUNNER"
    gh api \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      -X DELETE \
      "/orgs/<YOUR ORG>/actions/runners/$RUNNER" >> removal.logs
  done
  # If there were no offline runners, stop
  if [ -z "$RUNNERS" ]; then
    echo "Done!"
    break
  fi
done
... or the following action may accomplish the same thing as well (just don't run it on self hosted runners where you are experiencing this issue!): some-natalie/runner-reaper.
It's my understanding that GitHub should automatically remove offline runners after 24h, but the symptom of this issue seems to be that it very quickly ramps up the number of offline runners, making that automation not viable unless or until you correct the pinned docker version.
It also looks like the upstream docker:dind image was corrected, so your system may self-correct over time anyway.
As @iamcaleberic pointed out, if you're deploying the actions-runner-controller helm chart, the relevant values line to override when re-deploying the fix is located here.
If you're running the newer gha-runner-scale-set chart and it's exhibiting the same issue (we don't currently run that one, so it's unclear whether the scale set is affected), it looks like the necessary modification relates to the template spec definition here.
Running the newer gha-runner-scale-set, and overriding the spec in the values.yaml with a new docker tag doesn't seem to make a difference; it stays on docker:dind.
Anyone managed to find a workaround for it?
> Running the newer gha-runner-scale-set, and overriding the spec in the values.yaml with a new docker tag doesn't seem to make a difference; it stays on docker:dind. Anyone managed to find a workaround for it?

Update the CRD manually under autoscalingrunnerset and patch it.
> Running the newer gha-runner-scale-set, and overriding the spec in the values.yaml with a new docker tag doesn't seem to make a difference; it stays on docker:dind. Anyone managed to find a workaround for it?

This worked for me: https://github.com/jamezrin/personal-actions-runner-setup/blob/main/gha-runner-scale-set-dind-fix.yaml#L24C53-L24C117
> Updating and pinning image.dindSidecarRepositoryAndTag to docker:24.0.7-dind-alpine3.18 appears to resolve it.
So, how do we do that? I have the following manifests below and I don't know where to add it.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
  namespace: actions-runner-system
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  labels:
    name: example-runnerdeploy
spec:
  replicas: 1
  template:
    spec:
      repository: farrukh90/symmetrical-fortnight
      image: farrukhsadykov/runner:latest
      labels:
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runnerdeploy
  namespace: actions-runner-system
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  labels:
    name: example-runnerdeploy
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  scaleDownDelaySecondsAfterScaleOut: 300
  minReplicas: 2
  maxReplicas: 20
  metrics:
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-runnerdeploy
  namespace: actions-runner-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-runnerdeploy
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: for-aws-tasks
parameters:
  type: pd-standard
provisioner: kubernetes.io/gce-pd
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: false
image.dindSidecarRepositoryAndTag is set at the Helm level.
Is this fixed now, or should we stick to the pinned version of the image?
> Running the newer gha-runner-scale-set, and overriding the spec in the values.yaml with a new docker tag doesn't seem to make a difference; it stays on docker:dind. Anyone managed to find a workaround for it?

This worked for me: https://github.com/jamezrin/personal-actions-runner-setup/blob/main/gha-runner-scale-set-dind-fix.yaml#L24C53-L24C117
For the gha scale set, I've ended up leaving container mode empty and updating the template to include the same spec that gets created when container mode is dind, only with the new docker tag. I found this a better solution for me, since I can keep using the helm chart I already had, in the hope that there will be a fix supporting a dind image tag from values.yaml.
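A rough sketch of that approach: containerMode left unset in the scale-set values, with the dind sidecar declared explicitly in template.spec and the image pinned. The container names, runner image, and command are illustrative; the chart's dind mode also wires up volumes and an init container, elided here, so copy the full generated spec from your cluster rather than this fragment.

```shell
# Illustrative values fragment for gha-runner-scale-set with containerMode
# unset and an explicit, pinned dind sidecar. Not a complete spec.
cat > scale-set-manual-dind.yaml <<'EOF'
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
      - name: dind
        image: docker:24.0.7-dind-alpine3.18
        securityContext:
          privileged: true
EOF

grep 'image:' scale-set-manual-dind.yaml
```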
A fix has been implemented upstream in docker:dind; however, it now requires this helm chart (or us, its users) to set a new variable.
https://github.com/docker-library/docker/pull/468#issuecomment-1878086606
Set DOCKER_IPTABLES_LEGACY=1 inside your dind pod, via an override of the helm chart's default variables (this should get added to the helm chart, if someone wants an easy PR).
The change should go right after these lines, for anyone with a minute to open the PR to the chart: https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/templates/_helpers.tpl#L106 and https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/values.yaml#L142
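Until that PR lands, the env var can be injected from the scale-set values. This is a sketch under assumptions: the sidecar container name `dind` matches what the chart's dind mode generates, but the exact merge behavior of template.spec against containerMode isn't verified here.

```shell
# Illustrative values override adding DOCKER_IPTABLES_LEGACY=1 to the
# dind sidecar; container name "dind" is assumed from the chart's template.
cat > scale-set-dind-env.yaml <<'EOF'
template:
  spec:
    containers:
      - name: dind
        env:
          - name: DOCKER_IPTABLES_LEGACY
            value: "1"
EOF

grep 'DOCKER_IPTABLES_LEGACY' scale-set-dind-env.yaml
```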
Checks
Controller Version
v0.27.6
Helm Chart Version
0.23.6
CertManager Version
1.13.2
Deployment Method
Helm
cert-manager installation
Are you sure you've installed cert-manager from an official source? Yes, using the official jetstack helm repo.
Checks
Resource Definitions
To Reproduce
Describe the bug
The docker dind sidecar errors out and does not start, and the runner pods end up restarting every 120 seconds, which is the timeout for docker.
Might be related to
https://github.com/docker-library/docker/commit/4c2674df4f40c965cdb8ccc77b8ce9dbc247a6c9 https://github.com/docker-library/docker/pull/437
Describe the expected behavior
The dind sidecar should start.
Whole Controller Logs