Closed andypanix closed 4 years ago
Make sure you configure the pods like https://github.com/evryfs/github-actions-runner-operator/blob/master/config/samples/garo_v1alpha1_githubactionrunner.yaml#L55 and https://github.com/evryfs/github-actions-runner-operator/blob/master/config/samples/garo_v1alpha1_githubactionrunner.yaml#L66 so that they are able to unregister cleanly.
GitHubK8S operator for scheduling github actions runner pods - evryfs/github-actions-runner-operator
GitHubK8S operator for scheduling github actions runner pods - evryfs/github-actions-runner-operator
This is my runner manifest. It seems fine 🤔
apiVersion: garo.tietoevry.com/v1alpha1
kind: GithubActionRunner
metadata:
name: runner-pool
spec:
# minimum running pods
minRunners: 1
# max number of pods
maxRunners: 30
organization: sparkfabrik
# How often it will reconcile
reconciliationPeriod: 1m
# if runner for repo
# repository: "theRepoName"
tokenRef:
key: GH_TOKEN
name: actions-runner
podTemplateSpec:
spec:
containers:
- name: runner
env:
- name: RUNNER_DEBUG
value: "true"
- name: DOCKER_TLS_CERTDIR
value: /certs
- name: DOCKER_HOST
value: tcp://localhost:2376
- name: DOCKER_TLS_VERIFY
value: "1"
- name: DOCKER_CERT_PATH
value: /certs/client
- name: GH_ORG
value: sparkfabrik
- name: RUNNER_LABELS
value: sparkfabrik,spark-ubuntu-18.04
# if runner for repo:
# - name: GH_REPO
# value: theRepoName
envFrom:
- secretRef:
name: actions-runner
image: quay.io/evryfs/github-actions-runner:latest
imagePullPolicy: Always
lifecycle:
preStop:
exec:
command:
- /bin/bash
- -c
- /remove_runner.sh
resources: {}
volumeMounts:
- mountPath: /certs
name: docker-certs
- mountPath: /home/runner/.m2
name: mvn-repo
- name: sharedv
mountPath: /home/runner/_work
- name: docker
env:
- name: DOCKER_TLS_CERTDIR
value: /certs
image: docker:stable-dind
imagePullPolicy: Always
resources: {}
securityContext:
privileged: true
volumeMounts:
- mountPath: /var/lib/docker
name: docker-storage
- mountPath: /certs
name: docker-certs
- name: sharedv
mountPath: /home/runner/_work
volumes:
- emptyDir: {}
name: mvn-repo
- emptyDir: {}
name: docker-storage
- emptyDir: {}
name: docker-certs
- name: sharedv
emptyDir: {}
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values:
- gitlab-ci-build
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values:
- gitlab-ci-build
I'm suspecting it's a resource problem.
Let's me try assigning a minimum of resources request to the runner and let's see if I can reproduce again the error.
Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-s6qkn_saas-github(742ecdfe-e213-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 137: , message: "\n# Runner removal\n\n"
is the problem to solve.
Is the token still valid? Does it have sufficient rights? Is the token key named GH_TOKEN
https://github.com/evryfs/github-actions-runner/blob/master/remove_runner.sh#L14 ?
Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-s6qkn_saas-github(742ecdfe-e213-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 137: , message: "\n# Runner removal\n\n"
is the problem to solve. Is the token still valid? Does it have sufficient rights? Is the token key named
GH_TOKEN
https://github.com/evryfs/github-actions-runner/blob/master/remove_runner.sh#L14 ?
Yes, the token is ok, and still valid.
Confirming that it doesn't seem to be a token related problem is that most runners are terminated without errors.
Monitoring memory usage of runner pods, I've seen that some of them are using more then 2600Mi or ram. I've added a resource request in runner manifest as this:
resources:
requests:
cpu: 500m
memory: 2048Mi
and a preliminary test seems to confirm that the problem is solved. The scale up / down of runner pods is ok and they are cleanly unregistered (and no more FailedPreStopHook warnings in k8s events).
But when the workflow's queue is done and the runners scale back to 1 in idle state, the currentpoolsize is still to a negative value:
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-d54d7bff-842hj 1/1 Running 0 67m
runner-pool-pod-vzfx6 2/2 Running 0 22m
NAME CURRENTPOOLSIZE
runner-pool -19
Anyway, this seems not to be a problem, since when scaling up again, the CURRENTPOOLSIZE value is fine:
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-d54d7bff-842hj 1/1 Running 0 72m
runner-pool-pod-dtw9n 2/2 Running 0 2m58s
runner-pool-pod-lqpqg 2/2 Running 0 4m58s
runner-pool-pod-n547m 2/2 Running 0 57s
runner-pool-pod-nzm7w 2/2 Running 0 3m58s
runner-pool-pod-smqrs 2/2 Running 0 5m59s
runner-pool-pod-vb4hc 2/2 Running 0 117s
runner-pool-pod-vzfx6 2/2 Running 0 28m
NAME CURRENTPOOLSIZE
runner-pool 7
But when scaling down, the CURRENTPOOLSIZE value is still incorrect:
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-d54d7bff-842hj 1/1 Running 0 76m
runner-pool-pod-dtw9n 2/2 Running 0 7m10s
runner-pool-pod-lbbmk 2/2 Running 0 4m9s
runner-pool-pod-lqpqg 2/2 Running 0 9m10s
runner-pool-pod-n547m 0/2 Terminating 0 5m9s
runner-pool-pod-nzm7w 0/2 Terminating 0 8m10s
runner-pool-pod-vzfx6 2/2 Running 0 32m
NAME CURRENTPOOLSIZE
runner-pool -14
GitHubContribute to evryfs/github-actions-runner development by creating an account on GitHub.
I also found a bug which was introduced in 77bb97f6 due to: https://github.com/operator-framework/operator-sdk/issues/1164
Hi @davidkarlsen
I'm still getting negative values, apparently when terminating the runner, the counter goes negative 🤔
I'm using chart version 2.2.1
Every 5.0s: kubectl get po && kubectl get GithubActionRunner/runner-pool Maxwell: Fri Aug 21 12:59:01 2020
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 36m
runner-pool-pod-4jd64 2/2 Running 0 36m
runner-pool-pod-6q6h5 2/2 Running 0 4m55s
runner-pool-pod-czch4 2/2 Running 0 6m56s
runner-pool-pod-f7xbf 2/2 Running 0 7m57s
runner-pool-pod-tjclm 2/2 Running 0 13m
runner-pool-pod-zzk5r 2/2 Running 0 8m57s
NAME CURRENTPOOLSIZE
runner-pool -9
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 37m
runner-pool-pod-4jd64 2/2 Running 0 36m
runner-pool-pod-6q6h5 2/2 Running 0 5m38s
runner-pool-pod-czch4 2/2 Running 0 7m39s
runner-pool-pod-f7xbf 2/2 Running 0 8m40s
runner-pool-pod-m44kb 2/2 Running 0 37s
runner-pool-pod-tjclm 2/2 Running 0 14m
runner-pool-pod-zzk5r 2/2 Running 0 9m40s
NAME CURRENTPOOLSIZE
runner-pool 7
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 38m
runner-pool-pod-4jd64 2/2 Running 0 38m
runner-pool-pod-6q6h5 2/2 Terminating 0 7m3s
runner-pool-pod-czch4 2/2 Running 0 9m4s
runner-pool-pod-f7xbf 2/2 Terminating 0 10m
runner-pool-pod-zzk5r 2/2 Running 0 11m
NAME CURRENTPOOLSIZE
runner-pool -14
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 38m
runner-pool-pod-4jd64 2/2 Running 0 38m
runner-pool-pod-czch4 2/2 Running 0 9m20s
runner-pool-pod-f7xbf 0/2 Terminating 0 10m
runner-pool-pod-zzk5r 2/2 Running 0 11m
NAME CURRENTPOOLSIZE
runner-pool -20
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 39m
runner-pool-pod-4jd64 2/2 Running 0 38m
runner-pool-pod-czch4 2/2 Running 0 9m41s
runner-pool-pod-zzk5r 2/2 Running 0 11m
NAME CURRENTPOOLSIZE
runner-pool -20
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 39m
runner-pool-pod-4jd64 2/2 Running 0 39m
runner-pool-pod-686gg 2/2 Running 0 5s
runner-pool-pod-czch4 2/2 Running 0 10m
runner-pool-pod-zzk5r 2/2 Running 0 12m
NAME CURRENTPOOLSIZE
runner-pool 4
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 41m
runner-pool-pod-4jd64 0/2 Terminating 0 41m
runner-pool-pod-686gg 2/2 Terminating 0 2m7s
runner-pool-pod-czch4 0/2 Terminating 0 12m
runner-pool-pod-zzk5r 0/2 Terminating 0 14m
NAME CURRENTPOOLSIZE
runner-pool -11
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 42m
runner-pool-pod-4jd64 0/2 Terminating 0 41m
NAME CURRENTPOOLSIZE
runner-pool -11
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 42m
NAME CURRENTPOOLSIZE
runner-pool -11
NAME READY STATUS RESTARTS AGE
github-actions-runner-operator-5fb9974c44-bdqfj 1/1 Running 0 42m
runner-pool-pod-5vs54 2/2 Running 0 5s
NAME CURRENTPOOLSIZE
runner-pool 1
and an error when trying to deregister the runner:
10m Normal Created pod/runner-pool-pod-4rzcw Created container docker
10m Normal Started pod/runner-pool-pod-4rzcw Started container docker
9m52s Normal Killing pod/runner-pool-pod-4rzcw Stopping container docker
9m52s Normal Killing pod/runner-pool-pod-p2wrz Stopping container docker
9m51s Normal Scaling githubactionrunner/runner-pool Deleted pod saas-github/runner-pool-pod-p2wrz
9m52s Normal Killing pod/runner-pool-pod-p2wrz Stopping container runner
9m50s Normal Scaling githubactionrunner/runner-pool Deleted pod saas-github/runner-pool-pod-5lpp7
9m52s Normal Scaling githubactionrunner/runner-pool Deleted pod saas-github/runner-pool-pod-4rzcw
9m52s Normal Killing pod/runner-pool-pod-5lpp7 Stopping container docker
9m52s Normal Killing pod/runner-pool-pod-5lpp7 Stopping container runner
9m52s Normal Killing pod/runner-pool-pod-4rzcw Stopping container runner
9m50s Warning FailedPreStopHook pod/runner-pool-pod-p2wrz Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-p2wrz_saas-github(d1de119c-e3fc-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 1: Failed: Removing runner from the server
Runner "runner-pool-pod-p2wrz" is still running a job"
, message: "\n# Runner removal\n\nFailed: Removing runner from the server\nRunner \"runner-pool-pod-p2wrz\" is still running a job\"\n"
8m52s Normal Killing pod/runner-pool-pod-q9q8m Stopping container runner
8m50s Normal Scaling githubactionrunner/runner-pool Deleted pod saas-github/runner-pool-pod-q9q8m
8m52s Normal Killing pod/runner-pool-pod-q9q8m Stopping container docker
Can you try image.tag=v0.5.1-6-g73df6ea
(https://quay.io/repository/evryfs/github-actions-runner-operator?tag=v0.5.1-6-g73df6ea&tab=tags)
I have some preventive actions, although it is not bullet proof as explained in https://github.com/evryfs/github-actions-runner-operator/blob/master/README.md#L32
GitHubK8S operator for scheduling github actions runner pods - evryfs/github-actions-runner-operator
Hi @davidkarlsen I see there are more recent images than v0.5.1-6-g73df6ea, should I still stay on 0.5.1-6 or try the latest one?
Thanks A.
I used some duct tape sleep to fix it.
Did a release and chart-update.
Great! ;)
And from a first test, I can confirm that from the problem is resolved.
Thanks a lot! A.
Hi @davidkarlsen,
I've noticed a strange behavior. After running some jobs and after seeing the runner pool pods correctly scaling up and down, the runner pool pods scaled to 0. Inspecting the cluster this is what I saw:
Besides, on GitHub "Self-hosted runners" config pane I see two (non-existant) offline runners:
I'm using version v2.2.0 of github-actions-runner-operator chart.
Testing again (with a new clean installation of the operator and the runner), I was able to replicate the problem that seems related to the termination of the runner pod:
And here some interesting events:
Thanks!