evryfs / github-actions-runner-operator

K8S operator for scheduling github actions runner pods
Apache License 2.0
430 stars 53 forks source link

Negative CURRENTPOOLSIZE for GithubActionRunner #85

Closed andypanix closed 4 years ago

andypanix commented 4 years ago

Hi @davidkarlsen,

I've noticed a strange behavior. After running some jobs and after seeing the runner pool pods correctly scaling up and down, the runner pool pods scaled to 0. Inspecting the cluster this is what I saw:

> kubectl get po
NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5b99949466-tgnhh   1/1     Running   0          54m
> kubectl get GithubActionRunner/runner-pool
NAME          CURRENTPOOLSIZE
runner-pool   -517

Besides, on GitHub "Self-hosted runners" config pane I see two (non-existant) offline runners:

image

I'm using version v2.2.0 of github-actions-runner-operator chart.

Testing again (with a new clean installation of the operator and the runner), I was able to replicate the problem that seems related to the termination of the runner pod:

> k get po && k get GithubActionRunner/runner-pool
NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-775997cd8c-sl6h6   1/1     Running   0          26m
runner-pool-pod-44l6p                             2/2     Running   0          19m
runner-pool-pod-dr56b                             2/2     Running   0          16m
runner-pool-pod-dx7c4                             2/2     Running   0          21m
runner-pool-pod-glxrt                             2/2     Running   1          7m37s
runner-pool-pod-lhjxv                             2/2     Running   0          25m
runner-pool-pod-n8ggf                             2/2     Running   1          10m
runner-pool-pod-s6qkn                             2/2     Running   0          18m
runner-pool-pod-szfkq                             2/2     Running   0          20m
runner-pool-pod-wqmcb                             2/2     Running   0          4m36s
NAME          CURRENTPOOLSIZE
runner-pool   9

 ~/www/github/actions-runner .............................................................................................. spark-op-services/saas-github kube | 14:18:18
> k get po && k get GithubActionRunner/runner-pool
NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-775997cd8c-sl6h6   1/1     Running   0          29m
runner-pool-pod-44l6p                             2/2     Running   0          23m
runner-pool-pod-dr56b                             2/2     Running   0          20m
runner-pool-pod-dx7c4                             2/2     Running   0          25m
runner-pool-pod-glxrt                             2/2     Running   1          11m
runner-pool-pod-lhjxv                             2/2     Running   0          29m
runner-pool-pod-n8ggf                             2/2     Running   1          14m
runner-pool-pod-szfkq                             2/2     Running   0          24m
NAME          CURRENTPOOLSIZE
runner-pool   -180

 ~/www/github/actions-runner .............................................................................................. spark-op-services/saas-github kube | 14:21:49
> k get po && k get GithubActionRunner/runner-pool
NAME                                              READY   STATUS        RESTARTS   AGE
github-actions-runner-operator-775997cd8c-sl6h6   1/1     Running       0          34m
runner-pool-pod-44l6p                             2/2     Running       0          28m
runner-pool-pod-dr56b                             2/2     Running       0          25m
runner-pool-pod-dx7c4                             2/2     Running       0          30m
runner-pool-pod-glxrt                             2/2     Running       1          16m
runner-pool-pod-lhjxv                             2/2     Running       0          34m
runner-pool-pod-n8ggf                             2/2     Terminating   1          19m
runner-pool-pod-szfkq                             2/2     Running       0          29m
NAME          CURRENTPOOLSIZE
runner-pool   -217

And here some interesting events:

> kubectl get events
15m         Normal    Pulling             pod/runner-pool-pod-wqmcb                              Pulling image "sparkfabrik/github-actions-runner:latest"
15m         Normal    Pulled              pod/runner-pool-pod-wqmcb                              Successfully pulled image "sparkfabrik/github-actions-runner:latest"
15m         Normal    Created             pod/runner-pool-pod-wqmcb                              Created container runner
15m         Normal    Pulling             pod/runner-pool-pod-wqmcb                              Pulling image "docker:stable-dind"
15m         Normal    Started             pod/runner-pool-pod-wqmcb                              Started container runner
15m         Normal    Pulled              pod/runner-pool-pod-wqmcb                              Successfully pulled image "docker:stable-dind"
15m         Normal    Created             pod/runner-pool-pod-wqmcb                              Created container docker
15m         Normal    Started             pod/runner-pool-pod-wqmcb                              Started container docker
9m32s       Normal    Killing             pod/runner-pool-pod-s6qkn                              Stopping container docker
2m30s       Normal    Scaling             githubactionrunner/runner-pool                         (combined from similar events): Deleted pod saas-github/runner-pool-pod-n8ggf
9m32s       Normal    Killing             pod/runner-pool-pod-s6qkn                              Stopping container runner
9m31s       Normal    Killing             pod/runner-pool-pod-wqmcb                              Stopping container docker
9m31s       Normal    Killing             pod/runner-pool-pod-wqmcb                              Stopping container runner
9m27s       Normal    Scaling             githubactionrunner/runner-pool                         Deleted pod saas-github/runner-pool-pod-wqmcb
9m27s       Normal    Scaling             githubactionrunner/runner-pool                         Deleted pod saas-github/runner-pool-pod-s6qkn
8m59s       Warning   FailedPreStopHook   pod/runner-pool-pod-s6qkn                              Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-s6qkn_saas-github(742ecdfe-e213-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 137: , message: "\n# Runner removal\n\n"
8m58s       Warning   FailedPreStopHook   pod/runner-pool-pod-wqmcb                              Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-wqmcb_saas-github(6c08a1c6-e215-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 137: , message: "\n# Runner removal\n\n"
2m30s       Normal    Killing             pod/runner-pool-pod-n8ggf                              Stopping container docker
2m30s       Normal    Killing             pod/runner-pool-pod-n8ggf                              Stopping container runner
115s        Normal    Killing             pod/runner-pool-pod-dx7c4                              Stopping container docker
115s        Normal    Killing             pod/runner-pool-pod-dx7c4                              Stopping container runner
100s        Normal    Scaling             githubactionrunner/runner-pool                         Deleted pod saas-github/runner-pool-pod-dx7c4

Thanks!

davidkarlsen commented 4 years ago

Make sure you configure the pods like https://github.com/evryfs/github-actions-runner-operator/blob/master/config/samples/garo_v1alpha1_githubactionrunner.yaml#L55 and https://github.com/evryfs/github-actions-runner-operator/blob/master/config/samples/garo_v1alpha1_githubactionrunner.yaml#L66 so that they are able to unregister cleanly.

GitHub
evryfs/github-actions-runner-operator
K8S operator for scheduling github actions runner pods - evryfs/github-actions-runner-operator
GitHub
evryfs/github-actions-runner-operator
K8S operator for scheduling github actions runner pods - evryfs/github-actions-runner-operator
andypanix commented 4 years ago

This is my runner manifest. It seems fine 🤔

apiVersion: garo.tietoevry.com/v1alpha1
kind: GithubActionRunner
metadata:
  name: runner-pool
spec:
  # minimum running pods
  minRunners: 1
  # max number of pods
  maxRunners: 30
  organization: sparkfabrik
  # How often it will reconcile
  reconciliationPeriod: 1m
  # if runner for repo
  # repository: "theRepoName"
  tokenRef:
    key: GH_TOKEN
    name: actions-runner
  podTemplateSpec:
    spec:
      containers:
        - name: runner
          env:
            - name: RUNNER_DEBUG
              value: "true"
            - name: DOCKER_TLS_CERTDIR
              value: /certs
            - name: DOCKER_HOST
              value: tcp://localhost:2376
            - name: DOCKER_TLS_VERIFY
              value: "1"
            - name: DOCKER_CERT_PATH
              value: /certs/client
            - name: GH_ORG
              value: sparkfabrik
            - name: RUNNER_LABELS
              value: sparkfabrik,spark-ubuntu-18.04
          # if runner for repo:
          # - name: GH_REPO
          #   value: theRepoName
          envFrom:
            - secretRef:
                name: actions-runner
          image: quay.io/evryfs/github-actions-runner:latest
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/bash
                  - -c
                  - /remove_runner.sh
          resources: {}
          volumeMounts:
            - mountPath: /certs
              name: docker-certs
            - mountPath: /home/runner/.m2
              name: mvn-repo
            - name: sharedv
              mountPath: /home/runner/_work
        - name: docker
          env:
            - name: DOCKER_TLS_CERTDIR
              value: /certs
          image: docker:stable-dind
          imagePullPolicy: Always
          resources: {}
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /var/lib/docker
              name: docker-storage
            - mountPath: /certs
              name: docker-certs
            - name: sharedv
              mountPath: /home/runner/_work
      volumes:
        - emptyDir: {}
          name: mvn-repo
        - emptyDir: {}
          name: docker-storage
        - emptyDir: {}
          name: docker-certs
        - name: sharedv
          emptyDir: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-nodepool
                    operator: In
                    values:
                      - gitlab-ci-build
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: cloud.google.com/gke-nodepool
                    operator: In
                    values:
                      - gitlab-ci-build
andypanix commented 4 years ago

I'm suspecting it's a resource problem.

Let's me try assigning a minimum of resources request to the runner and let's see if I can reproduce again the error.

davidkarlsen commented 4 years ago
Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-s6qkn_saas-github(742ecdfe-e213-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 137: , message: "\n# Runner removal\n\n"

is the problem to solve. Is the token still valid? Does it have sufficient rights? Is the token key named GH_TOKEN https://github.com/evryfs/github-actions-runner/blob/master/remove_runner.sh#L14 ?

andypanix commented 4 years ago
Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-s6qkn_saas-github(742ecdfe-e213-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 137: , message: "\n# Runner removal\n\n"

is the problem to solve. Is the token still valid? Does it have sufficient rights? Is the token key named GH_TOKEN https://github.com/evryfs/github-actions-runner/blob/master/remove_runner.sh#L14 ?

Yes, the token is ok, and still valid.

Confirming that it doesn't seem to be a token related problem is that most runners are terminated without errors.

Monitoring memory usage of runner pods, I've seen that some of them are using more then 2600Mi or ram. I've added a resource request in runner manifest as this:

          resources:
            requests:
              cpu: 500m
              memory: 2048Mi

and a preliminary test seems to confirm that the problem is solved. The scale up / down of runner pods is ok and they are cleanly unregistered (and no more FailedPreStopHook warnings in k8s events).

But when the workflow's queue is done and the runners scale back to 1 in idle state, the currentpoolsize is still to a negative value:

NAME                                            READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-d54d7bff-842hj   1/1     Running   0          67m
runner-pool-pod-vzfx6                           2/2     Running   0          22m
NAME          CURRENTPOOLSIZE
runner-pool   -19

Anyway, this seems not to be a problem, since when scaling up again, the CURRENTPOOLSIZE value is fine:

NAME                                            READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-d54d7bff-842hj   1/1     Running   0          72m
runner-pool-pod-dtw9n                           2/2     Running   0          2m58s
runner-pool-pod-lqpqg                           2/2     Running   0          4m58s
runner-pool-pod-n547m                           2/2     Running   0          57s
runner-pool-pod-nzm7w                           2/2     Running   0          3m58s
runner-pool-pod-smqrs                           2/2     Running   0          5m59s
runner-pool-pod-vb4hc                           2/2     Running   0          117s
runner-pool-pod-vzfx6                           2/2     Running   0          28m
NAME          CURRENTPOOLSIZE
runner-pool   7

But when scaling down, the CURRENTPOOLSIZE value is still incorrect:

NAME                                            READY   STATUS        RESTARTS   AGE
github-actions-runner-operator-d54d7bff-842hj   1/1     Running       0          76m
runner-pool-pod-dtw9n                           2/2     Running       0          7m10s
runner-pool-pod-lbbmk                           2/2     Running       0          4m9s
runner-pool-pod-lqpqg                           2/2     Running       0          9m10s
runner-pool-pod-n547m                           0/2     Terminating   0          5m9s
runner-pool-pod-nzm7w                           0/2     Terminating   0          8m10s
runner-pool-pod-vzfx6                           2/2     Running       0          32m
NAME          CURRENTPOOLSIZE
runner-pool   -14
GitHub
evryfs/github-actions-runner
Contribute to evryfs/github-actions-runner development by creating an account on GitHub.
davidkarlsen commented 4 years ago

I also found a bug which was introduced in 77bb97f6 due to: https://github.com/operator-framework/operator-sdk/issues/1164

andypanix commented 4 years ago

Hi @davidkarlsen

I'm still getting negative values, apparently when terminating the runner, the counter goes negative 🤔

I'm using chart version 2.2.1

Every 5.0s: kubectl get po && kubectl get GithubActionRunner/runner-pool Maxwell: Fri Aug 21 12:59:01 2020

NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running   0          36m
runner-pool-pod-4jd64                             2/2     Running   0          36m
runner-pool-pod-6q6h5                             2/2     Running   0          4m55s
runner-pool-pod-czch4                             2/2     Running   0          6m56s
runner-pool-pod-f7xbf                             2/2     Running   0          7m57s
runner-pool-pod-tjclm                             2/2     Running   0          13m
runner-pool-pod-zzk5r                             2/2     Running   0          8m57s
NAME          CURRENTPOOLSIZE
runner-pool   -9

NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running   0          37m
runner-pool-pod-4jd64                             2/2     Running   0          36m
runner-pool-pod-6q6h5                             2/2     Running   0          5m38s
runner-pool-pod-czch4                             2/2     Running   0          7m39s
runner-pool-pod-f7xbf                             2/2     Running   0          8m40s
runner-pool-pod-m44kb                             2/2     Running   0          37s
runner-pool-pod-tjclm                             2/2     Running   0          14m
runner-pool-pod-zzk5r                             2/2     Running   0          9m40s
NAME          CURRENTPOOLSIZE
runner-pool   7

NAME                                              READY   STATUS        RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running       0          38m
runner-pool-pod-4jd64                             2/2     Running       0          38m
runner-pool-pod-6q6h5                             2/2     Terminating   0          7m3s
runner-pool-pod-czch4                             2/2     Running       0          9m4s
runner-pool-pod-f7xbf                             2/2     Terminating   0          10m
runner-pool-pod-zzk5r                             2/2     Running       0          11m
NAME          CURRENTPOOLSIZE
runner-pool   -14

NAME                                              READY   STATUS        RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running       0          38m
runner-pool-pod-4jd64                             2/2     Running       0          38m
runner-pool-pod-czch4                             2/2     Running       0          9m20s
runner-pool-pod-f7xbf                             0/2     Terminating   0          10m
runner-pool-pod-zzk5r                             2/2     Running       0          11m
NAME          CURRENTPOOLSIZE
runner-pool   -20

NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running   0          39m
runner-pool-pod-4jd64                             2/2     Running   0          38m
runner-pool-pod-czch4                             2/2     Running   0          9m41s
runner-pool-pod-zzk5r                             2/2     Running   0          11m
NAME          CURRENTPOOLSIZE
runner-pool   -20

NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running   0          39m
runner-pool-pod-4jd64                             2/2     Running   0          39m
runner-pool-pod-686gg                             2/2     Running   0          5s
runner-pool-pod-czch4                             2/2     Running   0          10m
runner-pool-pod-zzk5r                             2/2     Running   0          12m
NAME          CURRENTPOOLSIZE
runner-pool   4

NAME                                              READY   STATUS        RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running       0          41m
runner-pool-pod-4jd64                             0/2     Terminating   0          41m
runner-pool-pod-686gg                             2/2     Terminating   0          2m7s
runner-pool-pod-czch4                             0/2     Terminating   0          12m
runner-pool-pod-zzk5r                             0/2     Terminating   0          14m
NAME          CURRENTPOOLSIZE
runner-pool   -11

NAME                                              READY   STATUS        RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running       0          42m
runner-pool-pod-4jd64                             0/2     Terminating   0          41m
NAME          CURRENTPOOLSIZE
runner-pool   -11

NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running   0          42m
NAME          CURRENTPOOLSIZE
runner-pool   -11

NAME                                              READY   STATUS    RESTARTS   AGE
github-actions-runner-operator-5fb9974c44-bdqfj   1/1     Running   0          42m
runner-pool-pod-5vs54                             2/2     Running   0          5s
NAME          CURRENTPOOLSIZE
runner-pool   1

and an error when trying to deregister the runner:

10m         Normal    Created             pod/runner-pool-pod-4rzcw        Created container docker
10m         Normal    Started             pod/runner-pool-pod-4rzcw        Started container docker
9m52s       Normal    Killing             pod/runner-pool-pod-4rzcw        Stopping container docker
9m52s       Normal    Killing             pod/runner-pool-pod-p2wrz        Stopping container docker
9m51s       Normal    Scaling             githubactionrunner/runner-pool   Deleted pod saas-github/runner-pool-pod-p2wrz
9m52s       Normal    Killing             pod/runner-pool-pod-p2wrz        Stopping container runner
9m50s       Normal    Scaling             githubactionrunner/runner-pool   Deleted pod saas-github/runner-pool-pod-5lpp7
9m52s       Normal    Scaling             githubactionrunner/runner-pool   Deleted pod saas-github/runner-pool-pod-4rzcw
9m52s       Normal    Killing             pod/runner-pool-pod-5lpp7        Stopping container docker
9m52s       Normal    Killing             pod/runner-pool-pod-5lpp7        Stopping container runner
9m52s       Normal    Killing             pod/runner-pool-pod-4rzcw        Stopping container runner
9m50s       Warning   FailedPreStopHook   pod/runner-pool-pod-p2wrz        Exec lifecycle hook ([/bin/bash -c /remove_runner.sh]) for Container "runner" in Pod "runner-pool-pod-p2wrz_saas-github(d1de119c-e3fc-11ea-8e34-42010a8400d1)" failed - error: command '/bin/bash -c /remove_runner.sh' exited with 1: Failed: Removing runner from the server
Runner "runner-pool-pod-p2wrz" is still running a job"
, message: "\n# Runner removal\n\nFailed: Removing runner from the server\nRunner \"runner-pool-pod-p2wrz\" is still running a job\"\n"
8m52s       Normal    Killing             pod/runner-pool-pod-q9q8m        Stopping container runner
8m50s       Normal    Scaling             githubactionrunner/runner-pool   Deleted pod saas-github/runner-pool-pod-q9q8m
8m52s       Normal    Killing             pod/runner-pool-pod-q9q8m        Stopping container docker
davidkarlsen commented 4 years ago

Can you try image.tag=v0.5.1-6-g73df6ea(https://quay.io/repository/evryfs/github-actions-runner-operator?tag=v0.5.1-6-g73df6ea&tab=tags) I have some preventive actions, although it is not bullet proof as explained in https://github.com/evryfs/github-actions-runner-operator/blob/master/README.md#L32

GitHub
evryfs/github-actions-runner-operator
K8S operator for scheduling github actions runner pods - evryfs/github-actions-runner-operator
andypanix commented 4 years ago

Hi @davidkarlsen I see there are more recent images than v0.5.1-6-g73df6ea, should I still stay on 0.5.1-6 or try the latest one?

Thanks A.

davidkarlsen commented 4 years ago

I used some duct tape sleep to fix it. Did a release and chart-update.

andypanix commented 4 years ago

Great! ;)

And from a first test, I can confirm that from the problem is resolved.

Thanks a lot! A.