actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Stuck at "Job is waiting for a runner from 'runner-name' to come online" in DinD-mode #3485

Closed by paranerd 1 month ago

paranerd commented 2 months ago

Controller Version

0.9.1

Deployment Method

Helm

To Reproduce

1. Installed ARC as per [these instructions](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller#installing-actions-runner-controller)
1. Deployed a runner as per [those instructions](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller#installing-actions-runner-controller)
    - Basically just downloaded the official [values](https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/values.yaml) to `my-values.yaml`
    - Uncommented lines 78+79 (`containerMode`)
    - Uncommented lines 114-158 (`template.spec`)
    - Set `--values "my-values.yaml"`
- Installed via Helm
- Runner shows up in GitHub
- Running a job gets stuck in the above-mentioned state

Describe the bug

When I deploy a runner in DinD mode, the runner shows up in GitHub, but any job that targets it just gets stuck waiting.

Deploying a "regular" controller works as expected, though.

Describe the expected behavior

The DinD container should pick up available jobs and run them.

Additional Context

githubConfigUrl: ""

githubConfigSecret:
  github_token: ""

containerMode:
  type: "dind"

template:
  spec:
    initContainers:
    - name: init-dind-externals
      image: ghcr.io/actions/actions-runner:latest
      command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
      volumeMounts:
        - name: dind-externals
          mountPath: /home/runner/tmpDir
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
        - --group=$(DOCKER_GROUP_GID)
      env:
        - name: DOCKER_GROUP_GID
          value: "123"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run
        - name: dind-externals
          mountPath: /home/runner/externals
    volumes:
    - name: work
      emptyDir: {}
    - name: dind-sock
      emptyDir: {}
    - name: dind-externals
      emptyDir: {}

Controller Logs

https://gist.github.com/paranerd/d41dd1de26c3c18c67ae179f41afb67b

Runner Pod Logs

I don't have those, as the runner never even starts in the first place.


nikola-jokic commented 1 month ago

Hey @paranerd,

If you inspect the log, it says that:

2024-04-30T13:24:54Z ERROR EphemeralRunner Failed to create pod resource for ephemeral runner. {"ephemeralrunner": {"name":"arc-runner-set-docker-1-998lp-runner-9v6sv","namespace":"arc-runners-docker-1"}, "error": "Pod \"arc-runner-set-docker-1-998lp-runner-9v6sv\" is invalid: [spec.volumes[3].name: Duplicate value: \"dind-sock\", spec.volumes[4].name: Duplicate value: \"dind-externals\", spec.initContainers[1].name: Duplicate value: \"init-dind-externals\"]"}

Since you already expanded the spec, you should leave container mode commented out.
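
To illustrate the point, here is a rough sketch of the two alternatives, based only on the error above and the values file in this report. The duplicate-name errors suggest that setting `containerMode.type` to `dind` makes the chart add its own `init-dind-externals`, `dind-sock`, and `dind-externals` entries, so a hand-expanded `template.spec` collides with them. The `---` separator is only there to keep the two options apart; an actual values file would contain one of them, not both.

```yaml
# Option A: let the chart generate the DinD init container, sidecar, and volumes.
# Do not also define init-dind-externals, dind-sock, or dind-externals
# under template.spec, or the pod fails validation with duplicate names.
containerMode:
  type: "dind"
---
# Option B: keep containerMode commented out and supply the expanded spec by hand.
# containerMode:
#   type: "dind"
template:
  spec:
    # Place the full initContainers / containers / volumes block from the
    # values file in this report here, unchanged.
```

Option B is the one that applies when extra settings on the `dind` container are needed, since the hand-written spec is then the only definition of those containers.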

paranerd commented 1 month ago

Thanks for looking into this!

As it turns out, I'm having the same issue as described here.

I fixed it by removing the containerMode lines (as you suggested) and using the following specs:

template:
  spec:
    initContainers:
    - name: init-dind-externals
      image: ghcr.io/actions/actions-runner:latest
      command: ['cp', '-r', '-v', '/home/runner/externals/.', '/home/runner/tmpDir/']
      volumeMounts:
        - name: dind-externals
          mountPath: /home/runner/tmpDir
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ['/home/runner/run.sh']
      env:
        - name: DOCKER_HOST
          value: unix:///run/docker/docker.sock
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
          readOnly: true
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///run/docker/docker.sock
        - --group=$(DOCKER_GROUP_GID)
      env:
        - name: DOCKER_GROUP_GID
          value: '123'
        - name: DOCKER_IPTABLES_LEGACY
          value: '1'
      resources:
        requests:
          memory: "500Mi"
          cpu: "300m"
        limits:
          memory: "500Mi"
          cpu: "300m"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /run/docker
        - name: dind-externals
          mountPath: /home/runner/externals
    volumes:
    - name: work
      emptyDir: {}
    - name: dind-sock
      emptyDir: {}
    - name: dind-externals
      emptyDir: {}

with an emphasis on

- name: DOCKER_IPTABLES_LEGACY
  value: '1'

which seems to be the main fix.

nikola-jokic commented 1 month ago

Thank you for letting us know! Legacy iptables seems to be a problem on some platforms, but I'm just not sure whether it should be part of the default spec that we expand to :confused: