actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Support DinD in user-specified "container:" in jobs #2967

Closed zxti closed 7 months ago

zxti commented 1 year ago

What would you like added?

I'm able to run ARC with docker fine using containerMode: dind.

But I'd also like to enable my users to specify container: in Actions workflow jobs, and start docker builds etc. from there.

My understanding is that specifying container: on an Actions workflow job will cause the actions-runner container in the pod to spin up another Docker container. But inside that container, there's no socket:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
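For context, the kind of workflow job I mean looks roughly like this (just a sketch; the scale set name and image are placeholders, and the image is assumed to ship the docker CLI):

jobs:
  build:
    runs-on: arc-runner-set            # placeholder runner scale set name
    container:
      image: docker:cli                # user-chosen image with the docker CLI
    steps:
      - run: docker build -t my-app .  # fails: cannot connect to /var/run/docker.sock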

Why is this needed?

Users want to run their own containers, and performing Docker builds etc. is a common CI/CD task.

Today, DevOps has to act as a central bottleneck, configuring runner sets with these specific containers and making sure they work with containerMode: dind.

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic commented 1 year ago

Hey @zxti,

Please correct me if I misunderstood, but you can provide a volume in your workflow that mounts the Docker socket into your container.
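Something along these lines in the job definition, for example (just a sketch; the image name is a placeholder and the source path depends on where the socket actually lives in your setup):

container:
  image: my-image:latest                              # placeholder image
  volumes:
    # source:destination; adjust the source to wherever the socket is exposed
    - /var/run/docker.sock:/var/run/docker.sock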

cniessigma commented 1 year ago

I believe the runner hardcodes a specific path to the Docker socket when running a container job. As far as I can tell, they don't let us easily override this without writing our own container hook, which is very heavy.

Is the alternative to change every one of our workflow files to manually mount the docker.sock into the job container (by default, the dind container mode seems to mount it from the host into /run/docker/docker.sock in the runner container instead of /var/run/docker.sock)? For example:

container:
  image: <image>
  options: -v "/run/docker/docker.sock:/var/run/docker.sock"

To me that seems like quite a leak of implementation details to the consumer of the scale sets. Perhaps I'm misunderstanding the suggestion?

EDIT: To be clear, I am able to launch the job with a Docker container just fine; what I'm not able to do is run a docker build within that container job, because it can't find the Docker socket.

piotrhryszko-img commented 1 year ago

Also interested in this, as I'm facing a similar issue. I use a shared GitHub Action in my workflow:

    - name: Run Trivy vulnerability scanner with exclusions
      uses: aquasecurity/trivy-action@0.12.0
      with:
        image-ref: ${{ inputs.docker_image }}
        format: 'table'
        exit-code: '1'
        ignore-unfixed: true
        trivyignores: ${{ inputs.trivy_ignore_path }}
        severity: 'CRITICAL,HIGH'
        timeout: 15m

which executes the following command

/usr/bin/docker run --name b1cbc5785fc65dd52a4c82a2774efc8b669fef_1185fa --label b1cbc5 --workdir /github/workspace --rm -e "GOOS" -e "GOARCH" -e "GOPRIVATE" -e "CGO_ENABLED" -e "INPUT_IMAGE-REF" -e "INPUT_FORMAT" -e "INPUT_EXIT-CODE" -e "INPUT_IGNORE-UNFIXED" -e "INPUT_SEVERITY" -e "INPUT_TIMEOUT" -e "INPUT_SCAN-TYPE" -e "INPUT_INPUT" -e "INPUT_SCAN-REF" -e "INPUT_VULN-TYPE" -e "INPUT_TEMPLATE" -e "INPUT_OUTPUT" -e "INPUT_SKIP-DIRS" -e "INPUT_SKIP-FILES" -e "INPUT_CACHE-DIR" -e "INPUT_IGNORE-POLICY" -e "INPUT_HIDE-PROGRESS" -e "INPUT_LIST-ALL-PKGS" -e "INPUT_SECURITY-CHECKS" -e "INPUT_TRIVYIGNORES" -e "INPUT_ARTIFACT-TYPE" -e "INPUT_GITHUB-PAT" -e "INPUT_TRIVY-CONFIG" -e "HOME" -e "GITHUB_JOB" -e "GITHUB_REF" -e "GITHUB_SHA" -e "GITHUB_REPOSITORY" -e "GITHUB_REPOSITORY_OWNER" -e "GITHUB_REPOSITORY_OWNER_ID" -e "GITHUB_RUN_ID" -e "GITHUB_RUN_NUMBER" -e "GITHUB_RETENTION_DAYS" -e "GITHUB_RUN_ATTEMPT" -e "GITHUB_REPOSITORY_ID" -e "GITHUB_ACTOR_ID" -e "GITHUB_ACTOR" -e "GITHUB_TRIGGERING_ACTOR" -e "GITHUB_WORKFLOW" -e "GITHUB_HEAD_REF" -e "GITHUB_BASE_REF" -e "GITHUB_EVENT_NAME" -e "GITHUB_SERVER_URL" -e "GITHUB_API_URL" -e "GITHUB_GRAPHQL_URL" -e "GITHUB_REF_NAME" -e "GITHUB_REF_PROTECTED" -e "GITHUB_REF_TYPE" -e "GITHUB_WORKFLOW_REF" -e "GITHUB_WORKFLOW_SHA" -e "GITHUB_WORKSPACE" -e "GITHUB_ACTION" -e "GITHUB_EVENT_PATH" -e "GITHUB_ACTION_REPOSITORY" -e "GITHUB_ACTION_REF" -e "GITHUB_PATH" -e "GITHUB_ENV" -e "GITHUB_STEP_SUMMARY" -e "GITHUB_STATE" -e "GITHUB_OUTPUT" -e "GITHUB_ACTION_PATH" -e "RUNNER_OS" -e "RUNNER_ARCH" -e "RUNNER_NAME" -e "RUNNER_ENVIRONMENT" -e "RUNNER_TOOL_CACHE" -e "RUNNER_TEMP" -e "RUNNER_WORKSPACE" -e "ACTIONS_RUNTIME_URL" -e "ACTIONS_RUNTIME_TOKEN" -e "ACTIONS_CACHE_URL" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/_work/_temp/_github_home":"/github/home" -v "/home/runner/_work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/_work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/_work/example/example":"/github/workspace" b1cbc5:785fc65dd52a4c82a2774efc8b669fef  "-a image" "-b table" "-c " "-d 1" "-e true" "-f os,library" "-g CRITICAL,HIGH" "-h " "-i example/example/example:e67777a6d" "-j ." "-k " "-l " "-m " "-n 15m" "-o " "-p " "-q " "-r false" "-s " "-t " "-u " "-v "

and fails with

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Apologies if this isn't related; I'm still trying to understand the flow of Actions.

zetaab commented 1 year ago

dockerd can be started using hooks:

AutoscalingRunnerSet envs

        env:
        - name: ACTIONS_RUNNER_HOOK_JOB_STARTED
          value: /home/runner/hooks/common-start.sh

common-start.sh:

#!/usr/bin/env bash
set -u

source /home/runner/hooks/logger.sh
source /home/runner/hooks/wait.sh

log.debug 'Starting Docker daemon'
sudo /usr/bin/dockerd &

log.debug 'Waiting for processes to be running...'
processes=(dockerd)

for process in "${processes[@]}"; do
    if ! wait_for_process "$process"; then
        log.error "$process is not running after max time"
        exit 1
    else
        log.debug "$process is running"
    fi
done

logger and wait scripts can be seen in https://github.com/actions/actions-runner-controller/blob/master/runner/wait.sh https://github.com/actions/actions-runner-controller/blob/master/runner/logger.sh

So: 1 env variable and 3 files need to be injected. It's up to the user how that is done: a ConfigMap, a new image(?). We are building our own image, which contains, for instance, cached actions and settings like this.

PS: GitHub's default image is missing quite a few dependencies needed to get Docker started. At least fuse-overlayfs and iptables are needed for that.
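A custom image could add them roughly like this (just a sketch; the exact package set and base tag are assumptions, and dockerd itself also has to be present in the image):

FROM ghcr.io/actions/actions-runner:latest
USER root
# extra packages needed to get dockerd running inside the runner container
RUN apt-get update \
 && apt-get install -y --no-install-recommends fuse-overlayfs iptables \
 && rm -rf /var/lib/apt/lists/*
USER runner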

Oh forgot one thing:

        securityContext:
          privileged: true # needed for dockerd

that is needed as well
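If going the ConfigMap route, something along these lines could work (just a sketch; names are placeholders, and logger.sh and wait.sh would need to be shipped in the same ConfigMap or baked into the image):

apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-hooks                  # placeholder name
data:
  common-start.sh: |
    #!/usr/bin/env bash
    # abbreviated version of the script above; the wait loop is omitted here
    set -u
    sudo /usr/bin/dockerd &

and then mounted into the runner container of the AutoscalingRunnerSet template:

        env:
        - name: ACTIONS_RUNNER_HOOK_JOB_STARTED
          value: /home/runner/hooks/common-start.sh
        securityContext:
          privileged: true # needed for dockerd
        volumeMounts:
        - name: hooks
          mountPath: /home/runner/hooks
      volumes:
      - name: hooks
        configMap:
          name: runner-hooks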

cniessigma commented 1 year ago

> dockerd can be started using hooks: […]

Making a hook is exactly what I needed to get around this, using a modified version of GitHub's published example, though I am still a bit curious why the upstream runner project has the Docker socket path hardcoded. The hook works but is somewhat undesirable: when the workflow fails for any reason, it spits out a "Please contact your self-hosted administrator" message, and I haven't found a way to get rid of that error.

I'd rather not use a hook if all that's needed is an upstream change to be able to override that hardcoded string with an environment variable.

zetaab commented 1 year ago

Yeah, I do not understand why Docker is not made available automatically. It is available if you use, for example, ubuntu-latest, so why is it not included in the runner image?

nikola-jokic commented 11 months ago

Hey everyone,

Until this is resolved, may I suggest a workaround for this particular use case: comment out the entire containerMode object (or leave containerMode.type empty), and provide a dind spec as a side-car container as described here.

The containerMode object does not influence the controller in any way. It just expands the Helm template, making it more convenient to specify a dind configuration in most cases.

genisd commented 11 months ago

@nikola-jokic I think your link for the spec should not refer to the SHA abc0b678d323b but to a newer one (or master). The SHA referenced might be misleading, since it says there:

  ##         - name: DOCKER_HOST
  ##           value: tcp://localhost:2376

I would think that the current master might be better here; link with the SHA at the time of writing this comment: here.

genisd commented 10 months ago

We just did the migration to the new runner scale sets. Basically, what is important is to have the Docker socket available at /var/run/docker.sock; there are implicit assumptions which expect it to be there, and DOCKER_HOST is not always adhered to. One example I can give is a step/workflow which does the following:

      "uses": "docker://example.com/dockerrepo/image:tag"

This is one of our deployments; it's an in-memory configuration that you can take inspiration from, I think. The important part is to share /var/run/ between dind and the runner:

template:
  spec:
    nodeSelector:
      type: shared-16core
    containers:
      - command:
          - /home/runner/run.sh
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
          - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
            value: "120"
        image: eu.gcr.io/unicorn-985/docker-images_actions-runner:v1
        name: runner
        resources:
          limits:
            cpu: "4"
            memory: 6Gi
          requests:
            cpu: 1700m
            memory: 5Gi
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/run
            name: dind-sock
      - args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        image: docker:dind
        name: dind
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /home/runner/_work
            name: work
          - mountPath: /var/run
            name: dind-sock
          - mountPath: /home/runner/externals
            name: dind-externals
          - mountPath: /var/lib/docker
            name: dind-scratch
    initContainers:
      - args:
          - -r
          - -v
          - /home/runner/externals/.
          - /home/runner/tmpDir/
        command:
          - cp
        image: eu.gcr.io/unicorn-985/docker-images_actions-runner:v1
        name: init-dind-externals
        resources: {}
        volumeMounts:
          - mountPath: /home/runner/tmpDir
            name: dind-externals
    restartPolicy: Never
    volumes:
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir:
          medium: Memory
      - name: dind-scratch
        emptyDir:
          medium: Memory
      - name: work
        emptyDir:
          medium: Memory
      - name: tmp
        emptyDir:
          medium: Memory
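As a quick sanity check that the socket is actually shared, a hypothetical smoke-test job would be (not part of our deployment; the scale set name is a placeholder):

jobs:
  docker-smoke-test:
    runs-on: arc-runner-set        # placeholder runner scale set label
    container:
      image: docker:cli            # any image that ships the docker CLI
    steps:
      - run: docker info           # should reach the daemon via the shared /var/run/docker.sock
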
nikola-jokic commented 10 months ago

Hey everyone, yes, the out-of-the-box socket is not positioned in the same place where the runner expects it. Unfortunately, for now, please expand the dind spec by hand. :disappointed:

genisd commented 10 months ago

Should we update the README/docs to add this improvement? I think the current dind spec expansion example is not what people expect (this was at least the case when I implemented our migration); i.e., the reporter here expected their Docker plugin to work.

ananthu1834 commented 10 months ago

> the important part is to share /var/run/ between dind and the runner

Hello everyone. I changed "/run/docker" to "/var/run" in the dind template everywhere, but that caused the runner pods to fail immediately with a "StartError". I could not find any useful logs on why this happened from any ARC-related resource; the EphemeralRunner just showed the reason as "Pod has failed to start more than 5 times". Any idea what I could be missing here? (I'm using a custom runner image that just adds a few libraries on top of the default runner image.)

genisd commented 9 months ago

@ananthu1834 Sounds like the init container gave a non-zero exit code. You should be able to get hints as to why in the logs of that startup process.

ananthu1834 commented 9 months ago

Thank you @genisd. But I did check the init container; it threw a lot of logs, all related to files being copied between directories. Besides, the dind main container did run and threw the following logs before terminating, suggesting that 1) the init container passed, and 2) the dind container successfully started the Docker daemon but got a termination signal from outside?

(screenshot of the dind container logs omitted)

Also, one more observation: this problem happens only when I use the path /var/run/docker.sock. Any other path there seems to work, which suggests something unique to this default path is causing the termination. Sort of stuck here at the moment; will try digging deeper.

genisd commented 9 months ago

Just to be sure, you're sharing the /var/run/ directory, not /var/run/docker.sock (like in my example above)? I tried sharing only the socket myself and that didn't work for me either, though if I recall correctly I did get conclusive errors when trying that.

You could test my example code. I think only the nodeSelector and the runner image are specific to our environment (we simply mirror the official image so that we don't have to pull it from GitHub; we do that because when GitHub Actions has hiccups, the official image repo gets overwhelmed).

ananthu1834 commented 9 months ago

Ah, the example code you provided worked! (Of course, without the nodeSelector and with our own custom image and secrets.) Digging deeper and isolating the difference, I found that I had added readOnly: true for the dind-sock volume on the runner container, as mentioned here. I had just re-used that template with the changes I needed, like a custom image and some secrets.

Still not sure why the above change worked, but now the runner container is starting up and running as expected. Thank you @genisd for your help :)
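For reference, the only change I ended up making was on the runner container's mount (just a sketch of the relevant lines):

volumeMounts:
  - name: dind-sock
    mountPath: /var/run
    # readOnly: true   <- this is the line I had copied from the docs and then removed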

genisd commented 9 months ago

We should update that example template, I think. It's not yielding the behavior that people expect and is therefore not a good baseline for customizing one's own environment.

YvesZelros commented 9 months ago

@genisd Thanks for sharing your setup, it helped me fix the same issue.

Wouldn't it be better to use a host volume for your dind-scratch volume instead of a memory volume, so cached layers can be re-used? Something like the sketch below, for example.
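Something like this, I mean (just a sketch; the host path is a placeholder):

volumes:
  - name: dind-scratch
    hostPath:
      path: /var/lib/arc-dind-cache     # placeholder path on the node
      type: DirectoryOrCreate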

My final configuration (the securityContext blocks are optional):

githubConfigUrl: https://github.com/<my org>
githubConfigSecret: <secret name with app id & private key>
maxRunners: 28
controllerServiceAccount:
   namespace: gha-runner
   name: gha-runner-gha-rs-controller
listenerTemplate:
  spec:
    securityContext:
      runAsNonRoot: true
      runAsUser: 1001
      runAsGroup: 123
      seccompProfile:
        type: RuntimeDefault
    containers:
    - name: listener
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
        capabilities:
          drop:
            - ALL
template:
  spec:
    securityContext:
      fsGroup: 123
      seccompProfile:
        type: RuntimeDefault
    restartPolicy: Never
    volumes:
      - name: work
        emptyDir:
          medium: Memory
          sizeLimit: "4Gi"
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir:
          medium: Memory
    initContainers:
    - name: init-dind-externals
      image: ghcr.io/actions/actions-runner:2.311.0
      command: ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
      volumeMounts:
        - name: dind-externals
          mountPath: /home/runner/tmpDir
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
        capabilities:
          drop:
            - ALL
    containers:
    - name: runner
      image:  ghcr.io/actions/actions-runner:2.311.0
      command: ["/home/runner/run.sh"]
      env:
       - name: DOCKER_HOST
         value: unix:///var/run/docker.sock
       - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
         value: "120"
      volumeMounts:
      - name: work
        mountPath: /home/runner/_work
      - name: dind-sock
        mountPath: /var/run
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
        capabilities:
          drop:
            - ALL
    - name: dind
      image: docker:dind
      args:
       - dockerd
       - --host=unix:///var/run/docker.sock
       - --group=$(DOCKER_GROUP_GID)
      env:
      - name: DOCKER_GROUP_GID
        value: "123"
      securityContext:
        privileged: true
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
        - name: dind-sock
          mountPath: /var/run
        - name: dind-externals
          mountPath: /home/runner/externals

genisd commented 9 months ago

I agree that the in-memory configuration is not for everyone and should not be what ends up in the README.

I would say that your example, stripped to the bare minimum, is what should be in the documentation as the go-to baseline template.

ahatzz11 commented 8 months ago

Our organization also ran into this issue on 0.8.2. We have a workflow with:

container: azul/zulu-openjdk:17-latest

The job that runs in this container then spins up other containers with Testcontainers. The initialization of these containers would fail because Docker couldn't be found. We then used:

- name: DOCKER_HOST
  value: unix:///run/docker/docker.sock

instead of:

- name: DOCKER_HOST
  value: unix:///var/run/docker.sock

We also removed readOnly: true. After these changes, Docker inside the container can be accessed properly. Thank you @YvesZelros!

+1 to this being updated in the README as well as in the default values.yaml file.
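For reference, these env and volume entries live on the runner container in the scale set template, roughly like this (a sketch pieced together from the examples above; the image tag and mount path are assumptions matching the default dind mode socket location):

template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest   # assumed official runner image
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker        # note: no readOnly: true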