actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Docker container in dind containerMode cannot connect to GitHub #3691

Open duchuyvp opened 1 month ago

duchuyvp commented 1 month ago

Controller Version

0.9.3

Deployment Method

ArgoCD

To Reproduce

1. Deploy the `gha-runner-scale-set-controller` chart with default values, then deploy the `gha-runner-scale-set` chart with release name `arc-runner-set`.
   1.1 At this point, GitHub Actions works for a simple workflow file.
2. Exec into the `runner` container in the `arc-runner-set-****-runner-****` pod.
3. Run `sudo apt update && sudo apt install git -y && git clone https://github.com/actions/actions-runner-controller.git` to confirm the pod can reach the public internet.
4. Run `docker run --rm -it alpine sh -c "apk add git && git clone https://github.com/actions/actions-runner-controller.git"`.

Describe the bug

Output from step 4:

fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/community/x86_64/APKINDEX.tar.gz
(1/13) Installing ca-certificates (20240705-r0)
(2/13) Installing brotli-libs (1.1.0-r2)
(3/13) Installing c-ares (1.28.1-r0)
(4/13) Installing libunistring (1.2-r0)
(5/13) Installing libidn2 (2.3.7-r0)
(6/13) Installing nghttp2-libs (1.62.1-r0)
(7/13) Installing libpsl (0.21.5-r1)
(8/13) Installing zstd-libs (1.5.6-r0)
(9/13) Installing libcurl (8.9.0-r0)
(10/13) Installing libexpat (2.6.2-r0)
(11/13) Installing pcre2 (10.43-r0)
(12/13) Installing git (2.45.2-r0)
(13/13) Installing git-init-template (2.45.2-r0)
Executing busybox-1.36.1-r29.trigger
Executing ca-certificates-20240705-r0.trigger
OK: 20 MiB in 27 packages
Cloning into 'actions-runner-controller'...
fatal: unable to access 'https://github.com/actions/actions-runner-controller.git/': SSL connection timeout

Describe the expected behavior

The `docker run` command above should complete successfully, without the SSL connection timeout error.

Additional Context

The YAML manifests I am using to deploy `gha-runner-scale-set-controller` and `gha-runner-scale-set`:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    targetRevision: 0.9.3
    chart: gha-runner-scale-set-controller
    helm:
      releaseName: arc
  destination:
    name: in-cluster
    namespace: arc-systems
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=false
      - ServerSideApply=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  revisionHistoryLimit: 3
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: arc-runner-set
  namespace: argocd
spec:
  project: default
  destination:
    name: in-cluster
    namespace: arc-runners
  syncPolicy:
    automated:
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - ServerSideApply=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  revisionHistoryLimit: 3

  source:
    repoURL: ghcr.io/actions/actions-runner-controller-charts
    targetRevision: 0.9.3
    chart: gha-runner-scale-set
    helm:
      releaseName: arc-runner-set
      parameters:
        - name: controllerServiceAccount.namespace
          value: arc-systems
        - name: controllerServiceAccount.name
          value: arc-gha-rs-controller
        - name: githubConfigUrl
          value: https://github.com/<organization>
        - name: minRunners
          value: "5"
        - name: containerMode.type
          value: dind
        - name: githubConfigSecret
          value: github-app-secret

Controller Logs

https://gist.github.com/duchuyvp/9b626aec67926976f09c52d303becd1a

Runner Pod Logs

These are the logs from pushing this workflow file:

name: Reproduce

on:
  push:
    branches: ['*']

jobs:
  push-reproduce:
    runs-on: arc-runner-set

    steps:
      - run: sudo apt update && sudo apt install git -y
      - run: git clone https://github.com/actions/actions-runner-controller.git
      - run: docker run --rm alpine sh -c "apk add git && git clone https://github.com/actions/actions-runner-controller.git"

https://gist.github.com/duchuyvp/6a5db187bfb3657a5361bcf62b0bd4ef

norman-zon commented 1 month ago

@duchuyvp, do you happen to run the deployment on GKE?

duchuyvp commented 1 month ago

@norman-zon I haven't tested on GKE; I deployed on-premises.

norman-zon commented 1 month ago

Try setting MTU for the docker daemon like:

- name: dind
  image: docker:dind
  args:
    - dockerd
    - --host=unix:///var/run/docker.sock
    - --group=$(DOCKER_GROUP_GID)
    - --mtu=1460

The default docker daemon MTU is 1500, but my host network has 1460. So aligning the docker daemon MTU fixed it for me.
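
To check whether an MTU mismatch is the likely cause before changing the daemon arguments, the MTU at each layer can be compared. The workflow below is only a minimal sketch: it reuses the `arc-runner-set` label from the original report, and the workflow, job and step names are hypothetical. A nested-container MTU larger than the MTU of the runner pod's interface would be consistent with the timeouts above, since large TLS handshake packets can then be dropped along the path.

```yaml
name: Check MTU

on:
  workflow_dispatch:

jobs:
  check-mtu:
    runs-on: arc-runner-set
    steps:
      # MTU of every interface visible inside the runner container (pod network).
      - name: Runner pod interfaces
        run: grep . /sys/class/net/*/mtu
      # MTU option of Docker's default bridge network, as reported by the dind daemon.
      - name: Default bridge network
        run: |
          docker network inspect bridge \
            --format '{{ index .Options "com.docker.network.driver.mtu" }}'
      # MTU of the interfaces inside a container started by the dind daemon.
      - name: Nested container interfaces
        run: docker run --rm alpine sh -c 'grep . /sys/class/net/*/mtu'
```

If the last step reports a larger value than the first, lowering the daemon MTU to the pod interface value (or below) is the change to try.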

duchuyvp commented 1 month ago

@norman-zon Thank you so much, your idea works for me too. I patched one runner pod to add --mtu=1450 to the dind container. However, I don't know how to add this argument when deploying with Helm, since the dind container definition seems to be hard-coded in the gha-runner-scale-set chart: https://github.com/actions/actions-runner-controller/blob/a152741a1a6afa992f8d836a029d551984149c8f/charts/gha-runner-scale-set/templates/_helpers.tpl#L98-L116

Could you please show me how?

norman-zon commented 1 month ago

I ended up using the solution with a configMap as described in the discussion here.

You have to set

containerMode:
    type: none

and then completely specify the template for the container, as described in the values file.
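
For readers who want a starting point, here is a rough sketch of what such a fully specified template might look like in a values file. The dind container mirrors the chart helper quoted elsewhere in this thread, plus the `--mtu` flag. The init container, the runner container and the volume definitions are assumptions modelled on the chart's documented dind example, so check them against the values file of your chart version before use.

```yaml
# Sketch only: values for the gha-runner-scale-set chart with a hand-written dind pod template.
containerMode:
  type: none  # as suggested in the comment above, so the chart does not inject its own dind container

template:
  spec:
    initContainers:
      # Assumption: copies the runner externals into a volume shared with the dind container.
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command: ["cp", "-r", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
    containers:
      # Assumption: runner container talking to the dind daemon over the shared socket.
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
      # Same as the chart's dind helper, with the MTU flag added.
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)
          - --mtu=1460  # align with the host network MTU
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}
```

With an Argo CD Application like the one in the original report, a nested structure like this is awkward to pass through `parameters`; it would normally go under `spec.source.helm.values` as a block string instead.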

This could be easier to add to the dind container if my PR were merged...

stuio commented 1 month ago

Unfortunately this didn't solve our issue, which is ostensibly the same.

We have self-hosted runners in an on-premises OpenStack K8s cluster. For container actions that specify our own helper image (with some useful utilities installed), we cannot connect to GitHub to clone the relevant repository. We have tried both checkout actions, the GitHub CLI, and standard git with auth set up in the job.

After seeing this post, we modified the DinD container as suggested, passing the mtu argument, and verified that it was indeed being set. As a test, we followed the GP's example: cloning from the runner container after installing git succeeded, but cloning from the spawned helper container with its already installed git failed. All the tests we have conducted resulted in variations on the same theme, SSL/TLS timeout errors:

kubectl exec -it github-runner-scale-set-hello-world-cbr74-runner-jdr2z -- sh
Defaulted container "runner" out of: runner, dind, init-dind-externals (init)
$ sudo apt install git -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
<snipped>
Setting up git (1:2.46.0-0ppa1~ubuntu22.04.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...
$ git clone https://github.com/actions/actions-runner-controller.git <-- we can clone in runner container after installing git
Cloning into 'actions-runner-controller'...
remote: Enumerating objects: 12348, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 12348 (delta 11), reused 8 (delta 1), pack-reused 12321 (from 1)
Receiving objects: 100% (12348/12348), 5.44 MiB | 33.33 MiB/s, done.
Resolving deltas: 100% (8430/8430), done.
$ ls -ltr actions-runner-controller
drwxr-xr-x 23 runner runner  4096 Aug 14 06:42 actions-runner-controller
$ docker ps
CONTAINER ID   IMAGE                                                            COMMAND               CREATED              STATUS              PORTS     NAMES
cd3c11559488   ghcr.io/***/pipeline-helper:0.0.4   "tail -f /dev/null"   About a minute ago   Up About a minute             e588e3cf54e848bd99acc500aeec932e_ghcrio***pipelinehelper004_3c7f01
$ docker exec -it cd3c11559488 sh
/ # git --version <-- git already installed in container job
git version 2.45.2
/ # git clone https://github.com/actions/actions-runner-controller.git
Cloning into 'actions-runner-controller'...
fatal: unable to access 'https://github.com/actions/actions-runner-controller.git/': SSL connection timeout
Error: Process completed with exit code 128.

The specific error when using the GitHub CLI was: `error validating token: Get "https://api.github.com/": net/http: TLS handshake timeout`

noamgreen commented 4 weeks ago

@nikola-jokic Hi. I am not sure why the original Helm chart offers no way to change the DinD config, given how it is defined in the chart's _helpers.tpl:

{{- define "gha-runner-scale-set.dind-container" -}}
image: docker:dind
args:
  - dockerd
  - --host=unix:///var/run/docker.sock
  - --group=$(DOCKER_GROUP_GID)
env:
  - name: DOCKER_GROUP_GID
    value: "123"
securityContext:
  privileged: true
volumeMounts:
  - name: work
    mountPath: /home/runner/_work
  - name: dind-sock
    mountPath: /var/run
  - name: dind-externals
    mountPath: /home/runner/externals
{{- end }}

na4ma4 commented 3 weeks ago

In my values file I specified the following (along with the init and runner containers):

template:
  spec:
    containers:
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
        - --group=$(DOCKER_GROUP_GID)
        - --mtu=1400

which works for the default network, but dependabot creates its own networks with no MTU setting, so they default to 1500 and dependabot breaks.

So that would fix the auto-created networks, but it won't help if you create docker networks as part of your actions.

norman-zon commented 3 weeks ago

I ended up using the solution discussed here, writing a daemon.json ConfigMap and mounting it inside the container at /etc/docker/daemon.json.

This allows setting

"bridge": {
  "com.docker.network.driver.mtu": "1460"
}

which is also used for all networks created by actions.
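
A possible shape for that ConfigMap and its mount is sketched below. The ConfigMap name `dind-daemon-config` and the volume name `daemon-config` are hypothetical, and the namespace is taken from the original report. Wrapping the `bridge` block in `default-network-opts` is an assumption based on the Docker daemon configuration file format (the snippet above shows only the inner block), and that key is, as far as I know, only available in recent Docker Engine releases.

```yaml
# ConfigMap holding the Docker daemon configuration (name is hypothetical).
apiVersion: v1
kind: ConfigMap
metadata:
  name: dind-daemon-config
  namespace: arc-runners
data:
  daemon.json: |
    {
      "mtu": 1460,
      "default-network-opts": {
        "bridge": {
          "com.docker.network.driver.mtu": "1460"
        }
      }
    }
---
# Fragment of the runner scale set values: mount the file into the dind container.
# Other dind settings (args, env, securityContext, remaining mounts) are omitted here.
template:
  spec:
    containers:
      - name: dind
        image: docker:dind
        volumeMounts:
          - name: daemon-config
            mountPath: /etc/docker/daemon.json
            subPath: daemon.json
            readOnly: true
    volumes:
      - name: daemon-config
        configMap:
          name: dind-daemon-config
```

The top-level `mtu` key covers the default bridge, while the `default-network-opts` entry is what applies to bridge networks created later by actions.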

na4ma4 commented 3 weeks ago

I was going to post an update today: I saw that moby/moby#43197 has been merged (earlier this year or late last year), and it solves my issue via the argument --default-network-opt=bridge=com.docker.network.driver.mtu=1400.

Now, when dependabot calls the Docker API (not through a shell, so the shims don't help) to create a network for the updater container, the network has its MTU set to 1400.

template:
  spec:
    containers:
    - name: dind
      image: docker:dind
      args:
        - dockerd
        - --host=unix:///var/run/docker.sock
        - --group=$(DOCKER_GROUP_GID)
        - --mtu=1400
        - --default-network-opt=bridge=com.docker.network.driver.mtu=1400

From the dind container in the dependabot runner pod.

$ docker network inspect dependabot-job-11050-external-network

Output (cut for size):

[
    {
        "Name": "dependabot-job-11050-external-network",
        "Id": "dff4d1a3f843634c060258f5e808050ac9861ba487a0a0c677278506321374ea",
        "Created": "2024-08-20T07:10:54.585512615Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": { ... },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": { ... },
        "Options": {
            "com.docker.network.driver.mtu": "1400"
        },
        "Labels": {}
    }
]

norman-zon commented 3 weeks ago

Maybe these two options (container args and ConfigMap) should be added to the docs, considering how many reactions this issue got?