actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Runners created with actions-runner-controller: many pods fail with the error "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?" #3257

Open viniciusesteter opened 9 months ago

viniciusesteter commented 9 months ago

Checks

Controller Version

latest

Helm Chart Version

0.27.6

CertManager Version

1.13.1

Deployment Method

Helm

cert-manager installation

Installed correctly as a Chart.yaml dependency

Checks

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1 
kind: RunnerDeployment
metadata:
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  name: {{ .Values.runnerDeploymentDev.name }}
  namespace: {{ .Release.Namespace }}
  {{- end  }}
  {{- if hasSuffix "-prd" .Release.Namespace }}
  name: {{ .Values.runnerDeploymentPrd.name }}
  namespace: {{ .Release.Namespace }}
  {{- end }}
spec:
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  replicas: {{ .Values.runnerDeploymentDev.replicas }}
  {{- end }}
  {{- if hasSuffix "-prd" .Release.Namespace }}
  replicas: {{ .Values.runnerDeploymentPrd.replicas }}
  {{- end }}
  template:
    spec:
      {{- if hasSuffix "-dev" .Release.Namespace  }}
      image: {{ .Values.runnerDeploymentDev.image }} ## Change to the DEV repository
      {{- end }}
      {{- if hasSuffix "-prd" .Release.Namespace }}
      image: {{ .Values.runnerDeploymentPrd.image }} ## Change to the PRD repository
      {{- end }}
      organization: company-a
      {{- if hasSuffix "-dev" .Release.Namespace  }}
      labels:
        {{- range .Values.runnerDeploymentDev.labels }}
        {{ "-" }} {{ . }}
        {{- end }}
      {{- end }}
      {{- if hasSuffix "-prd" .Release.Namespace }}
      labels:
        {{- range .Values.runnerDeploymentPrd.labels }}
        {{ "-" }} {{ . }}
        {{- end }}
      {{- end }}
      env:
        - name: teste
          {{- if hasSuffix "-dev" .Release.Namespace  }}
          value: a
          {{- end }}
          {{- if hasSuffix "-prd" .Release.Namespace  }}
          value: b
          {{- end }}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  name: {{ .Values.HpaDev.name }}
  namespace: {{ .Release.Namespace }}
  {{- end }}
  {{- if hasSuffix "-prd" .Release.Namespace }} 
  name: {{ .Values.HpaPrd.name }}
  namespace: {{ .Release.Namespace }}
  {{- end }}
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    {{- if hasSuffix "-dev" .Release.Namespace  }}
    name: {{ .Values.HpaDev.nameRunner }}
    {{- end }}
    {{- if hasSuffix "-prd" .Release.Namespace }} 
    name: {{ .Values.HpaPrd.nameRunner }}
    {{- end }}
  {{- if hasSuffix "-dev" .Release.Namespace  }}
  minReplicas: {{ .Values.HpaDev.minReplicas }}
  maxReplicas: {{ .Values.HpaDev.maxReplicas }}
  scaleDownDelaySecondsAfterScaleOut: {{ .Values.HpaDev.scaleDownDelaySecondsAfterScaleOut }}
  metrics:
  - type: {{ .Values.HpaDev.type }}
    scaleUpThreshold: '{{ .Values.HpaDev.scaleUpThreshold }}'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '{{ .Values.HpaDev.scaleDownThreshold }}'  # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpAdjustment: {{ .Values.HpaDev.scaleUpAdjustment }}        # The scale up runner count added to desired count
    scaleDownAdjustment: {{ .Values.HpaDev.scaleDownAdjustment }}     # The scale down runner count subtracted from the desired count
    # We can use either the Factor or the Adjustment parameters above, but not both at the same time.
    # scaleUpFactor: {{ .Values.HpaDev.scaleUpFactor }}        # The multiplier applied to the desired count when scaling up
    # scaleDownFactor: {{ .Values.HpaDev.scaleDownFactor }}     # The multiplier applied to the desired count when scaling down

  {{- end }}
  {{- if hasSuffix "-prd" .Release.Namespace }}
  minReplicas: {{ .Values.HpaPrd.minReplicas }}
  maxReplicas: {{ .Values.HpaPrd.maxReplicas }}
  scaleDownDelaySecondsAfterScaleOut: {{ .Values.HpaPrd.scaleDownDelaySecondsAfterScaleOut }}
  metrics:
  - type: {{ .Values.HpaPrd.type }}
    scaleUpThreshold: '{{ .Values.HpaPrd.scaleUpThreshold }}'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '{{ .Values.HpaPrd.scaleDownThreshold }}'  # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpAdjustment: {{ .Values.HpaPrd.scaleUpAdjustment }}       # The scale up runner count added to desired count
    scaleDownAdjustment: {{ .Values.HpaPrd.scaleDownAdjustment }}     # The scale down runner count subtracted from the desired count
    # We can use either the Factor or the Adjustment parameters above, but not both at the same time.
    # scaleUpFactor: {{ .Values.HpaPrd.scaleUpFactor }}       # The multiplier applied to the desired count when scaling up
    # scaleDownFactor: {{ .Values.HpaPrd.scaleDownFactor }}     # The multiplier applied to the desired count when scaling down
  {{- end }}

To Reproduce

1. Create a runner
2. Watch runner container logs

Describe the bug

I'm using GKE version 1.26.10-gke.1101000. In my Dockerfile, I'm using FROM summerwind/actions-runner:latest.

In values.yaml, I'm using:

image:
  repository: "summerwind/actions-runner-controller"
  actionsRunnerRepositoryAndTag: "summerwind/actions-runner:latest"
  dindSidecarRepositoryAndTag: "docker:dind"
  pullPolicy: IfNotPresent
  # The default image-pull secrets name for self-hosted runner container.
  # It's added to spec.ImagePullSecrets of self-hosted runner pods. 
  actionsRunnerImagePullSecrets: []

But once the deployment is done in GKE, I get a lot of pods with the error: "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?"

The pods keep restarting with an error in the "docker" container showing this message: "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?". The container dies and a new one starts with the same problem.

I've already followed issue 2490, but it doesn't work.

Could you help me, please?

Describe the expected behavior

The runners should not hit this error and should run normally.

Whole Controller Logs

There are no errors in the controller logs.

Whole Runner Pod Logs

In the pods I get the same error: "Cannot connect to the Docker daemon at unix:///run/docker.sock. Is the docker daemon running?"

I've tried to change $DOCKER_HOST to DOCKER_HOST="tcp://localhost:2375", but when I open a shell in a runner that does start and run echo $DOCKER_HOST, the response is still unix:///run/docker.sock. I don't think this is the cause of the error.
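For what it's worth, either the controller or the runner entrypoint normally sets DOCKER_HOST for the pod itself, which is why a value exported elsewhere still shows up as unix:///run/docker.sock. If you want to experiment with overriding it, the env block of the RunnerDeployment template shown earlier is where pod-level environment is defined. A minimal sketch, purely illustrative: the tcp://localhost:2375 value is an assumption and would additionally require a dind sidecar actually listening on that port, which is not the default setup.

template:
  spec:
    env:
      - name: DOCKER_HOST
        value: tcp://localhost:2375   # illustrative only; the default ARC setup uses the unix socket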

Additional Context

No response

github-actions[bot] commented 9 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

asafhm commented 9 months ago

I think I'm facing the same issue as well. It happened to me a month ago and went away on its own, but I couldn't figure it out. The docker containers in the runner pods all emit these logs:

cat: can't open '/proc/net/arp_tables_names': No such file or directory
iptables v1.8.10 (nf_tables)
time="2024-02-02T14:36:08.349688382Z" level=info msg="Starting up"
time="2024-02-02T14:36:08.350965386Z" level=info msg="containerd not running, starting managed containerd"
time="2024-02-02T14:36:08.351685472Z" level=info msg="started new containerd process" address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=30
time="2024-02-02T14:36:08.373648316Z" level=info msg="starting containerd" revision=7c3aca7a610df76212171d200ca3811ff6096eb8 version=v1.7.13
time="2024-02-02T14:36:08.392594430Z" level=info msg="loading plugin \"io.containerd.event.v1.exchange\"..." type=io.containerd.event.v1
time="2024-02-02T14:36:08.392636213Z" level=info msg="loading plugin \"io.containerd.internal.v1.opt\"..." type=io.containerd.internal.v1
time="2024-02-02T14:36:08.392898621Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
time="2024-02-02T14:36:08.392917009Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.392969909Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.blockfile\"..." error="no scratch file generator: skip plugin" type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.392983451Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.devmapper\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.392992588Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
time="2024-02-02T14:36:08.393000482Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.native\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.393063886Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.overlayfs\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.393264992Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.397803212Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.aufs\"..." error="aufs is not supported (modprobe aufs failed: exit status 1 \"ip: can't find device 'aufs'\\nmodprobe: can't change directory to '/lib/modules': No such file or directory\\n\"): skip plugin" type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.397832591Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.zfs\"..." type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.398031842Z" level=info msg="skip loading plugin \"io.containerd.snapshotter.v1.zfs\"..." error="path /var/lib/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
time="2024-02-02T14:36:08.398046545Z" level=info msg="loading plugin \"io.containerd.content.v1.content\"..." type=io.containerd.content.v1
time="2024-02-02T14:36:08.398150598Z" level=info msg="loading plugin \"io.containerd.metadata.v1.bolt\"..." type=io.containerd.metadata.v1
time="2024-02-02T14:36:08.398212774Z" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
time="2024-02-02T14:36:08.398230252Z" level=info msg="metadata content store policy set" policy=shared
time="2024-02-02T14:36:08.445396814Z" level=info msg="loading plugin \"io.containerd.gc.v1.scheduler\"..." type=io.containerd.gc.v1
time="2024-02-02T14:36:08.445471715Z" level=info msg="loading plugin \"io.containerd.differ.v1.walking\"..." type=io.containerd.differ.v1
time="2024-02-02T14:36:08.445500464Z" level=info msg="loading plugin \"io.containerd.lease.v1.manager\"..." type=io.containerd.lease.v1
time="2024-02-02T14:36:08.445576869Z" level=info msg="loading plugin \"io.containerd.streaming.v1.manager\"..." type=io.containerd.streaming.v1
time="2024-02-02T14:36:08.445618783Z" level=info msg="loading plugin \"io.containerd.runtime.v1.linux\"..." type=io.containerd.runtime.v1
time="2024-02-02T14:36:08.445781283Z" level=info msg="loading plugin \"io.containerd.monitor.v1.cgroups\"..." type=io.containerd.monitor.v1
time="2024-02-02T14:36:08.446305234Z" level=info msg="loading plugin \"io.containerd.runtime.v2.task\"..." type=io.containerd.runtime.v2
time="2024-02-02T14:36:08.446589823Z" level=info msg="loading plugin \"io.containerd.runtime.v2.shim\"..." type=io.containerd.runtime.v2
time="2024-02-02T14:36:08.446619509Z" level=info msg="loading plugin \"io.containerd.sandbox.store.v1.local\"..." type=io.containerd.sandbox.store.v1
time="2024-02-02T14:36:08.446666322Z" level=info msg="loading plugin \"io.containerd.sandbox.controller.v1.local\"..." type=io.containerd.sandbox.controller.v1
time="2024-02-02T14:36:08.446705283Z" level=info msg="loading plugin \"io.containerd.service.v1.containers-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446759587Z" level=info msg="loading plugin \"io.containerd.service.v1.content-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446780137Z" level=info msg="loading plugin \"io.containerd.service.v1.diff-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446806016Z" level=info msg="loading plugin \"io.containerd.service.v1.images-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446835246Z" level=info msg="loading plugin \"io.containerd.service.v1.introspection-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446858358Z" level=info msg="loading plugin \"io.containerd.service.v1.namespaces-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446883787Z" level=info msg="loading plugin \"io.containerd.service.v1.snapshots-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446902822Z" level=info msg="loading plugin \"io.containerd.service.v1.tasks-service\"..." type=io.containerd.service.v1
time="2024-02-02T14:36:08.446932581Z" level=info msg="loading plugin \"io.containerd.grpc.v1.containers\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.446961217Z" level=info msg="loading plugin \"io.containerd.grpc.v1.content\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.446981273Z" level=info msg="loading plugin \"io.containerd.grpc.v1.diff\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.446997347Z" level=info msg="loading plugin \"io.containerd.grpc.v1.events\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447016883Z" level=info msg="loading plugin \"io.containerd.grpc.v1.images\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447036957Z" level=info msg="loading plugin \"io.containerd.grpc.v1.introspection\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447052236Z" level=info msg="loading plugin \"io.containerd.grpc.v1.leases\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447070724Z" level=info msg="loading plugin \"io.containerd.grpc.v1.namespaces\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447087998Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandbox-controllers\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447107438Z" level=info msg="loading plugin \"io.containerd.grpc.v1.sandboxes\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447123714Z" level=info msg="loading plugin \"io.containerd.grpc.v1.snapshots\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447148047Z" level=info msg="loading plugin \"io.containerd.grpc.v1.streaming\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447166979Z" level=info msg="loading plugin \"io.containerd.grpc.v1.tasks\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447198670Z" level=info msg="loading plugin \"io.containerd.transfer.v1.local\"..." type=io.containerd.transfer.v1
time="2024-02-02T14:36:08.447442412Z" level=info msg="loading plugin \"io.containerd.grpc.v1.transfer\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447530474Z" level=info msg="loading plugin \"io.containerd.grpc.v1.version\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447564804Z" level=info msg="loading plugin \"io.containerd.internal.v1.restart\"..." type=io.containerd.internal.v1
time="2024-02-02T14:36:08.447645525Z" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
time="2024-02-02T14:36:08.447680777Z" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
time="2024-02-02T14:36:08.447698141Z" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
time="2024-02-02T14:36:08.447725487Z" level=info msg="skipping tracing processor initialization (no tracing plugin)" error="no OpenTelemetry endpoint: skip plugin"
time="2024-02-02T14:36:08.447855630Z" level=info msg="loading plugin \"io.containerd.grpc.v1.healthcheck\"..." type=io.containerd.grpc.v1
time="2024-02-02T14:36:08.447879733Z" level=info msg="loading plugin \"io.containerd.nri.v1.nri\"..." type=io.containerd.nri.v1
time="2024-02-02T14:36:08.447899819Z" level=info msg="NRI interface is disabled by configuration."
time="2024-02-02T14:36:08.448219892Z" level=info msg=serving... address=/var/run/docker/containerd/containerd-debug.sock
time="2024-02-02T14:36:08.448288542Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock.ttrpc
time="2024-02-02T14:36:08.448345253Z" level=info msg=serving... address=/var/run/docker/containerd/containerd.sock
time="2024-02-02T14:36:08.448380520Z" level=info msg="containerd successfully booted in 0.075643s"
time="2024-02-02T14:36:11.271697198Z" level=info msg="Loading containers: start."
time="2024-02-02T14:36:11.366705804Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
time="2024-02-02T14:36:11.367239329Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
time="2024-02-02T14:36:11.367283663Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
failed to start daemon: Error initializing network controller: error creating default "bridge" network: Failed to Setup IP tables: Unable to enable NAT rule:  (iptables failed: iptables --wait -t nat -I POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE: Warning: Extension MASQUERADE revision 0 not supported, missing kernel module?
iptables v1.8.10 (nf_tables):  CHAIN_ADD failed (No such file or directory): chain POSTROUTING
 (exit status 4))

I'm using GKE version 1.28, with the default dind container image in the helm chart.

asafhm commented 9 months ago

Wonder if this has anything to do with the recent fix GKE has released for CVE-2023-6817

viniciusesteter commented 9 months ago

Wonder if this has anything to do with the recent fix GKE has released for CVE-2023-6817

I don't think so, because these errors have been happening in my cluster for the last 4-5 months.

asafhm commented 9 months ago

Ended up following this workaround which made dind work again: https://github.com/actions/actions-runner-controller/issues/3159#issuecomment-1906905610

Still think dind needs to address this.
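For readers who don't want to follow the link: the workaround boils down to forcing the dind sidecar onto legacy iptables through an environment variable understood by the docker:dind entrypoint. A minimal sketch of just the relevant part of the dind container spec (the full pod template appears in a later comment in this thread):

- name: dind
  image: docker:dind
  env:
    - name: DOCKER_IPTABLES_LEGACY   # makes the dind entrypoint fall back to iptables-legacy
      value: "1"
  securityContext:
    privileged: true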

viniciusesteter commented 9 months ago

But in which Helm manifest can I put these arguments? I don't have any value that configures the docker container. I'm actually using image: summerwind/actions-runner:latest in my Dockerfile and summerwind/actions-runner:latest in the values.yaml of my Helm deployment.yaml.
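One possible place, depending on the ARC version installed: the legacy RunnerDeployment CRD has, in recent versions, a dockerEnv field on the runner template that is passed through to the docker sidecar. This is only a hedged sketch, not a confirmed fix; verify that your installed CRD actually exposes the field (for example with kubectl explain runnerdeployment.spec.template.spec) before relying on it.

spec:
  template:
    spec:
      dockerEnv:                       # forwarded to the docker (dind) sidecar, if your CRD version supports it
        - name: DOCKER_IPTABLES_LEGACY
          value: "1"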

asafhm commented 9 months ago

I agree, that's tricky. I actually transitioned to the new runner-scale-set operator, where you can control the pod template, including the dind sidecar container.

jctrouble commented 9 months ago

I agree, that's tricky. I actually transitioned to the new runner-scale-set operator, where you can control the pod template, including the dind sidecar container.

@asafhm Would you be willing to share the snippet of your values.yaml (or helm command) where you specified the dind container with the workaround environment variable?

asafhm commented 9 months ago

@jctrouble Here's a portion of the values.yaml I use for the gha-runner-scale-set chart:

template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command:
          ["cp", "-r", "-v", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        imagePullPolicy: Always
        command: ["/home/runner/run.sh"]
        resources:
          limits:
            cpu: 400m
            memory: 512Mi
          requests:
            cpu: 200m
            memory: 256Mi
        env:
          - name: DOCKER_HOST
            value: unix:///run/docker/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker
            readOnly: true
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///run/docker/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
          # TODO: Once this issue is fixed (https://github.com/actions/actions-runner-controller/issues/3159),
          # we can switch to containerMode.type=dind and keep only the "runner" container specs and remove the "dind" container, init containers and volumes parts from the values.
          - name: DOCKER_IPTABLES_LEGACY
            value: "1"
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /run/docker
          - name: dind-externals
            mountPath: /home/runner/externals
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}

The reason I added a lot more here than just the env var part is that the docs specify that if you need to modify anything in the dind container, you have to carry all of its configuration in your values file and edit it there. Not a clean solution yet, I'm afraid, but at least it works well.
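A related gotcha hinted at by the TODO comment in the snippet: when the dind container is written out manually in template.spec like this, containerMode is normally left unset in the same values file, otherwise the chart may try to inject its own dind wiring on top of the hand-written one. A rough sketch of how the top of such a values.yaml tends to look, assuming the chart defaults otherwise:

# containerMode:            # intentionally left unset; the dind sidecar is defined manually below
#   type: "dind"
template:
  spec:
    # ...runner, dind and init containers exactly as in the snippet above...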

rekha-prakash-maersk commented 9 months ago

Hi @asafhm, I have tried your workaround but I'm still facing the same issue. The problem started after upgrading the new scale-set to its latest version. Any other options to try? Thanks!

The runner starts fine, but the error appears when I run a workflow that has a docker build step, so I am a bit clueless!

asafhm commented 9 months ago

@rekha-prakash-maersk Did you verify that the runner pods that come up have said env var in the dind container spec? Also, did you check the dind container logs? "Cannot connect to the Docker daemon at unix:///run/docker.sock" can have a number of causes.

rekha-prakash-maersk commented 8 months ago

Hi @asafhm, I found that the dind container needed more resources for the docker build that was being executed. Thanks for the help!

sravula84 commented 7 months ago

We are facing a similar issue:

time="2024-04-11T22:08:59.214409763Z" level=info msg="Loading containers: start."
time="2024-04-11T22:08:59.337082693Z" level=info msg="stopping event stream following graceful shutdown" error="" module=libcontainerd namespace=moby
time="2024-04-11T22:08:59.337523532Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
time="2024-04-11T22:08:59.337584737Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: unable to add return rule in DOCKER-ISOLATION-STAGE-1 chain: (iptables failed: iptables --wait -A DOCKER-ISOLATION-STAGE-1 -j RETURN: iptables v1.8.10 (nf_tables): RULE_APPEND failed (No such file or directory): rule in chain DOCKER-ISOLATION-STAGE-1
(exit status 4))
Stream closed EOF for arc-runners/arc-runner-set-qwwpf-runner-nffvc (dind)

Any suggestions, @rekha-prakash-maersk @asafhm?

marc-barry commented 6 months ago

I'm having the same issue on Google Cloud Platform (GKE) when simply using:

containerMode:
   type: "dind"

I haven't adjusted any of the values.

rekha-prakash-maersk commented 6 months ago

Hi @marc-barry, I allocated more CPU and memory to the dind container as shown below, which resolved the issue for me:

- name: dind
  image: docker:dind
  args:
    - dockerd
    - --host=unix:///run/docker/docker.sock
    - --group=$(DOCKER_GROUP_GID)
  env:
    - name: DOCKER_GROUP_GID
      value: "123"
  resources:
    requests:
      memory: "500Mi"
      cpu: "300m"
    limits:
      memory: "500Mi"
      cpu: "300m"
  securityContext:
    privileged: true

marc-barry commented 6 months ago

@rekha-prakash-maersk thanks for that information. We've decided to move away from using runners on Kubernetes as the documentation isn't yet fully complete and we don't want to spend our time fighting infrastructure problems like we are experiencing with this controller. The concepts and ideas are pretty sound but the execution is challenging. For the time being, we have gone to bare VMs running Debian on GCP on both t2a-standard-x for our Arm64 builds and t2d-standard-x for our Amd64 builds. We then have an image template that simply has Docker installed on the machine and the runner started with Systemd. I was able to get this all running in under an hour versus the challenges faced with the Actions Runner Controller.

GitHub Actions is super convenient, and that's why we use it. But if I find we need to bring more and more of our runners in-house, I'll switch us to Buildkite, as I feel their BYOC model is a bit more developed (and I have a lot of experience with it).

sravula84 commented 6 months ago

@rekha-prakash-maersk do we need to comment out the section below?

containerMode:
  type: "dind"

Nuru commented 5 months ago

I am seeing this, too, intermittently, running on AWS EKS, Kubernetes v1.29.3.

/usr/bin/docker build ...
ERROR: Cannot connect to the Docker daemon at unix:///run/docker/docker.sock. Is the docker daemon running?

casey-robertson-paypal commented 3 months ago

I am seeing this, too, intermittently, running on AWS EKS, Kubernetes v1.29.3.

/usr/bin/docker build ...
ERROR: Cannot connect to the Docker daemon at unix:///run/docker/docker.sock. Is the docker daemon running?

Same here. It's a very small percentage of jobs but I have yet to figure out why.