actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Intermittently getting "Cannot connect to the Docker daemon at unix:///var/run/docker.sock" #3794

Open AurimasNav opened 3 days ago

AurimasNav commented 3 days ago

Checks

Controller Version

0.9.3

Helm Chart Version

0.9.3

CertManager Version

1.16.1

Deployment Method

ArgoCD

cert-manager installation

cert-manager is working


Resource Definitions

values.yaml for our ARC runner scale set Helm installation:

githubConfigUrl: https://github.com/<org>
controllerServiceAccount:
  namespace: arc
  name: arc-gha-rs-controller
githubConfigSecret: arc-runner-set
maxRunners: 2
minRunners: 1
runnerGroup: "default"
runnerScaleSetName: "custom"
containerMode:
  type: dind
template:
  spec:
    hostNetwork: true
    containers:
    - name: runner
      image: some.azurecr.io/custom-actions-runner:latest
      command: ["/home/runner/run.sh"]
    imagePullSecrets:
    - name: acr-connectivity-pull
image:
  actionsRunnerImagePullSecrets:
  - name: acr-connectivity-pull

To Reproduce

Run any action that uses a docker command. The error does not happen every time; I'd say it occurs about 1 in 10 runs, and rerunning the job is usually successful.

Describe the bug

Running an action that includes a docker command, for example: docker build . --file Dockerfile --tag $env:FullImageName --secret id=npm_token,env=NPM_TOKEN --build-arg NODE_ENV=production, intermittently results in an error:

ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? NativeCommandExitException: /home/runner/_work/_temp/52c5c530-065c-45b1-b663-3abe54de30f1.ps1:5 Line | 5 | docker build . --file Dockerfile --tag $env:FullImageName --secret id … | ~~~~~~~~~~~~~~~~~ | Program "docker" ended with non-zero exit code: 1.

Describe the expected behavior

Being able to connect to unix:///var/run/docker.sock 100% of the runs.
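Until the root cause is found, one workaround at the workflow level is to wait for the daemon to come up before the first docker command, since reruns usually succeed. A minimal sketch (the `wait_for` helper name and the retry budget are arbitrary, not part of ARC):

```shell
# Retry a probe command until it succeeds or the attempt budget runs out.
wait_for() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep 1
    fi
  done
  return 1
}

# In a workflow step, before the first docker command (docker assumed on PATH):
#   wait_for 30 docker info > /dev/null 2>&1 || exit 1
```

This only papers over a daemon that is slow to start; it will not help if dockerd crashed outright on the iptables error below.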

Whole Controller Logs

https://gist.github.com/AurimasNav/398f849114ad71860eb0a0fcf465d691

Whole Runner Pod Logs

https://gist.github.com/AurimasNav/0660c09ba17d845591169ddf230dce48

Additional Context

In the dind container log I can see:

failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: iptables failed: iptables --wait -t nat -N DOCKER: iptables: Chain already exists. (exit status 1)

I am not sure why that happens or how it can be solved. Might this have something to do with this part of my values.yaml config?

template:
  spec:
    hostNetwork: true

(if I don't specify this, my containers in actions have no internet access).
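The iptables failure in the dind log looks like a check-then-create collision: `iptables -t nat -N DOCKER` is not idempotent and fails if a previous daemon (or a concurrent one sharing the host network namespace) already created the chain. The same semantics can be sketched without root, using mkdir as a stand-in for chain creation (the paths here are illustrative only):

```shell
# mkdir, like iptables -N, fails when the object already exists.
chain="$(mktemp -d)/DOCKER"

mkdir "$chain" && echo "first create: ok"
mkdir "$chain" 2>/dev/null || echo "second create: Chain already exists"
```

With hostNetwork: true every dind daemon on the node manipulates the same host iptables rules, so a chain left behind by another runner (or a crashed daemon) would plausibly produce exactly the "Chain already exists" message.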

github-actions[bot] commented 3 days ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

tdorianh commented 1 day ago

@AurimasNav When you tried this without hostNetwork: true, was it in an environment with a service mesh sidecar injection like istio?

I ran into a similar issue with hostNetwork: true when 2 dind runners would come up on the same node at the same time.

One workflow would fail with

ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

and the dind container logs would have

failed to start daemon: Error initializing network controller: error obtaining controller instance: failed to register "bridge" driver: failed to create NAT chain DOCKER: iptables failed: iptables --wait -t nat -N DOCKER: iptables: Resource temporarily unavailable.

I think this is because both runners were trying to use iptables at the same time, for the host network configuration. I suspect using hostNetwork: true may result in resource contention on the node.
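If the contention theory holds, one mitigation sketch is to keep hostNetwork dind runners off the same node with pod anti-affinity in the runner template. The label under matchLabels is an assumption, not something ARC is guaranteed to set; check the labels actually present on your runner pods:

```yaml
template:
  spec:
    hostNetwork: true
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: runner  # assumption: replace with a label present on your runner pods
          topologyKey: kubernetes.io/hostname
```

Note that on a single-node cluster this would simply block the second runner from scheduling at all, which may or may not be acceptable.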

Anyway, I was also using hostNetwork: true because the containers didn't have internet access without it, which was actually caused by istio sidecar injection. Runners with hostNetwork: true did not receive istio sidecars, while others did. Any runner with an istio sidecar did not have internet access in containers, and removing the sidecars fixed the "no internet access without hostNetwork" issue.

AurimasNav commented 1 day ago


There is no service mesh or any kind of sidecar injection; it is a k3s install on a single-node server. But I guess it could still be a problem with 2 runners: even though I reduced maxRunners to 1, I have another actions-runner-controller runner scale set for a different GitHub org running on the same k3s cluster.