devcontainers / features

A collection of Dev Container Features managed by Dev Container spec maintainers. See https://github.com/devcontainers/feature-starter to publish your own
https://containers.dev/features
MIT License

Docker-in-docker: Add retry mechanism into the docker init script (Failed to connect to Docker) #634

Closed · samruddhikhandale closed this 1 year ago

samruddhikhandale commented 1 year ago

Sometimes docker fails to start within a container with the following error 👇

Failed to connect to Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

As manually running /usr/local/share/docker-init.sh fixes the issue, add a retry mechanism to the docker-init script.
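
For illustration, a minimal sketch of the kind of retry loop being asked for here (this is not the actual patch; the attempt count and 5s delay are assumptions):

#!/bin/bash
# Illustrative sketch only -- not the change that shipped in the Feature.
# Re-run the init script until the daemon answers, up to 5 attempts.
for attempt in 1 2 3 4 5; do
  if docker info > /dev/null 2>&1; then
    echo "Docker is running"
    break
  fi
  echo "(*) Docker not reachable yet, retrying in 5s... (attempt ${attempt}/5)"
  /usr/local/share/docker-init.sh || true
  sleep 5
done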

AndriiTsok commented 1 year ago

For the last two weeks we have been experiencing the same issue. Running the script recovers the docker instance.

kklopfenstein commented 1 year ago

I wanted to add that I've also been experiencing this issue much more frequently than normal.

samruddhikhandale commented 1 year ago

Thanks for reporting! Opened https://github.com/devcontainers/features/pull/637 which adds a retry logic to address the issue ^

AndriiTsok commented 1 year ago

@samruddhikhandale We have just tried version 2.3.0 of Docker-in-Docker, but we are still observing the issue with the Docker daemon.

Running the onCreateCommand from devcontainer.json...

zsh .devcontainer/on-create.sh
on-create start
Removing existing clusters...
FATA[0000] runtime failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
Removing existing registries...
FATA[0000] failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 

samruddhikhandale commented 1 year ago

(*) Failed to start docker, retrying in 5s...

@AndriiTsok Hmm, in the creation logs do you see this log message ^? I would like to validate whether the retry is actually happening. I wonder if it's conflicting with (or running in parallel to) the onCreateCommand.
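
If it helps to check from inside the codespace, something like the following should show whether the retry message ever appeared; the log path is an assumption on my part, not something confirmed in this thread:

# Assumption: the creation log is persisted at this path inside a codespace.
grep -n "Failed to start docker" /workspaces/.codespaces/.persistedshare/creation.log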

AndriiTsok commented 1 year ago

@samruddhikhandale Thank you for your fast reply!

We tried creating more than 30 codespaces during the last few hours to reproduce the issue. At the moment we are not able to reproduce (*) Failed to start docker, retrying in 5s... in the creation logs. Earlier this morning the docker step was failing in almost every codespace.

Today, we made sure that the prebuilt image was not used and that the codespaces were created for new branches. We also explicitly pinned the D-in-D feature to 2.3.0.

We will keep an eye on the logs and the stability and will create another issue in case it is reproducible.

AndriiTsok commented 1 year ago

@samruddhikhandale we have just managed to catch the error again. Here is the creation log: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f

Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284

Line with Docker daemon error: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L307

We are simply trying to list the clusters and remove them for the cases where we are rebuilding the codespace:

# Get the list of all existing clusters
clusters=$(k3d cluster list -o json | jq -r '.[].name')

# Iterate over each cluster and delete it
for cluster in $clusters; do
  echo "Deleting cluster $cluster"
  k3d cluster delete "$cluster"
done

# Create a new cluster with the given configuration
k3d cluster create --config .devcontainer/k3d/config.yaml --kubeconfig-update-default

It fails while trying to list the clusters.

P.S. I have observed that the Docker daemon is now failing less frequently than it was just three days ago.

AndriiTsok commented 1 year ago

@samruddhikhandale I can confirm the issue is still occurring randomly. I would say it is around 50/50; usually a rebuild helps to restore the codespace.

One of the observed log values:

[161525 ms] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount type=bind,src=/var/lib/docker/codespacemount/workspace,dst=/workspaces --mount type=volume,src=dind-var-lib-docker-0hcbhh2c7vldoj773drm1bjldnb89u0rt96sl9nju22d9ou8d14n,dst=/var/lib/docker --mount type=volume,src=minikube-config,dst=/home/vscode/.minikube --mount source=/root/.codespaces/shared,target=/workspaces/.codespaces/shared,type=bind --mount source=/var/lib/docker/codespacemount/.persistedshare,target=/workspaces/.codespaces/.persistedshare,type=bind --mount source=/.codespaces/agent/mount,target=/.codespaces/bin,type=bind --mount source=/mnt/containerTmp,target=/tmp,type=bind --mount type=bind,src=/.codespaces/agent/mount/cache,dst=/vscode -l Type=codespaces -e CODESPACES=******** -e ContainerVersion=13 -e RepositoryName=Monorepo --label ContainerVersion=13 --hostname codespaces-86fa16 --add-host codespaces-86fa16:127.0.0.1 --cap-add sys_nice --network host --privileged --entrypoint /bin/sh vsc-monorepo-9081da56175f3b1ac597257c0566d7ce76b18fbc1a048e05bdbd04f7efb0dfca-features -c echo Container started
Container started
sed: couldn't flush stdout: Device or resource busy
Outcome: success User: node WorkspaceFolder: /workspaces/Monorepo
devcontainer process exited with exit code 0
Running blocking commands...

sed: couldn't flush stdout: Device or resource busy

samruddhikhandale commented 1 year ago

@AndriiTsok Thanks for the update.

Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284

Looks like the retry mechanism is triggered; we retry five times until the docker daemon starts. From the logs, we can only see one such log statement, hence I can think of two things which might be happening:

  1. There's a race condition in between retry logic and the start of the onCreateCommand (hence, the retries are not triggered and docker is not guaranteed to start)
  2. The docker daemon started fine, hence no more retry logs but some other process killed the daemon

@AndriiTsok would it be possible to provide a sample repro (ie a sample dev container config)? I'd like to experiment with a few things. If not, no worries, I could play around by adding docker commands within onCreateCommand.

@AndriiTsok In the meantime, would it be possible to add similar retry logic (which starts the docker daemon) to your onCreateCommand script? Let me know if this works!
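
For example, a guard at the top of .devcontainer/on-create.sh could look roughly like this (just a sketch; the timeout and delay values are arbitrary):

#!/bin/bash
# Sketch: block at the top of on-create.sh until the docker daemon answers.
timeout=60
until docker info > /dev/null 2>&1; do
  if [ "$timeout" -le 0 ]; then
    echo "Docker daemon did not come up in time" >&2
    exit 1
  fi
  echo "Waiting for the docker daemon, retrying in 5s..."
  /usr/local/share/docker-init.sh > /dev/null 2>&1 || true
  sleep 5
  timeout=$((timeout - 5))
done

# ...existing k3d cleanup / cluster creation continues here...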

tom-growthbox commented 1 year ago

We have experienced this issue consistently in the last 4 days. It happens during codespaces prebuild. I see the line (*) Failed to start docker, retrying in 5s... once in each of the failed jobs. Successful jobs do not have this line in the log. It didn't start with version 2.3.0, but somehow downgrading to 2.2.1 fixes it. I have not seen the error with 2.2.1.

samruddhikhandale commented 1 year ago

I wonder if adding the retry logic is somehow breaking the codespace prebuilds. Thanks @tom-growthbox for reporting the issue and providing a temporary workaround.

@tom-growthbox would it be possible to provide a sample repro? (ie dev container config) It would help me investigate and fix the underlying issue, thanks!

tom-growthbox commented 1 year ago

I created a sample repo with a configuration similar to the one we use. However, the prebuild doesn't fail on that one. I would need to spend some more time on this to reproduce the issue.

https://github.com/tom-growthbox/prebuild-error

AndriiTsok commented 1 year ago

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

mandrasch commented 1 year ago

Same error here; it took 15 new codespace instances to finally trigger and catch it. Version used: 2.3.0

Published creation-log.txt and output of cat dockerd.log here: https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727

time="2023-08-16T19:07:55.475998600Z" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
failed to start daemon: error while opening volume store metadata database: timeout
x\"..." type=io.containerd.runtime.v1

Discussion and context: https://github.com/orgs/community/discussions/63776 Thanks very much for the work on this!

samruddhikhandale commented 1 year ago

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔

samruddhikhandale commented 1 year ago

Hi @mandrasch 👋

In your dev container, docker is added by the universal image. See https://github.com/devcontainers/images/blob/main/src/universal/.devcontainer/devcontainer.json#L64-L66.

Hence, adding the docker-in-docker Feature has no impact, and the docker-init script in use is the old one from the image. See https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727#file-creation-log-txt-L168. (I think this behavior should be updated.)

The prod image was built with Feature version 2.2.1 (released on Aug 3rd) which does not contain the retry logic. I'll work on releasing a new universal image which would be built with v2.3.1.
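
One indirect check, just a heuristic of mine since the script carries no version string, is to grep the installed docker-init.sh for the retry message that 2.3.0 added:

# If this matches, the image's docker-init.sh already contains the retry logic (2.3.0+).
grep -n "retrying in 5s" /usr/local/share/docker-init.sh || echo "no retry logic found (older script)"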

Let me know if that makes sense.

samruddhikhandale commented 1 year ago

I'll work on releasing a new universal image which would be built with v2.3.1.

Opened https://github.com/devcontainers/images/pull/705. In the meantime, @mandrasch, can you use the dev image (mcr.microsoft.com/devcontainers/universal:dev)? Let me know if that helps!

Also, can you remove the docker-in-docker Feature from your dev container as that one is unnecessary (given that the universal image already installs it)?

samruddhikhandale commented 1 year ago

I'll work on releasing a new universal image which would be built with v2.3.1.

The image is live now!

mandrasch commented 1 year ago

Hi @samruddhikhandale! Thanks so much for explaining this! 🙏 🙏

I have now removed the docker-in-docker feature from my devcontainer.json; that works for me:

  "image": "mcr.microsoft.com/devcontainers/universal:2",
  "features": {
    "ghcr.io/ddev/ddev/install-ddev:latest": {}
  },

Since the bug occurred only once in 15 attempts, I can't really say whether it is actually fixed now. I'll post here again if it happens again, but hopefully that won't be the case. 👍

Question regarding this:

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked https://github.com/devcontainers/images/pull/705 but could not find a commit related to a version number?) Thanks!

samruddhikhandale commented 1 year ago

Since the bug occurred only once in 15 attempts, I can't really say whether it is actually fixed now. I'll post here again if it happens again, but hopefully that won't be the case. 👍

@mandrasch One more thing: the universal image is cached in a codespace, so even now you may still get an older image (unless you pin it to 2.5.0). I am working on updating the cache for Codespaces, but that will take a day or two.

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked https://github.com/devcontainers/images/pull/705 but could not find a commit related to a version number?) Thanks!

Unfortunately, I don't think there's a direct way to find out the Feature version.

AndriiTsok commented 1 year ago

@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then start failing every second time. It seems like dockerd starts, but once on-create is called, it crashes.

@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔

Hi @samruddhikhandale, I have just re-pushed the container files to https://github.com/TRYON-Technology/Docker-in-Docker-Issue

I also added an error.log showing the issue: https://github.com/TRYON-Technology/Docker-in-Docker-Issue/blob/3e6849493c9cfd073c2e45563647264385d9a07a/error.log#L2216

mandrasch commented 1 year ago

@samruddhikhandale Thanks so much for the detailed technical background information, very helpful! 🙏

samruddhikhandale commented 1 year ago

Re-opening as the retry logic has been reverted. See https://github.com/devcontainers/features/pull/659

Opened https://github.com/devcontainers/features/issues/660 for tracking docker failures due to "sed: couldn't flush stdout: Device or resource busy" errors

bherbruck commented 1 year ago

If anyone needs a quick fix: I've gotten into the habit of doing this recently, and it has been working well:

#!/bin/bash

echo "Waiting for Docker to start"

# Wait for Docker to start up
while [ ! -S /var/run/docker.sock ]; do
  echo -n "."
  /usr/local/share/docker-init.sh
  sleep 1
done

echo "Docker is running!"
{
  "postCreateCommand": "bash .devcontainer/start-docker.sh"
}
.devcontainer/
├── devcontainer.json
└── start-docker.sh

darmalovan commented 1 year ago

@bherbruck I ran the docker-init.sh script and the output I got was "Setting dockerd Azure DNS", which didn't resolve the docker issue... have you stumbled into something like that?

bherbruck commented 1 year ago

@darmalovan I get that same output if I run /usr/local/share/docker-init.sh too many times

samruddhikhandale commented 1 year ago

docker-in-docker v2.4.0 includes the following changes, see https://github.com/devcontainers/features/pull/669

Important Note: /usr/local/share/docker-init.sh, which starts/retries dockerd, is added to the entrypoint command. This command runs in the background and is not a blocking script for the container startup. Since it runs in the background, onCreateCommand/postCreateCommand/postStartCommand could all start executing before docker is fully running. If docker takes too long to start, that could introduce flakiness in those lifecycle scripts.

Opened https://github.com/devcontainers/spec/issues/299 which requests new semantics for "blocking" entrypoints that the CLI waits for. This way we can ensure that docker is already up and running, and available in the container, before the lifecycle scripts mentioned above execute.

Closing in favor of https://github.com/devcontainers/features/issues/671. Feel free to reopen if needed, or comment on https://github.com/devcontainers/features/issues/671 if you still run into "docker not running" issues. Thank you!