Closed: samruddhikhandale closed this issue 1 year ago
For the last two weeks we have been experiencing the same issue. Running the script recovers the docker instance.
I wanted to add that I've also been experiencing this issue much more frequently than normal.
Thanks for reporting! Opened https://github.com/devcontainers/features/pull/637 which adds retry logic to address the issue ^
@samruddhikhandale We have just tried version 2.3.0 of Docker-in-Docker, but we are still observing the issue with the Docker daemon.
Running the onCreateCommand from devcontainer.json...
zsh .devcontainer/on-create.sh
on-create start
Removing existing clusters...
FATA[0000] runtime failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Removing existing registries...
FATA[0000] failed to list nodes: docker failed to get containers with labels 'map[app:k3d]': failed to list containers: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
(*) Failed to start docker, retrying in 5s...
@AndriiTsok Hmm, in the creation logs do you see this log message ^? I would like to validate whether the retry is actually happening. I wonder if it's conflicting with (or running in parallel with) the onCreateCommand.
@samruddhikhandale Thank you for your fast reply!
We tried to create more than 30 codespaces during the last few hours to try to reproduce the issue. At the moment we are not able to reproduce (*) Failed to start docker, retrying in 5s...
in the creation logs. Earlier this morning the docker step was failing in almost every codespace.
Today, we made sure that the prebuilt image was not used and that the codespaces were created for new branches. We also explicitly set 2.3.0 for the D-in-D feature.
We will keep an eye on the logs and the stability and will create another issue in case it is reproducible.
@samruddhikhandale we have just managed to catch the error again. Here is the creation log: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f
Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284
Line with Docker daemon error: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L307
We are simply trying to check for existing clusters and remove them when rebuilding the codespaces:
# Get the list of all existing clusters
clusters=$(k3d cluster list -o json | jq -r '.[].name')

# Iterate over each cluster and delete it
for cluster in $clusters; do
    echo "Deleting cluster $cluster"
    k3d cluster delete "$cluster"
done

# Create a new cluster with the given configuration
k3d cluster create --config .devcontainer/k3d/config.yaml --kubeconfig-update-default
It fails when trying to list the clusters.
P.S. I have observed that the Docker daemon is now failing less frequently than it was just three days ago.
@samruddhikhandale I can confirm the issue is still occurring randomly. I would say it is around 50/50; usually a rebuild restores the codespace.
One of the observed log values:
[161525 ms] Start: Run: docker run --sig-proxy=false -a STDOUT -a STDERR --mount type=bind,src=/var/lib/docker/codespacemount/workspace,dst=/workspaces --mount type=volume,src=dind-var-lib-docker-0hcbhh2c7vldoj773drm1bjldnb89u0rt96sl9nju22d9ou8d14n,dst=/var/lib/docker --mount type=volume,src=minikube-config,dst=/home/vscode/.minikube --mount source=/root/.codespaces/shared,target=/workspaces/.codespaces/shared,type=bind --mount source=/var/lib/docker/codespacemount/.persistedshare,target=/workspaces/.codespaces/.persistedshare,type=bind --mount source=/.codespaces/agent/mount,target=/.codespaces/bin,type=bind --mount source=/mnt/containerTmp,target=/tmp,type=bind --mount type=bind,src=/.codespaces/agent/mount/cache,dst=/vscode -l Type=codespaces -e CODESPACES=******** -e ContainerVersion=13 -e RepositoryName=Monorepo --label ContainerVersion=13 --hostname codespaces-86fa16 --add-host codespaces-86fa16:127.0.0.1 --cap-add sys_nice --network host --privileged --entrypoint /bin/sh vsc-monorepo-9081da56175f3b1ac597257c0566d7ce76b18fbc1a048e05bdbd04f7efb0dfca-features -c echo Container started
Container started
sed: couldn't flush stdout: Device or resource busy
Outcome: success User: node WorkspaceFolder: /workspaces/Monorepo
devcontainer process exited with exit code 0
Running blocking commands...
sed: couldn't flush stdout: Device or resource busy
@AndriiTsok Thanks for the update.
Failed to start docker line: https://gist.github.com/AndriiTsok/1a62138fca79da47cb8d90db1b87ca9f#file-gistfile1-txt-L284
Looks like the retry mechanism is triggered; we retry up to five times until the docker daemon starts. From the logs, we can only see one such log statement, hence I can think of two things which might be happening:
@AndriiTsok would it be possible to provide a sample repro (i.e. a sample dev container config)? I'd like to experiment with a few things. If not, no worries, I could play around by adding docker commands within onCreateCommand.
@AndriiTsok In the meantime, would it be possible to add a similar retry logic (which starts the docker daemon) to your onCreateCommand script? Let me know if this works!
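For reference, such a guard at the top of on-create.sh could look roughly like this (a minimal sketch, not the Feature's own logic; the retry count and sleep interval are arbitrary, and it assumes the default /usr/local/share/docker-init.sh path installed by the docker-in-docker Feature):

#!/bin/bash
# Retry starting dockerd a few times before running any docker/k3d commands.
MAX_RETRIES=5
for attempt in $(seq 1 "$MAX_RETRIES"); do
    if docker info > /dev/null 2>&1; then
        echo "Docker daemon is up"
        break
    fi
    echo "(*) Docker not ready (attempt $attempt/$MAX_RETRIES), starting dockerd and retrying in 5s..."
    /usr/local/share/docker-init.sh
    sleep 5
done
# Fail fast if docker never came up, so the lifecycle script error is obvious.
docker info > /dev/null 2>&1 || { echo "Docker daemon failed to start"; exit 1; }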
We have experienced this issue consistently in the last 4 days. It happens during codespaces prebuild. I see the line (*) Failed to start docker, retrying in 5s...
once in each of the failed jobs. Successful jobs do not have this line in the log. It didn't start with version 2.3.0, but somehow downgrading to 2.2.1 fixes it. I have not seen the error with 2.2.1.
I wonder if adding retry logic is somehow breaking the codespace prebuilds. Thanks @tom-growthbox for reporting the issue and providing a temporary solution.
@tom-growthbox would it be possible to provide a sample repro? (ie dev container config) It would help me investigate and fix the underlying issue, thanks!
I created a sample repo with similar configuration as we use. However, the prebuild doesn't fail on this one. Would need to spend some time on this to reproduce the issue.
@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then start failing every second time. It seems like dockerd starts, but once the on-create script is called, it crashes.
Same error here; it took 15 new codespace instances to finally trigger and catch it. Version used: 2.3.0
Published creation-log.txt and the output of cat dockerd.log here:
https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727
time="2023-08-16T19:07:55.475998600Z" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
failed to start daemon: error while opening volume store metadata database: timeout
x\"..." type=io.containerd.runtime.v1
Discussion and context: https://github.com/orgs/community/discussions/63776 Thanks very much for the work on this!
@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then start failing every second time. It seems like dockerd starts, but once the on-create script is called, it crashes.
@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔
Hi @mandrasch 👋
In your dev container, docker is added by the universal image. See https://github.com/devcontainers/images/blob/main/src/universal/.devcontainer/devcontainer.json#L64-L66.
Hence, adding the docker-in-docker Feature has no impact, as the docker-init script is old. See https://gist.github.com/mandrasch/3b001bccdb8ab9ab75ebd7ddc6973727#file-creation-log-txt-L168. (I think this behavior should be updated.)
The prod image was built with Feature version 2.2.1 (released on Aug 3rd), which does not contain the retry logic. I'll work on releasing a new universal image which would be built with v2.3.1.
Let me know if that makes sense.
I'll work on releasing a new universal image which would be built with v2.3.1.
Opened https://github.com/devcontainers/images/pull/705. In the meantime, @mandrasch can you use the dev image (mcr.microsoft.com/devcontainers/universal:dev)? Let me know if that helps!
Also, can you remove the docker-in-docker Feature from your dev container, as it is unnecessary (given that the universal image already installs it)?
I'll work on releasing a new universal image which would be built with v2.3.1.
The image is live now!
Hi @samruddhikhandale! Thanks so much for explaining this! 🙏 🙏
I now removed the docker-in-docker feature from my devcontainer.json; works for me.
"image": "mcr.microsoft.com/devcontainers/universal:2",
"features": {
"ghcr.io/ddev/ddev/install-ddev:latest": {}
},
Since the bug only occurred 1 time out of 15, I can't really say if it really fixes the problem now. I'll post here again if it happens again, but hopefully that won't be the case. 👍
Question regarding this:
Is there a way to check which docker-in-docker version is used inside the universal image? (I checked https://github.com/devcontainers/images/pull/705 but could not find a commit related to a version number?) Thanks!
Since the bug only occurred 1 time out of 15, I can't really say if it really fixes the problem now. I'll post here again if it happens again, but hopefully that won't be the case. 👍
@mandrasch One more thing: the universal image is cached in a codespace, hence even now you will get a pull of an older image (unless you pin it to 2.5.0). I am working on updating the cache for Codespaces, but that will take a day or two.
Is there a way to check which docker-in-docker version is used inside the universal image? (I checked https://github.com/devcontainers/images/pull/705 but could not find a commit related to a version number?) Thanks!
Unfortunately, I don't think there's a direct way to find out the Feature version.
However, you can run the devcontainer-info command in your codespace/dev container and find the value for Source code release/branch, e.g. Source code release/branch: v0.3.15.
Go to https://github.com/devcontainers/images/actions/workflows/push.yml and find the workflow for v0.3.15. See https://github.com/devcontainers/images/actions/runs/5884996703 and find the job which builds the universal image (in this case, it's job 61).
You can search for docker-in-docker and find the Feature version. See https://github.com/devcontainers/images/actions/runs/5884996703/job/15985471064#step:6:14994
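For example, something along these lines should print the release the image was built from (assuming devcontainer-info is on the PATH, as it is in the official images):

devcontainer-info | grep "Source code release/branch"
# e.g. prints: Source code release/branch: v0.3.15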
@samruddhikhandale I created a repro container as well: https://github.com/TRYON-Technology/Docker-in-Docker-Issue It is difficult to catch the problem. It can succeed 10 times straight and then start failing every second time. It seems like dockerd starts, but once the on-create script is called, it crashes.
@AndriiTsok Unfortunately, I don't see any dev container files added to https://github.com/TRYON-Technology/Docker-in-Docker-Issue. Am I missing something? 🤔
Hi @samruddhikhandale, I have just re-pushed the container files to https://github.com/TRYON-Technology/Docker-in-Docker-Issue
I also added an error.log showing the issue: https://github.com/TRYON-Technology/Docker-in-Docker-Issue/blob/3e6849493c9cfd073c2e45563647264385d9a07a/error.log#L2216
@samruddhikhandale Thanks so much for the detailed technical background information, very helpful! 🙏
Re-opening as retry logic is reverted. See https://github.com/devcontainers/features/pull/659
Opened https://github.com/devcontainers/features/issues/660 for tracking docker failures due to "sed: couldn't flush stdout: Device or resource busy" errors
If anyone needs a quick fix - I've gotten in the habit of doing this recently, it has been working well:
#!/bin/bash
echo "Waiting for Docker to start"
# Wait for Docker to start up
while [ ! -S /var/run/docker.sock ]; do
    echo -n "."
    /usr/local/share/docker-init.sh
    sleep 1
done
echo "Docker is running!"
{
    "postCreateCommand": "bash .devcontainer/start-docker.sh"
}
.devcontainer/
├── devcontainer.json
└── start-docker.sh
@bherbruck I ran the docker-init.sh script and the output I got was "Setting dockerd Azure DNS", which didn't resolve the docker issue... have you stumbled into something like that?
@darmalovan I get that same output if I run /usr/local/share/docker-init.sh too many times
docker-in-docker v2.4.0 includes the following changes, see https://github.com/devcontainers/features/pull/669: it adds retries to fix the "sed: couldn't flush stdout: Device or resource busy" errors that cause docker to fail to start, and the fix was stress-tested with an onCreateCommand. The stress test is 🟢
Important Note: /usr/local/share/docker-init.sh, which starts/retries dockerd, is added to the entrypoint command. This command runs in the background and is not a blocking script for the container startup. Since it's in the background, onCreateCommand/postCreateCommand/postStartCommand could all start executing before docker is fully running. If docker takes too long to start, that could introduce flakiness in those lifecycle scripts.
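If a lifecycle script needs docker right away, one possible workaround (a sketch only, not part of the Feature; the 60-second limit is arbitrary) is to block until the daemon answers before doing any docker work:

# Wait up to ~60s for the dockerd started by the background entrypoint to respond.
timeout 60 bash -c 'until docker info > /dev/null 2>&1; do sleep 1; done' \
    || { echo "docker daemon did not come up in time"; exit 1; }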
Opened https://github.com/devcontainers/spec/issues/299 which requests new semantics for "blocking" entrypoints that the CLI waits for. This way we can ensure that docker is already up and running for the mentioned ^ lifecycle scripts and is available in the container.
Closing in favor of https://github.com/devcontainers/features/issues/671. Feel free to reopen if needed, or comment on https://github.com/devcontainers/features/issues/671 if you still run into docker-not-running issues. Thank you!
Sometimes docker fails to start within a container with the following error 👇
As manually running /usr/local/share/docker-init.sh fixes this issue, add some retry mechanism to the docker-init script.