Open rufoa opened 3 years ago
I am seeing this exact same behavior on both a CentOS 7 install and an Ubuntu 21.04 install, both on amd64 hardware.
I am also facing this issue on Ubuntu 20.04 armv8 and Ubuntu 18.04 armv7.
Me also, I'm new to docker and it took me a really long time and 2 reinstalls to realise this was the problem.
Edit: I was able to get containers with the driver to start on boot by setting the restart policy to On Failure and adding
(sleep 20
docker start bragi) &
to my rc.local
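A fixed 20-second sleep can be too short on a slow boot and needlessly long on a fast one. A minimal sketch of an alternative is to poll until the daemon answers, with an upper bound. The helper name wait_until and the 60-second limit are made up for illustration. It also assumes that a successful `docker info` means the daemon is ready for the plugin's calls, which this thread does not confirm.

```shell
#!/bin/bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds or the timeout expires.
# Returns 0 on success, 1 if the timeout was reached first.
wait_until() {
    local timeout=$1
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical use in rc.local ("bragi" is the container name from the post above):
# (wait_until 60 docker info && docker start bragi) &
```

If the daemon never answers, the helper gives up and nothing is started, rather than blocking rc.local.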
I'm also having this problem. From a quick glance, it looks like the plugin tries to access the Docker socket before it's up and running. There's no timeout on the call, and no way that I can see to defer starting the containers until Docker has finished starting. 🤷
The logs below give a rough indication of the order of things: the container loads and hangs connecting to Docker, the daemon is interrupted (by me), and only then does it start providing the routes the plugin relies on 🤦
DEBU[2021-12-08T14:55:22.411028372+13:00] Assigning addresses for endpoint portainer's interface on network home
^CINFO[2021-12-08T14:57:44.718633565+13:00] Processing signal 'interrupt'
DEBU[2021-12-08T14:57:44.719427052+13:00] Releasing addresses for endpoint portainer's interface on network home
ERRO[2021-12-08T14:57:44.728083546+13:00] failed to start container container=ac48752e9a5c199e3f372109b05701261fc659c46a0d5068b64fdaaae9c1f2ed error="failed to create endpoint portainer on network home: NetworkDriver.CreateEndpoint: failed to get network options: failed to get info from Docker: error during connect: Get \"http://%2Frun%2Fdocker.sock/v1.13.1/networks/7b2bec4e103227f40b09c0e8744c514682b6cfa92a751efa79dbee7a7f406546\": read unix @->/var/run/docker.sock: read: connection reset by peer"
INFO[2021-12-08T14:57:44.728203189+13:00] Loading containers: done.
INFO[2021-12-08T14:57:44.749048232+13:00] Docker daemon commit=847da18 graphdriver(s)=overlay2 version=20.10.11
INFO[2021-12-08T14:57:44.750232282+13:00] Daemon has completed initialization
DEBU[2021-12-08T14:57:44.774791290+13:00] Registering routers
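The missing timeout described above is inside the plugin, so it cannot be fixed from the host. When reproducing the hang by hand, though, putting a deadline on CLI calls keeps your own shell from blocking with the daemon. This is only a sketch: run_with_deadline is a hypothetical helper, and it relies on the coreutils `timeout` command, which exits with status 124 when the limit is hit.

```shell
#!/bin/bash
# run_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND and gives up after SECONDS, using coreutils `timeout`.
# Returns the command's own exit status, or 124 if it timed out.
run_with_deadline() {
    local seconds=$1
    shift
    local status=0
    timeout "$seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${seconds}s: $*" >&2
    fi
    return "$status"
}

# Hypothetical use ("home" is the network name from the logs above):
# run_with_deadline 10 docker network inspect home
```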
Same problem. I worked around it by starting the container independently, from systemd.
Not very stable though.
I love that this plugin lets my containers get an IP over DHCP. But I also experienced this problem. When the Docker daemon hangs, reboot also hangs on my Ubuntu 20.04, which is not fun if you are connected remotely.
Any idea on how to solve it?
Same problem. I worked out a rather tricky, ad hoc way to temporarily circumvent it. First of all, I gave up using Docker's restart=always to start containers at boot, because it causes Docker to get stuck. Instead, I wrote a service file so that systemctl starts the Docker containers. The file is written like this:
[Unit]
Description=Docker Container
After=network-online.target docker.socket docker.service firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/root/docker-container
ExecStart=docker start ${DOCKER_CONTAINER_LIST}
ExecStop=docker stop ${DOCKER_CONTAINER_LIST}
[Install]
WantedBy=multi-user.target
This file is really tricky. The essence is to execute docker start ${environment variable} to start the container after docker starts normally. Here my environment variable file is placed under /root/docker-container; you can also put it elsewhere. The file for environment variables is written like this:
DOCKER_CONTAINER_LIST=centos nginx phpmyadmin ……
The trouble is that Docker no longer manages the containers started at boot. Whenever another container needs to start at boot, you have to append its name or ID to the environment variable file so that systemctl starts it automatically.
My system environment is Rocky Linux release 8.6, and the Docker version is 20.10.17. Different distributions may need different values in the service file for directives such as "After", "Wants", and "Requires"; adjust them if necessary.
Nov 08 00:55:14 docker001 docker[9228]: Error response from daemon: No such container: mosquitto portainer esphome frigate rtlamr2mqtt lms watchtower homeassistant
Nov 08 00:55:14 docker001 docker[9228]: Error: failed to start containers: mosquitto portainer esphome frigate rtlamr2mqtt lms watchtower homeassistant
Nov 08 00:55:14 docker001 systemd[1]: dockernetdhcp.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 00:55:14 docker001 systemd[1]: dockernetdhcp.service: Failed with result 'exit-code'.
Nov 08 00:55:14 docker001 systemd[1]: Failed to start Docker Net DHCP Container Hang Fix.
Manual test: Seems to work fine.
[root@docker001 system]# docker start mosquitto portainer esphome frigate rtlamr2mqtt lms watchtower homeassistant
mosquitto
portainer
esphome
frigate
rtlamr2mqtt
lms
watchtower
homeassistant
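A likely explanation for the "No such container: mosquitto portainer ..." error is that the whole list reached docker as one argument. In systemd command lines, ${VAR} is substituted as exactly one argument, whitespace included, while $VAR written as a separate word is split on whitespace. If so, writing ExecStart=/usr/bin/docker start $DOCKER_CONTAINER_LIST (no braces) should behave like the manual test. The sketch below is only a shell analogue of that difference, using quoting to imitate the two forms. The count_args helper is made up for illustration.

```shell
#!/bin/bash
# Print how many arguments a command receives.
count_args() { echo "$#"; }

DOCKER_CONTAINER_LIST="centos nginx phpmyadmin"

# Quoted: the list stays one argument, like systemd's ${DOCKER_CONTAINER_LIST}
single=$(count_args "$DOCKER_CONTAINER_LIST")

# Unquoted: the list is split on whitespace, like systemd's $DOCKER_CONTAINER_LIST
split=$(count_args $DOCKER_CONTAINER_LIST)

echo "as one word: $single argument(s); split: $split argument(s)"
```

With one argument, docker looks for a single container whose name is the entire space-separated string, which matches the error text in the log.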
Edit:
Got this working like this: /etc/systemd/system/dockernetdhcp.service
[Unit]
Description=Docker Net DHCP Container Hang Fix
After=network-online.target docker.socket docker.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/root/docker-container
ExecStart=/usr/bin/dockernetdhcpstart.sh
#ExecStop=/usr/bin/docker stop ${DOCKER_CONTAINER_LIST}
ExecStop=/usr/bin/dockernetdhcpstop.sh
[Install]
WantedBy=multi-user.target
/usr/bin/dockernetdhcpstart.sh
#!/bin/bash
# Start docker containers because of a glitch with netdhcp docker plugin.
docker start mosquitto portainer cloudflared esphome frigate rtlamr2mqtt lms watchtower homeassistant piper whisper openwakeword spoolman
/usr/bin/dockernetdhcpstop.sh
#!/bin/bash
# Stop docker containers because of a glitch with netdhcp docker plugin.
docker stop mosquitto portainer cloudflared esphome frigate rtlamr2mqtt lms watchtower homeassistant piper whisper openwakeword spoolman
@devplayer0 are you able to comment on the overall OP issue above?
Thank you.
Getting this issue as well. Time to try some of these solutions.
EDIT: Well I got nothing. Looks like I'm going back to macvlan. Here is to hoping this gets fixed.
Ended up just tweaking your script a tad for docker-compose.
[Unit]
Description=Docker-Compose Up
After=network-online.target docker.socket docker.service firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=docker-compose -f /usr/sbin/docker-compose.yml up -d
ExecStop=docker-compose -f /usr/sbin/docker-compose.yml down
[Install]
WantedBy=multi-user.target
@devplayer0 any chance you can compile the above timeout fix for one last build?
I've been using this plugin for years and this is the only issue I have had. Would be nice to not have to work around it with startup scripts.
To manually apply PR https://github.com/devplayer0/docker-net-dhcp/pull/43 and compile/install the plugin locally, follow the steps below. This fixes this issue as well as issue https://github.com/devplayer0/docker-net-dhcp/issues/42 (the 1.13.1 error with newer versions of Docker). I hope this helps people like me who don't know what they are doing!
git clone https://github.com/devplayer0/docker-net-dhcp.git
cd docker-net-dhcp
git fetch origin pull/43/head:celerway
git checkout celerway
Switched to branch 'celerway'
git branch -a
* celerway
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/dependabot/go_modules/github.com/containerd/containerd-1.5.18
  remotes/origin/dependabot/go_modules/github.com/docker/docker-20.10.24incompatible
  remotes/origin/dependabot/go_modules/golang.org/x/net-0.7.0
  remotes/origin/dependabot/go_modules/golang.org/x/sys-0.1.0
  remotes/origin/master
make create
docker plugin ls
ID             NAME                                        DESCRIPTION                          ENABLED
############   ghcr.io/devplayer0/docker-net-dhcp:golang   Docker host bridge DHCP networking   false
sudo docker plugin enable ghcr.io/devplayer0/docker-net-dhcp:golang
docker plugin ls
ID             NAME                                        DESCRIPTION                          ENABLED
############   ghcr.io/devplayer0/docker-net-dhcp:golang   Docker host bridge DHCP networking   true
sudo docker network ls
NETWORK ID     NAME             DRIVER                                                   SCOPE
############   bridge           bridge                                                   local
############   config_default   bridge                                                   local
############   dbrv100          ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv200          ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv300          ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv350          ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv400          ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
Shows all your old docker-net-dhcp additions, we need to remove them :(
I first tried:
sudo docker network rm dbrv100
Error response from daemon: error while removing network: failed deleting Network: plugin "ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64" not found
So then I nuked them with
sudo docker network prune
WARNING! This will remove all custom networks not used by at least one container.
Are you sure you want to continue? [y/N] y
sudo docker network ls
NETWORK ID     NAME     DRIVER   SCOPE
############   bridge   bridge   local
############   host     host     local
############   none     null     local
Then added them all back with the new compiled driver
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv100 dbrv100
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv200 dbrv200
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv300 dbrv300
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv350 dbrv350
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv400 dbrv400
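The five commands above differ only in the number suffix, so they can be generated in a loop. This is a dry-run sketch: the helper name network_create_cmd is made up, the driver tag and the brvNNN/dbrvNNN naming are copied from the commands above, and it only prints the commands.

```shell
#!/bin/bash
# Driver tag of the locally built plugin, as used in the commands above.
driver="ghcr.io/devplayer0/docker-net-dhcp:golang"

# Print the `docker network create` command for one suffix (e.g. 100).
network_create_cmd() {
    echo "docker network create -d $driver --ipam-driver null -o bridge=brv$1 dbrv$1"
}

# Dry run: prints one command per network and executes nothing.
for vlan in 100 200 300 350 400; do
    network_create_cmd "$vlan"
done
```

After checking the printed commands, they can be run by hand or piped to `sudo sh`.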
sudo docker network ls
NETWORK ID     NAME      DRIVER                                      SCOPE
############   bridge    bridge                                      local
############   dbrv100   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv200   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv300   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv350   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv400   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   host      host                                        local
############   none      null                                        local
Now they are all back in action but all the containers still point to the old bridges 😭
docker container inspect genericcontainer001
You will see all the old IDs for the old networks...
I tried doing
sudo docker network disconnect dbrv100 genericcontainer001
sudo docker network connect dbrv100 genericcontainer001
This changed the network so I thought win!
Yet NetworkMode was still stuck with the old network and would fail on startup.
sudo docker container start genericcontainer001
Error response from daemon: could not find a network matching network mode ####...: network ####... not found
Error: failed to start containers: genericcontainer001
So then I ran compose
docker-compose.yml
version: '3.9'
services:
  genericcontainer001:
    container_name: genericcontainer001
    hostname: genericcontainer001
    mac_address: de:ad:be:ef:00:01
    networks:
      - dhcp
networks:
  dhcp:
    #mac_address: de:ad:be:ef:00:01 #(for docker engine version 25)
    name: dbrv100
    external: true
docker compose up -d
It came back up with the right network without an issue yay!
Thanks to encbladexp in docker discord for helping me figure out the compile technique! Thanks to manjax in docker discord for telling me to use compose to bring the system up to fix the straggling networkmode issue.
Now I just need to overcome docker compose / dockerd ignoring the configured mac_address, and then I will be up and running again. I previously had it working by using the moby build of dockerd, but something else must have broken in it as well, because it now generates random MAC addresses. That breaks the static IPs my dnsmasq assigns over DHCP by MAC address :(
Portainer is working on a fix in 2.20; the rest of Docker still seems to have no fix for it.
I made a container image that includes the fix for PR https://github.com/devplayer0/docker-net-dhcp/pull/43 . Everyone can use this image to install the plugin directly.
docker plugin install totobo/docker-net-dhcp:v0.1.4-pull43
However, in my test, PR 43 only fixed the problem of docker daemon stuck due to timeout when the container started. In fact, the container that needs to be started at boot still fails to start. Based on this, I think this plugin is defective when the docker service is initialized after the server restarts. The specific sequence is as follows:
Based on the above process, a similar interlocking scenario is generated, which causes the container service installed with this plugin to fail to start normally. I have a possible solution, but I don't have the go language development ability to implement it. I hope the author or other PRs can implement similar functions: when the container is started, the plug-in should not call the docker API to obtain the container's network information, but instead call the docker container runtime (containerd) client library to query the container's network information.
I wrote a bash script and a custom systemd service that start the containers using the DHCP plugin at boot time. The container list is derived dynamically from Docker's container configuration files, so there is no need to maintain it by hand. Before using this script, make sure the jq command is installed on the server.
It is best used with the plugin totobo/docker-net-dhcp:v0.1.4-pull43. This plugin includes the changes from PR 43 and solves the problem of Docker getting stuck when the server starts.
cat /lib/systemd/system/docker-dhcp-container.service
[Unit]
Description=Docker DHCP Container
After=network-online.target docker.socket docker.service firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=bash -c 'docker start $(bash /data/scripts/dhcpcontainer.sh my-dhcp-net)'
ExecStop=bash -c 'docker stop $(bash /data/scripts/dhcpcontainer.sh my-dhcp-net)'
[Install]
WantedBy=multi-user.target
cat /data/scripts/dhcpcontainer.sh
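#!/bin/bash
# Print the names of containers attached to the network given as $1 that
# should be started at boot, based on Docker's on-disk container metadata.
filter_word=$1
container_path="/var/lib/docker/containers"
# IDs of all containers attached to the named network
filtered_container_id=$(jq -r --arg network "$filter_word" 'select(.NetworkSettings.Networks[$network]) | .ID' "$container_path"/*/config.v2.json)
final_container_id=""
final_container_name=""
for i in $filtered_container_id
do
    restart_policy=$(jq -r '.RestartPolicy.Name' "$container_path/$i/hostconfig.json")
    has_been_manually_stopped=$(jq -r '.HasBeenManuallyStopped' "$container_path/$i/config.v2.json")
    # Skip "unless-stopped" containers that were stopped by hand
    if [[ "$restart_policy" == "unless-stopped" ]] && [[ "$has_been_manually_stopped" == "true" ]]
    then
        continue
    fi
    # Keep only containers that have a restart policy
    if [[ "$restart_policy" != "no" ]]
    then
        final_container_id="$i $final_container_id"
    fi
done
# Translate the selected IDs into container names (stored with a leading "/")
for j in $final_container_id
do
    container_name=$(jq -r '.Name' "$container_path/$j/config.v2.json" | sed 's#/##g')
    final_container_name="$container_name $final_container_name"
done
echo "$final_container_name"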
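The selection rule in the script above can be pulled out into a small function and checked without jq or Docker installed. This is only a sketch: the function name should_start is made up, and the two inputs stand for the RestartPolicy.Name and HasBeenManuallyStopped values the script reads from disk.

```shell
#!/bin/bash
# should_start RESTART_POLICY HAS_BEEN_MANUALLY_STOPPED
# Returns 0 if the script above would start the container at boot, 1 otherwise.
should_start() {
    local policy=$1 manually_stopped=$2
    # "unless-stopped" containers that were stopped by hand stay stopped
    if [[ "$policy" == "unless-stopped" && "$manually_stopped" == "true" ]]; then
        return 1
    fi
    # Anything with a restart policy other than "no" is started
    [[ "$policy" != "no" ]]
}

for pair in "always false" "unless-stopped true" "unless-stopped false" "no false"; do
    # $pair is intentionally unquoted so it splits into the two arguments
    if should_start $pair; then
        echo "$pair -> start"
    else
        echo "$pair -> skip"
    fi
done
```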
Work with the current free claude.ai 3.5 Sonnet to code what you need. Its HumanEval score is currently 92.7%, so it may help you figure out how the pieces need to go together where you said you were lacking the skill.
I don't like using bash scripting to work around a problem instead of fixing it where it occurs.
Thanks for this project - it is great. I have noticed a small problem however:
When a container uses this plugin, and its restart policy makes the container start automatically when Docker starts, this plugin appears to hang. This prevents the container from starting, and seems to block the Docker daemon from responding too. If I kill this plugin's process, Docker seems to recover (but the container obviously doesn't come up properly).
If the container is not set to start automatically, and I instead start it manually, everything works fine.
I have narrowed the problem down to this line - it seems the call to NetworkInspect never returns, even after several hours.
I thought the problem might be a race condition, where the network was not fully up before the plugin tries to inspect it. However, inserting a delay before the call does not appear to help.
The logs do not provide any clues.
Because the Docker daemon stops responding, I am unfortunately not able to get a stack trace from it.
Please could you let me know how I might diagnose the problem further? I'm using up to date versions of Docker, Ubuntu and the kernel. The only complicating factor is that it's on an armv7l SBC 🙈
Many thanks