devplayer0 / docker-net-dhcp

Docker network driver for networking on a host bridge with DHCP-allocated IP addresses
GNU General Public License v3.0

Hang when containers set to start automatically #23

Open rufoa opened 2 years ago

rufoa commented 2 years ago

Thanks for this project - it is great. I have noticed a small problem however:

When a container uses this plugin, and its restart policy makes the container start automatically when Docker starts, this plugin appears to hang. This prevents the container from starting, and seems to block the Docker daemon from responding too. If I kill this plugin's process, Docker seems to recover (but the container obviously doesn't come up properly).

If the container is not set to start automatically, and I instead start it manually, everything works fine.

I have narrowed the problem down to this line - it seems the call to NetworkInspect never returns, even after several hours.

I thought the problem might be a race condition, where the network was not fully up before the plugin tries to inspect it. However, inserting a delay before the call does not appear to help.

The logs do not provide any clues.

Because the Docker daemon stops responding, I am unfortunately not able to get a stack trace from it.

Please could you let me know how I might diagnose the problem further? I'm using up-to-date versions of Docker, Ubuntu and the kernel. The only complicating factor is that it's on an armv7l SBC 🙈

Many thanks

crolves commented 2 years ago

I am seeing this exact same behavior on both a CentOS 7 install and an Ubuntu 21.04 install, both on amd64 hardware.

Ermodo commented 2 years ago

I am also facing this issue on Ubuntu 20.04 armv8 and Ubuntu 18.04 armv7.

tekgeek1205 commented 2 years ago

Me too. I'm new to Docker, and it took me a really long time and two reinstalls to realise this was the problem.

Edit: I was able to get containers with the driver to start on boot by setting the restart policy to On Failure and adding

(sleep 20
docker start bragi) &

to my rc.local
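A fixed sleep 20 can be too short on a slow boot and needlessly long on a fast one. As a hedged sketch of an alternative, the same rc.local trick can poll until the daemon answers. The wait_until helper is hypothetical, and using docker info as the readiness check is an assumption that nobody in this thread has tested.

```shell
#!/bin/sh
# wait_until ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND up to ATTEMPTS times, one second apart,
# and return 0 as soon as it succeeds, or 1 if it never does.
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Assumed usage in rc.local ("bragi" is the container from the comment above):
# (wait_until 60 docker info && docker start bragi) &
```

The container is only started once the check succeeds, and the subshell gives up after roughly 60 seconds instead of blocking the rest of rc.local.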

flywheelnz commented 2 years ago

I'm also having this problem. From a quick glance, it looks like the plugin tries to access the Docker socket before the daemon is up and running. There's no timeout on the call, and no way that I can see to defer starting the containers until Docker has finished initializing. 🤷

The logs below give a rough indication of the order of events: the container loads and hangs connecting to Docker, is interrupted (by me), and only then does the daemon start providing the routes the plugin relies on 🤦

DEBU[2021-12-08T14:55:22.411028372+13:00] Assigning addresses for endpoint portainer's interface on network home

^CINFO[2021-12-08T14:57:44.718633565+13:00] Processing signal 'interrupt'
DEBU[2021-12-08T14:57:44.719427052+13:00] Releasing addresses for endpoint portainer's interface on network home
ERRO[2021-12-08T14:57:44.728083546+13:00] failed to start container                     container=ac48752e9a5c199e3f372109b05701261fc659c46a0d5068b64fdaaae9c1f2ed error="failed to create endpoint portainer on network home: NetworkDriver.CreateEndpoint: failed to get network options: failed to get info from Docker: error during connect: Get \"http://%2Frun%2Fdocker.sock/v1.13.1/networks/7b2bec4e103227f40b09c0e8744c514682b6cfa92a751efa79dbee7a7f406546\": read unix @->/var/run/docker.sock: read: connection reset by peer"
INFO[2021-12-08T14:57:44.728203189+13:00] Loading containers: done.
INFO[2021-12-08T14:57:44.749048232+13:00] Docker daemon                                 commit=847da18 graphdriver(s)=overlay2 version=20.10.11
INFO[2021-12-08T14:57:44.750232282+13:00] Daemon has completed initialization
DEBU[2021-12-08T14:57:44.774791290+13:00] Registering routers

ghost commented 2 years ago

Same problem. I worked around it by starting the container independently, from systemd.

Not very stable though.

HackerBaloo commented 2 years ago

I love that this plugin lets my containers get an IP over DHCP, but I have also experienced this problem. When the Docker daemon hangs, reboot hangs too on my Ubuntu 20.04 machine, which is not fun if you are connected remotely.

Any idea on how to solve it?

sdjnmxd commented 1 year ago

Same problem. I came up with a hacky, ad hoc way to work around it for now. First, I gave up on using Docker's restart=always to start containers at boot, because it causes Docker to get stuck. Instead, I wrote a service file so that systemctl starts the containers. It looks like this:

[Unit]
Description=Docker Container
After=network-online.target docker.socket docker.service firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/root/docker-container
ExecStart=docker start ${DOCKER_CONTAINER_LIST}
ExecStop=docker stop ${DOCKER_CONTAINER_LIST}

[Install]
WantedBy=multi-user.target

This file is really hacky. The idea is to run docker start ${environment variable} to start the containers once Docker has started normally. My environment variable file lives at /root/docker-container, but you can put it elsewhere. It looks like this:

DOCKER_CONTAINER_LIST=centos nginx phpmyadmin ……

The downside is that Docker no longer manages the containers started at boot: whenever another container needs to start at boot, you have to append its name or ID to the environment variable file so that systemd starts it automatically.

My system is Rocky Linux release 8.6 with Docker 20.10.17. Other distributions may need different values for directives such as "After", "Wants" and "Requires", so adjust the service file if necessary.


NonaSuomy commented 1 year ago

Nov 08 00:55:14 docker001 docker[9228]: Error response from daemon: No such container: mosquitto portainer esphome frigate rtlamr2mqtt lms watchtower homeassistant
Nov 08 00:55:14 docker001 docker[9228]: Error: failed to start containers: mosquitto portainer esphome frigate rtlamr2mqtt lms watchtower homeassistant
Nov 08 00:55:14 docker001 systemd[1]: dockernetdhcp.service: Main process exited, code=exited, status=1/FAILURE
Nov 08 00:55:14 docker001 systemd[1]: dockernetdhcp.service: Failed with result 'exit-code'.
Nov 08 00:55:14 docker001 systemd[1]: Failed to start Docker Net DHCP Container Hang Fix.

Manual test: Seems to work fine.

[root@docker001 system]# docker start mosquitto portainer esphome frigate rtlamr2mqtt lms watchtower homeassistant
mosquitto
portainer
esphome
frigate
rtlamr2mqtt
lms
watchtower
homeassistant
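A possible explanation for why the unit fails while the manual command works, offered as an assumption rather than a confirmed diagnosis: the error names the whole list as a single container, which is what happens when the list reaches docker as one argument. Per systemd.service(5), ${VAR} in ExecStart always expands to exactly one argument, while $VAR as a separate word is split on whitespace. The shell analogue below shows the same difference using quoted and unquoted expansion. The count_args helper is hypothetical.

```shell
#!/bin/sh
# count_args prints how many arguments it received (illustrative helper).
count_args() {
    echo "$#"
}

list="mosquitto portainer esphome"

# Quoted: the whole list arrives as ONE argument,
# comparable to ${VAR} in a systemd ExecStart line.
one=$(count_args "$list")

# Unquoted: the list is split on whitespace into separate arguments,
# comparable to $VAR written as its own word in ExecStart.
many=$(count_args $list)

echo "quoted: $one argument(s), unquoted: $many argument(s)"
```

If that is indeed the cause, writing the unit's command as docker start $DOCKER_CONTAINER_LIST (without braces) should let systemd split the list into separate container names.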

Edit:

Got this working like this: /etc/systemd/system/dockernetdhcp.service

[Unit]
Description=Docker Net DHCP Container Hang Fix
After=network-online.target docker.socket docker.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/root/docker-container
ExecStart=/usr/bin/dockernetdhcpstart.sh
#ExecStop=/usr/bin/docker stop ${DOCKER_CONTAINER_LIST}
ExecStop=/usr/bin/dockernetdhcpstop.sh

[Install]
WantedBy=multi-user.target

/usr/bin/dockernetdhcpstart.sh

#!/bin/bash
# Start docker containers because of a glitch with netdhcp docker plugin.
docker start mosquitto portainer cloudflared esphome frigate rtlamr2mqtt lms watchtower homeassistant piper whisper openwakeword spoolman

/usr/bin/dockernetdhcpstop.sh

#!/bin/bash
# Stop docker containers because of a glitch with netdhcp docker plugin.
docker stop mosquitto portainer cloudflared esphome frigate rtlamr2mqtt lms watchtower homeassistant piper whisper openwakeword spoolman

NonaSuomy commented 1 year ago

@devplayer0 are you able to comment on the overall OP issue above?

Thank you.

crkinard commented 1 year ago

Getting this issue as well. Time to try some of these solutions.

EDIT: Well I got nothing. Looks like I'm going back to macvlan. Here is to hoping this gets fixed.

crkinard commented 1 year ago

Ended up just tweaking sdjnmxd's script a tad for docker-compose.

[Unit]
Description=Docker-Compose Up
After=network-online.target docker.socket docker.service firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=docker-compose -f /usr/sbin/docker-compose.yml up -d
ExecStop=docker-compose -f /usr/sbin/docker-compose.yml down

[Install]
WantedBy=multi-user.target

NonaSuomy commented 8 months ago

@devplayer0 any chance you can compile the above timeout fix for one last build?

I've been using this plugin for years and this is the only issue I have had. Would be nice to not have to work around it with startup scripts.

NonaSuomy commented 4 months ago

Same issues https://github.com/devplayer0/docker-net-dhcp/issues/2 https://github.com/devplayer0/docker-net-dhcp/issues/44

https://github.com/devplayer0/docker-net-dhcp/blob/03694af592d378ac7062d464dd2050e1e892d65a/pkg/plugin/plugin.go#L78

https://github.com/devplayer0/docker-net-dhcp/blob/1bb0ffe9f27531b8ec94a39788b1e7ccf6588ab0/pkg/plugin/plugin.go#L82

NonaSuomy commented 4 months ago

To manually apply PR https://github.com/devplayer0/docker-net-dhcp/pull/43 and compile and install the plugin locally, follow the steps below. This fixes this issue and issue https://github.com/devplayer0/docker-net-dhcp/issues/42 (the 1.13.1 error with newer versions of Docker). I hope this helps people like me who don't know what they are doing!

git clone https://github.com/devplayer0/docker-net-dhcp.git
cd docker-net-dhcp
git fetch origin pull/43/head:celerway
git checkout celerway
Switched to branch 'celerway'

git branch -a
* celerway
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/dependabot/go_modules/github.com/containerd/containerd-1.5.18
  remotes/origin/dependabot/go_modules/github.com/docker/docker-20.10.24incompatible
  remotes/origin/dependabot/go_modules/golang.org/x/net-0.7.0
  remotes/origin/dependabot/go_modules/golang.org/x/sys-0.1.0
  remotes/origin/master

make create

docker plugin ls
ID             NAME                                        DESCRIPTION                          ENABLED
############   ghcr.io/devplayer0/docker-net-dhcp:golang   Docker host bridge DHCP networking   false

sudo docker plugin enable ghcr.io/devplayer0/docker-net-dhcp:golang
docker plugin ls
ID             NAME                                        DESCRIPTION                          ENABLED
############   ghcr.io/devplayer0/docker-net-dhcp:golang   Docker host bridge DHCP networking   true

sudo docker network ls
NETWORK ID     NAME               DRIVER                                                   SCOPE
############   bridge             bridge                                                   local
############   config_default     bridge                                                   local
############   dbrv100            ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv200            ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv300            ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv350            ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local
############   dbrv400            ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64   local

This shows all the networks created with the old docker-net-dhcp plugin; we need to remove them :(

I first tried:
sudo docker network rm dbrv100
Error response from daemon: error while removing network: failed deleting Network: plugin "ghcr.io/devplayer0/docker-net-dhcp:release-linux-amd64" not found

So then I nuked them with
sudo docker network prune
WARNING! This will remove all custom networks not used by at least one container.
Are you sure you want to continue? [y/N] y
sudo docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
############   bridge    bridge    local
############   host      host      local
############   none      null      local

Then added them all back with the new compiled driver

sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv100 dbrv100
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv200 dbrv200
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv300 dbrv300
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv350 dbrv350
sudo docker network create -d ghcr.io/devplayer0/docker-net-dhcp:golang --ipam-driver null -o bridge=brv400 dbrv400

sudo docker network ls
NETWORK ID     NAME      DRIVER                                      SCOPE
############   bridge    bridge                                      local
############   dbrv100   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv200   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv300   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv350   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   dbrv400   ghcr.io/devplayer0/docker-net-dhcp:golang   local
############   host      host                                        local
############   none      null                                        local

Now they are all back in action, but all the containers still point to the old networks 😭

docker container inspect genericcontainer001

You will see all the old IDs for the old networks.
I tried doing
sudo docker network disconnect dbrv100 genericcontainer001
sudo docker network connect dbrv100 genericcontainer001

This changed the network, so I thought I had won!

Yet NetworkMode was still stuck on the old network, and the container would fail on startup.
sudo docker container start genericcontainer001
Error response from daemon: could not find a network matching network mode ####...: network ####... not found
Error: failed to start containers: genericcontainer001

So then I ran compose

docker-compose.yml
version: '3.9'

services:
  genericcontainer001:
    container_name: genericcontainer001
    hostname: genericcontainer001
    mac_address: de:ad:be:ef:00:01
    networks:
      - dhcp
networks:
  dhcp:
    #mac_address: de:ad:be:ef:00:01 #(for docker engine version 25)
    name: dbrv100
    external: true

docker compose up -d

It came back up with the right network without an issue yay!

Thanks to encbladexp in docker discord for helping me figure out the compile technique! Thanks to manjax in docker discord for telling me to use compose to bring the system up to fix the straggling networkmode issue.

Now I just need to get past docker compose / dockerd ignoring the configured mac_address, and then I will be up and running again. I previously had it working by using the moby build of dockerd, but something else must have broken there as well, because it now generates random MAC addresses. That breaks the static IPs that my dnsmasq DHCP server assigns by MAC address :(

Portainer is working on a fix in 2.20; the rest of Docker seems to have no fix for it so far.

totobo commented 1 month ago

I made a container image that includes the fix from PR https://github.com/devplayer0/docker-net-dhcp/pull/43. Anyone can use this image to install the plugin directly:

docker plugin install totobo/docker-net-dhcp:v0.1.4-pull43

However, in my tests, PR 43 only fixes the Docker daemon hanging when the container starts. The container that should start at boot still fails to start. Based on this, I think the plugin is defective while the Docker service is initializing after a server restart. The sequence is as follows:

  1. The docker service starts and begins to initialize.
  2. During initialization, docker will start all containers that need to be started automatically.
  3. When a container using this plugin starts, the plugin uses the Docker API to query network information. Because Docker has not finished initializing, the Docker API is not yet accessible, so the plugin hangs, waiting for the API to become accessible and return the required information. (PR 43 adds a timeout here to avoid waiting indefinitely.)

This produces a deadlock-like situation in which containers that use this plugin fail to start normally. I have a possible solution, but I don't have the Go development skills to implement it, so I hope the author or another PR can implement something similar: when a container starts, the plugin should not call the Docker API to obtain the container's network information, but should instead query it through the client library of the container runtime (containerd).
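The effect of putting a timeout on a blocking call can be illustrated from the shell. This is only a sketch of the general idea, not the Go change made in PR 43. The run_bounded helper is hypothetical, and it assumes the GNU coreutils timeout command, which exits with status 124 when the time limit is reached.

```shell
#!/bin/sh
# run_bounded SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, but give up after SECONDS instead of
# waiting forever. Assumes GNU coreutils `timeout` (status 124 on expiry).
run_bounded() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Assumed usage: bound a query that may hang while the daemon initializes
# ("my-dhcp-net" is a placeholder network name).
# run_bounded 10 docker network inspect my-dhcp-net
```

A bounded call turns an indefinite hang into a visible failure, which matches the behaviour described above: the daemon no longer gets stuck, but the container still does not start.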

totobo commented 1 month ago

I wrote a bash script and a custom systemd service that start the containers that use the DHCP plugin at boot time. The container list is obtained dynamically from Docker's container configuration files, so there is no need to maintain it manually. Before using this script, make sure the jq command is installed on the server.

It is best used with the plugin totobo/docker-net-dhcp:v0.1.4-pull43, which includes the changes from PR 43 and solves the problem of Docker getting stuck when the server starts.

cat /lib/systemd/system/docker-dhcp-container.service

[Unit]
Description=Docker DHCP Container
After=network-online.target docker.socket docker.service firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket containerd.service docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=bash -c 'docker start $(bash /data/scripts/dhcpcontainer.sh my-dhcp-net)'
ExecStop=bash -c 'docker stop $(bash /data/scripts/dhcpcontainer.sh my-dhcp-net)'

[Install]
WantedBy=multi-user.target

cat /data/scripts/dhcpcontainer.sh

#!/bin/bash
# Print the names of the containers attached to the given network that
# Docker would normally start at boot, read from Docker's on-disk metadata.
network=$1
container_path="/var/lib/docker/containers"

# IDs of all containers attached to the requested network.
container_ids=$(jq -r --arg network "$network" 'select(.NetworkSettings.Networks[$network]) | .ID' "$container_path"/*/config.v2.json)

names=""
for id in $container_ids
do
    restart_policy=$(jq -r '.RestartPolicy.Name' "$container_path/$id/hostconfig.json")
    manually_stopped=$(jq -r '.HasBeenManuallyStopped' "$container_path/$id/config.v2.json")
    # An unless-stopped container that was stopped by hand must stay stopped.
    if [[ "$restart_policy" == "unless-stopped" && "$manually_stopped" == "true" ]]
    then
        continue
    fi
    # Any restart policy other than "no" means the container should start at boot.
    if [[ "$restart_policy" != "no" ]]
    then
        # Names are stored with a leading slash; strip it.
        name=$(jq -r '.Name' "$container_path/$id/config.v2.json" | sed 's#/##g')
        names+="$name "
    fi
done

echo "$names"

NonaSuomy commented 3 weeks ago

Try working with the current free claude.ai 3.5 Sonnet to code what you need. Its HumanEval score is currently 92.7%, so it may help you figure out how it needs to go together where you said you were lacking the skill.

I don't like using bash scripting to work around an issue that lies in the plugin itself.