docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

Ubuntu 18.04.1 - docker stack deploy / rm / network rm fails with 'network xxxxx not found' in 1-node Swarm #531

Open leojonathanoh opened 5 years ago

leojonathanoh commented 5 years ago

Expected behavior

A stack should rm and deploy cleanly

$ docker stack rm my-stack
Removing service my-stack_web
Removing service my-stack_php
Removing service my-stack_db
Removing service my-stack_pma
Removing config nginx_config
Removing config php_config
Removing config db_config
Removing config pma_config
Removing network my-stack_php-network
Removing network my-stack_db-network
Removing network my-stack_db-maintenance-network

$ docker stack services my-stack
Nothing found in stack: my-stack

$ docker stack deploy -c docker-stack.yml my-stack
Creating network my-stack_php-network
Creating network my-stack_db-network
Creating network my-stack_db-maintenance-network
Creating config nginx_config
Creating config php_config
Creating config db_config
Creating config pma_config
Creating service my-stack_web
Creating service my-stack_php
Creating service my-stack_db
Creating service my-stack_pma

$ docker stack services my-stack
ID                  NAME                                                  MODE                REPLICAS            IMAGE                                                                              PORTS
tif6kr90czqo        my-stack_web                   replicated          1/1                 nginx:latest
w4bimkqxze0v        my-stack_php                   replicated          1/1                 php:latest
cs1vkaumhp0e        my-stack_db                   replicated          1/1                 mysql:latest
luez0i5fst54        my-stack_pma                   replicated          1/1                 phpmyadmin/phpmyadmin:latest

Actual behavior

Both docker stack rm and a subsequent docker stack deploy of the same stack name fail with the error 'network xxxxx not found', so the stack cannot be deployed or updated.

Command line:

$ docker stack rm my-stack
Removing service my-stack_web
Removing service my-stack_php
Removing service my-stack_db
Removing service my-stack_pma
Removing config nginx_config
Removing config php_config
Removing config db_config
Removing config pma_config
Removing network my-stack_php-network
Removing network my-stack_db-network
Removing network my-stack_db-maintenance-network
Failed to remove network j1u3lx3xr81hdxbz4twbxggdp: Error response from daemon: network j1u3lx3xr81hdxbz4twbxggdp not foundFailed to remove some resources from stack: my-stack

$ docker stack services my-stack
Nothing found in stack: my-stack

$ docker network ls --filter id=j1u3lx3xr81hdxbz4twbxggdp
NETWORK ID          NAME                                                     DRIVER              SCOPE
j1u3lx3xr81h        my-stack_db-maintenance-network   overlay             swarm

$ docker network rm j1u3lx3xr81hdxbz4twbxggdp
Error: No such network: j1u3lx3xr81h

$ tail -f /var/log/syslog
....
Dec 20 12:10:37 my-docker-server dockerd[1264]: time="2018-12-20T12:10:37.484994813+08:00" level=error msg="Handler for DELETE /v1.39/networks/j1u3lx3xr81h returned error: network j1u3lx3xr81hdxbz4twbxggdp not found"
....

$ docker network inspect j1u3lx3xr81hdxbz4twbxggdp
[
    {
        "Name": "my-stack_db-maintenance-network",
        "Id": "j1u3lx3xr81hdxbz4twbxggdp",
        "Created": "2018-12-20T12:00:12.563895952+08:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.28.0/24",
                    "Gateway": "10.0.28.1"
                }
            ]
        },
        "Internal": true,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "lb-my-stack_db-maintenance-network": {
                "Name": "my-stack_db-maintenance-network-endpoint",
                "EndpointID": "f5fcb9b4ad9db2e63c045e7902397ab0cb49af817d4b4a5b61a198aa83f3f6c5",
                "MacAddress": "02:42:0a:00:1c:06",
                "IPv4Address": "10.0.28.6/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4125"
        },
        "Labels": {
            "com.docker.stack.namespace": "my-stack"
        },
        "Peers": [
            {
                "Name": "4682f8a32651",
                "IP": "10.10.10.1"
            }
        ]
    }
]

$ docker stack deploy -c docker-stack.yml my-stack
Creating network my-stack_php-network
Creating network my-stack_db-network
Creating config nginx_config
Creating config php_config
Creating config db_config
Creating config pma_config
Creating service my-stack_web
Creating service my-stack_php
Creating service my-stack_db
Creating service my-stack_pma
failed to create service my-stack_pma: Error response from daemon: network my-stack_db-maintenance-network not found

$ docker stack services my-stack
ID                  NAME                                                  MODE                REPLICAS            IMAGE                                                                              PORTS
tif6kr90czqo        my-stack_web                   replicated          0/1                 nginx:latest

Explanation

As shown above, removing the stack my-stack produces the error Failed to remove network j1u3lx3xr81hdxbz4twbxggdp. The network is still listed by docker network ls under the name my-stack_db-maintenance-network, but docker network rm j1u3lx3xr81hdxbz4twbxggdp fails with No such network, and the Docker daemon logs the same 'not found' error in /var/log/syslog. Any subsequent deployment of the stack my-stack fails with the same error, so the stack cannot be created or updated cleanly.

One workaround is to restart the Docker daemon, after which the 'missing' network goes away. This, however, is not viable for production systems.

Another workaround is to use a different stack name; in this case, use the stack name my-stack-2 and the command docker stack deploy -c docker-stack.yml my-stack-2. This redeploys the entire stack under a different name, which recreates the networks from scratch, namespaced by the new stack name. However, such an approach amounts to deploying a new stack instead of updating an existing one. It also implies that the deployment process (CD) must be able to detect when a deployment fails, which should be the job of the orchestration system.
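The different-name workaround could be scripted in a CD job roughly as below. This is only a sketch: the function name deploy_with_fallback is hypothetical, and it assumes that docker stack deploy exits non-zero when a service fails to be created, as in the failed deployment shown above.

```shell
# deploy_with_fallback COMPOSE_FILE STACK FALLBACK_STACK
# Hypothetical helper: try the usual stack name first and fall back to a
# second name only if the first deployment reports an error.
# Assumption: `docker stack deploy` returns a non-zero exit status when a
# service cannot be created.
deploy_with_fallback() {
    if docker stack deploy -c "$1" "$2"; then
        echo "deployed as $2"
    elif docker stack deploy -c "$1" "$3"; then
        echo "deployed as $3 (fallback)"
    else
        echo "deployment failed under both names" >&2
        return 1
    fi
}
```

Usage would be deploy_with_fallback docker-stack.yml my-stack my-stack-2. Note that the sketch does not clean up whatever the failed first attempt created, so the partially deployed my-stack would still need to be removed separately.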

Consequences

Because of this behaviour, a stack cannot be updated completely: sometimes some of the other services that don't use that 'missing' network are updated, but sometimes none of them are.

The phantom or residue network does not disappear even after many weeks.
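To check for such a leftover network after docker stack rm, something like the following might help. It is a minimal sketch that relies on the com.docker.stack.namespace label visible in the docker network inspect output above; the function names are hypothetical.

```shell
# list_stack_networks STACK
# Prints "ID NAME" for every network still labelled as belonging to STACK.
# The label is the one shown under "Labels" in the inspect output above.
list_stack_networks() {
    docker network ls \
        --filter "label=com.docker.stack.namespace=$1" \
        --format '{{.ID}} {{.Name}}'
}

# stack_has_residue STACK
# Returns 0 if at least one labelled network is still listed, 1 otherwise.
stack_has_residue() {
    [ -n "$(list_stack_networks "$1")" ]
}
```

For example, running stack_has_residue my-stack && list_stack_networks my-stack after a removal would print any networks that are still listed for the stack.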

Steps to reproduce the behavior

EDIT: One way to reproduce this is to remove an existing stack and redeploy it.

The docker-stack.yml is as follows:

version: '3.7'
services:

  web:
    image: nginx:latest
    configs:
      - source: nginx_config
        target: /etc/nginx/nginx.conf
        mode: 0440
    networks:
      - proxy-network
      - php-network
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 20s

  php:
    image: php:latest
    configs:
      - source: php_config
        target: /usr/local/etc/php/conf.d/custom.php.ini
        mode: 0440
    volumes:
      - type: bind
        source: /data/php/sessions/
        target: /sessions/
    networks:
      - php-network
      - db-network
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 20s

  db:
    image: mysql:latest
    environment:
      - MYSQL_ROOT_PASSWORD=root
      - MYSQL_DATABASE=db
      - MYSQL_USER=user
      - MYSQL_PASSWORD=password
      - MYSQL_ROOT_HOST=%
    configs:
      - source: mysql_config
        target: /etc/mysql/conf.d/my.cnf
        mode: 0440
    volumes:
      - type: bind
        source: /data/mysql/
        target: /var/lib/mysql/
      - type: bind
        source: /logs/mysql/
        target: /var/log/mysql/
    networks:
      - db-network
      - db-maintenance-network
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 20s
      placement:
        constraints:
          - node.role == manager
          - node.hostname == ${HOSTNAME:?err}

  pma:
    image: phpmyadmin/phpmyadmin:latest
    environment:
      - PMA_HOST=db
    configs:
      - source: pma_config
        target: /etc/phpmyadmin/config.user.inc.php
        mode: 0440
    volumes:
      - type: bind
        source: /data/pma/sessions/
        target: /sessions/
    networks:
      - proxy-network
      - db-maintenance-network
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 20s

configs:
  nginx_config:
    name: nginx_config_v1
    file: ./config/nginx/nginx.conf
  php_config:
    name: php_config_v1
    file: ./config/php/conf.d/custom.php.ini
  mysql_config:
    name: mysql_config_v1
    file: ./config/mysql/conf.d/my.cnf
  pma_config:
    name: pma_config_v1
    file: ./config/phpmyadmin/config.user.inc.php

networks:
  proxy-network:
    external: true
  php-network:
    driver: overlay
  db-network:
    internal: true
    driver: overlay
  db-maintenance-network:
    internal: true
    driver: overlay

In this case, the pma service might be removed, but the 'residue' of its my-stack_db-maintenance-network is left.

Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:49:01 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 23
 Running: 3
 Paused: 0
 Stopped: 20
Images: 78
Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 8xl7x49wzyqr2zer18cargr90
 Is Manager: true
 ClusterID: degwhh2jleuc2o3s6r90be9jn
 Managers: 1
 Nodes: 1
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.10.10.1
 Manager Addresses:
  10.10.10.1:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-39-generic
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 2.139GiB
Name: my-docker-server
ID: 2UHY:QOLS:SEUU:5NUI:CE7E:UPKI:TS3P:YZTM:TBVT:XGSD:MQS6:MV53
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.)

Additional Investigation

When does the behavior not occur?

When does the behavior occur?

cjdcordeiro commented 5 years ago

yep I can reproduce this as well

leojonathanoh commented 5 years ago

@cjdcordeiro how did you reproduce this? What is your environment?

leojonathanoh commented 5 years ago

Just an update: since I opened this issue, I have kept experiencing the very same problem. It seems to be directly related to using docker stack rm, more so than to updating a stack. If you only ever use docker stack deploy and never docker stack rm, the problem might never happen.

From my recent experience on the command line, docker stack rm (as in the reproduction steps above) removes most resources. When it reaches the networks, it removes a few, then hangs for about 10-20 seconds before finally showing the message Failed to remove network: network xxxxx not found. For the next 20-30 seconds the Docker daemon itself hangs: any autocompletion that requires listing Docker objects, such as docker service update <TAB>, blocks for that duration until the daemon finally responds, emitting a bell. I did not test whether commands unrelated to Swarm, such as docker ps, were also blocked during the hang, because every time it happened my command line was already hung.

Notably, when using the Portainer Web UI to remove a stack, Portainer shows a blank screen with a red error message at the top right: 'Unable to communicate with endpoint'. Portainer then stops working for about the same duration (20-30 seconds), consistent with the hang on the command line.

cjdcordeiro commented 5 years ago

@leojonathanoh exactly as you've described: simply doing a docker stack deploy and then a docker stack rm. The error message appears, and if I then try to re-deploy a stack with the same name, Docker tries to pick up that same network because it is still listed, but it is somehow broken.

leojonathanoh commented 5 years ago

@cjdcordeiro Just curious, what were the specs of your stack? E.g. number of services, networks.

The issue seems to only happen for a stack with at least 2 services, and at least 2 networks.

cjdcordeiro commented 5 years ago

Yes, I had about 6 or 7 services and 3 networks

leojonathanoh commented 5 years ago

@cjdcordeiro thanks for the info

Hopefully others who experience the same issue can share their stack specs, so we can narrow down the scope of the issue and hopefully get a bug fix.

arkodg commented 5 years ago

@leojonathanoh and @cjdcordeiro, do you folks see this issue with the latest Docker CE version, 19.03.0-rc2? There have been some fixes in this area related to stale lb-endpoints (load balancer endpoints) in the last few months.

bhupeshkothari commented 5 years ago

Yes, this is still occurring on Docker version 19.03.1, build 74b1e89e8a

qbasicer commented 5 years ago

I see this with:

Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        6247962
 Built:             Tue Feb 26 23:52:23 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       6247962
  Built:            Wed Feb 13 00:24:14 2019
  OS/Arch:          linux/amd64
  Experimental:     false

On stack rm:

Removing network update_platform
Failed to remove network o3pomemc42ivv4dwg5gzp2zgv: Error response from daemon: network o3pomemc42ivv4dwg5gzp2zgv not foundFailed to remove some resources from stack: update

I have 8 networks defined per docker network ls

If I run the stack rm command a few times with sleeps between them, the state is properly restored (e.g. docker stack rm update; sleep 5; docker stack rm update; sleep 5; docker stack rm update; sleep 5; docker stack rm update; sleep 5; docker stack deploy update --compose-file docker-compose.yml)
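The repeated rm-and-sleep sequence above could be wrapped in a loop that stops as soon as no network for the stack is listed any more. This is a sketch only: the function name, the default of 4 attempts and the 5-second delay are illustrative, and it assumes the stack's networks carry the com.docker.stack.namespace label.

```shell
# rm_stack_with_retries STACK [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: repeat `docker stack rm` with a pause between attempts
# until no network labelled for STACK is listed, or ATTEMPTS is exhausted.
rm_stack_with_retries() {
    stack=$1
    attempts=${2:-4}
    delay=${3:-5}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # A failing rm is expected here, so its exit status is ignored.
        docker stack rm "$stack" || true
        sleep "$delay"
        remaining=$(docker network ls -q \
            --filter "label=com.docker.stack.namespace=$stack")
        if [ -z "$remaining" ]; then
            return 0
        fi
        i=$((i + 1))
    done
    echo "networks still listed for $stack after $attempts attempts" >&2
    return 1
}
```

With this, the sequence above would become rm_stack_with_retries update && docker stack deploy update --compose-file docker-compose.yml, so the redeploy only runs once the networks are gone.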

leojonathanoh commented 5 years ago

It's been some time, so I'll share a simple workaround I've been using that works 100% of the time for removing a stack.

Create a temporary docker-stack.yml:

$ cat docker-stack.yml 
version: '3.7'
services:

  tmp:
    image: alpine
    entrypoint: /bin/sh
    command:
      - -c
      - 'sleep 1000000000'

Then deploy over the current stack my-stack:

docker stack deploy -c docker-stack.yml my-stack --prune

That works 100% of the time in removing everything in the original stack my-stack. After that, you can safely remove the stack, which now contains only the tmp service:

$ docker stack rm my-stack

Hope it works for anyone out there.
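For convenience, the two steps above could be combined into one function. This is a sketch rather than a tested tool: the function name is hypothetical, and the compose file is written to a path from mktemp instead of docker-stack.yml so that an existing file is not overwritten.

```shell
# prune_then_rm_stack STACK
# Hypothetical helper: deploy a single placeholder service over STACK with
# --prune (which removes the services no longer in the compose file), then
# remove the stack itself.
prune_then_rm_stack() {
    stack=$1
    tmpfile=$(mktemp) || return 1
    cat > "$tmpfile" <<'EOF'
version: '3.7'
services:
  tmp:
    image: alpine
    entrypoint: /bin/sh
    command:
      - -c
      - 'sleep 1000000000'
EOF
    if docker stack deploy --prune -c "$tmpfile" "$stack" &&
        docker stack rm "$stack"; then
        status=0
    else
        status=1
    fi
    rm -f "$tmpfile"
    return "$status"
}
```

It would be called as prune_then_rm_stack my-stack.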

ghost commented 4 years ago

I have reproduced this behaviour on 2 different swarm clusters. For info see my comment on a similar issue to this that has been open on the moby repo since 2016

kklepper commented 4 years ago

See https://github.com/portainer/portainer/issues/2352#issuecomment-670587634

thosil commented 2 years ago

I got this issue with containers that previously belonged to a stack and were still running, although their services were no longer listed by docker service ls.

When inspecting the network, they are listed under the "Containers" key.

Killing them was enough to let the network be deleted (docker ps | awk '/<stack name>_/{print $1}' | xargs docker kill)
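A variant of the one-liner above that selects containers by label instead of by name might look like this. It is a sketch under one assumption: that the leftover containers still carry the com.docker.stack.namespace label that stack deployments set. The function names are hypothetical.

```shell
# leftover_stack_containers STACK
# Prints the IDs of running containers still labelled with STACK's namespace.
leftover_stack_containers() {
    docker ps -q --filter "label=com.docker.stack.namespace=$1"
}

# kill_leftover_stack_containers STACK
# Kills every container found by leftover_stack_containers, if any.
kill_leftover_stack_containers() {
    ids=$(leftover_stack_containers "$1")
    if [ -z "$ids" ]; then
        echo "no leftover containers for $1"
        return 0
    fi
    # Word splitting is intended here: one container ID per argument.
    docker kill $ids
}
```

Running leftover_stack_containers my-stack first shows what would be killed before anything is touched.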

roskee commented 2 years ago

The same error occurred for me even with one network. It is exactly as @thosil said: after removing the stack, one service didn't shut down, and that kept the network from being deleted. It took more than 20 seconds to stop that service manually with docker stop, after which the network was removed automatically.
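Since the network only goes away some time after the last container stops, a script could poll for it instead of redeploying immediately. The following is a minimal sketch; the function name and the default timeout and interval are illustrative, and it treats a non-zero exit from docker network inspect as "network gone".

```shell
# wait_for_network_gone NETWORK [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Hypothetical helper: poll until NETWORK can no longer be inspected.
# Returns 0 once it is gone, 1 if it is still present after the timeout.
wait_for_network_gone() {
    network=$1
    timeout=${2:-60}
    interval=${3:-2}
    waited=0
    while docker network inspect "$network" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "$network still present after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

For example, wait_for_network_gone my-stack_db-maintenance-network 60 && docker stack deploy -c docker-stack.yml my-stack would only redeploy once the old network has disappeared.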