docker/cli

Docker stack deploy creates random tasks, and doesn't clean up, in drain mode #877

Open thaJeztah opened 6 years ago

thaJeztah commented 6 years ago

Description

If a node is in "drain" mode, tasks can be orphaned and are not cleaned up. I noticed this when using docker stack deploy, but there may be other ways to arrive at this situation.

Steps to reproduce the issue:

On a single-node swarm, put the node in availability drain:

docker node update --availability=drain $(docker node ls --quiet)
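
To confirm the node is now draining, its availability can be checked; a format template such as the one below (just one way to show it; the output line is illustrative) should report Drain:

docker node ls --format '{{.Hostname}}: {{.Availability}}'

linuxkit-025000000001: Drain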

Deploy a stack:

docker stack deploy -c- mystack <<EOF
version: "3.5"
services:
  web:
    image: nginx:alpine
    ports:
      - "80:80"
EOF

Watch the output of docker stack ps mystack:

docker stack ps mystack

ID                  NAME                          IMAGE               NODE                DESIRED STATE       CURRENT STATE                    ERROR                              PORTS
l642i9p5gs4c        mystack_web.1                 nginx:alpine                            Running             Pending less than a second ago   "no suitable node (1 node not …"   
0vqk2i4xdgeg        pnsu2xb5xavnw5q7vmabygz8y.1   nginx:alpine                            Remove              Pending 5 minutes ago            "no suitable node (1 node not …"   

Remove the stack:

docker stack rm mystack

Re-deploy the stack:

docker stack deploy -c- mystack <<EOF
version: "3.5"
services:
  web:
    image: nginx:alpine
    ports:
      - "80:80"
EOF

Notice that one more randomly named task is added:

docker stack ps mystack

ID                  NAME                          IMAGE               NODE                DESIRED STATE       CURRENT STATE                    ERROR                              PORTS
v99e6svzbw1l        mystack_web.1                 nginx:alpine                            Running             Pending less than a second ago   "no suitable node (1 node not …"   
l642i9p5gs4c        ma35f4bhbygnw7exbrkqw3oct.1   nginx:alpine                            Remove              Pending about a minute ago       "no suitable node (1 node not …"   
0vqk2i4xdgeg        pnsu2xb5xavnw5q7vmabygz8y.1   nginx:alpine                            Remove              Pending 7 minutes ago            "no suitable node (1 node not …"   
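
The same list, reduced to task name and desired state (the --format template here is just an illustration, trimmed to the relevant columns), shows what piles up:

docker stack ps --format '{{.Name}}: {{.DesiredState}}' mystack

mystack_web.1: Running
ma35f4bhbygnw7exbrkqw3oct.1: Remove
pnsu2xb5xavnw5q7vmabygz8y.1: Remove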

Describe the results you received:

Each re-deploy adds a new task with desired state Remove. These tasks are never removed as long as the node is in "drain" mode, even if the service is removed.

$ docker service ps mystack_web

ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE           ERROR                              PORTS
hhzfa05q7kqf        mystack_web.1       nginx:alpine                            Running             Pending 6 minutes ago   "no suitable node (1 node not …"   

Each task also belongs to a different service:

$ docker inspect --format '{{.ServiceID}}' $(docker stack ps -q mystack)
kgb2xubypxq0zqht51bvy9tlj
koxfo2b8ers4jpy3vebvcygxj
ma35f4bhbygnw7exbrkqw3oct
pnsu2xb5xavnw5q7vmabygz8y
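
At least two of these IDs match the randomly named tasks above (ma35f4bhbygnw7exbrkqw3oct and pnsu2xb5xavnw5q7vmabygz8y), so they presumably belong to services that have already been removed; inspecting one of them would be expected to fail with a "no such service" error:

docker service inspect --format '{{.Spec.Name}}' pnsu2xb5xavnw5q7vmabygz8y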

Describe the results you expected:

Tasks to be removed when the service or stack they belong to is removed.
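
Concretely, once the stack is removed, docker stack ps should come back empty, reporting roughly:

docker stack rm mystack
docker stack ps mystack

nothing found in stack: mystack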

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

Client:
 Version:   18.02.0-ce
 API version:   1.36
 Go version:    go1.9.3
 Git commit:    fc4de44
 Built: Wed Feb  7 21:13:05 2018
 OS/Arch:   darwin/amd64
 Experimental:  true
 Orchestrator:  swarm

Server:
 Engine:
  Version:  18.02.0-ce
  API version:  1.36 (minimum version 1.12)
  Go version:   go1.9.3
  Git commit:   fc4de44
  Built:    Wed Feb  7 21:20:15 2018
  OS/Arch:  linux/amd64
  Experimental: true

Output of docker info:

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 361
Server Version: 18.02.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: oifk2p0hd4tvlb62uf76womx0
 Is Manager: true
 ClusterID: emg7r0j0ou50nqgt8egutjdhl
 Managers: 1
 Nodes: 1
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.65.3
 Manager Addresses:
  192.168.65.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.75-linuxkit-aufs
Operating System: Docker for Mac
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952GiB
Name: linuxkit-025000000001
ID: NAGV:GKNJ:7XC7:YWGV:4JLV:3RWY:TJEQ:BSHI:CYHK:XHOH:E7W3:GSEY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 38
 Goroutines: 160
 System Time: 2018-02-14T11:17:22.172562767Z
 EventsListeners: 2
HTTP Proxy: docker.for.mac.http.internal:3128
HTTPS Proxy: docker.for.mac.http.internal:3129
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 http://localhost:5000/
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Docker for Mac

thaJeztah commented 6 years ago

ping @nishanttotla @tiborvass PTAL

odelreym commented 6 years ago

Same behaviour with 18.03.1-ce