docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

Stack deploys don't always finish #677

Open ablotim opened 5 years ago

ablotim commented 5 years ago

Expected behavior

When we deploy a Docker stack to our swarm, all service instances should update to the new image version.

Actual behavior

Quite often, containers keep running an older version of the image.

Steps to reproduce the behavior

Output of docker version:

Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: xises55c4rzac8c99uyfiujcx
 Is Manager: true
 ClusterID: 6yl2zzlsh5cbonkch06mjtl3l
 Managers: 3
 Nodes: 12
 Default Address Pool: 10.0.0.0/8  
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 2
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.100.16.44
 Manager Addresses:
  10.100.16.44:2377
  10.100.16.45:2377
  10.100.16.46:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-148-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.66GiB
Name: docker3044
ID: VVUC:N3EG:PBPH:2WV6:GZO3:MLLC:5RZP:OZBE:O24L:HIEY:Q5FR:RQFN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.)

Example of a service that often exhibits this behaviour:

  php-api:
    image: dockerreg.internal/product-php-api-prod:${CI_PIPELINE_ID}
    volumes:
      - /etc/hostname:/etc/hostname.docker
    environment:
      APPLICATION_MODE: production
      DOCKER_SERVICE: php-api-production
    deploy:
      replicas: 15
      update_config:
        parallelism: 1
        order: start-first
        delay: 10s
      resources:
        limits:
          cpus: "4"
          memory: "4G"
        reservations:
          cpus: "2"
          memory: "2G"
      placement:
        constraints:
          - node.labels.cores == 32

Example outputs of "docker service ps":

ID                  NAME                                          IMAGE                                         NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
2ke1p5ip3q40        product-api_nginx-api.1       dockerreg.internal/product-nginx-api-prod:40102   docker3025          Running             Running 3 minutes ago
pl6oqnvm9kbx         \_ product-api_nginx-api.1   dockerreg.internal/product-nginx-api-prod:40082   docker3037          Shutdown            Running 18 hours ago
pzw9813wq3sf        product-api_nginx-api.2       dockerreg.internal/product-nginx-api-prod:40102   docker3028          Running             Running 3 minutes ago

ID                  NAME                                             IMAGE                                            NODE                DESIRED STATE       CURRENT STATE             ERROR               PORTS
20is4kcrnp5s        product-api_nginx-intapi.1       dockerreg.internal/product-nginx-intapi-prod:40102   docker3036          Running             Preparing 3 minutes ago
w2e6bt1kfhet         \_ product-api_nginx-intapi.1   dockerreg.internal/product-nginx-intapi-prod:40082   docker3037          Running             Running 18 hours ago
ku0n11zupo2f        product-api_nginx-intapi.2       dockerreg.internal/product-nginx-intapi-prod:40102   docker3028          Running             Running 3 minutes ago

ID                  NAME                                        IMAGE                                       NODE                DESIRED STATE       CURRENT STATE           ERROR               PORTS
chfck83zrc97        product-api_php-api.1       dockerreg.internal/product-php-api-prod:40102   docker3016          Running             Running 3 minutes ago
6jqs5agvrsdl         \_ product-api_php-api.1   dockerreg.internal/product-php-api-prod:40082   docker3037          Shutdown            Running 18 hours ago
ym4fli3ihe6q        product-api_php-api.2       dockerreg.internal/product-php-api-prod:40102   docker3027          Running             Running 3 minutes ago

Also:

20is4kcrnp5s         \_ product-api_nginx-intapi.1   dockerreg.internal/product-nginx-intapi-prod:40102   docker3036          Running             Preparing 13 minutes ago                       
ylb2ef4fca7s         \_ product-api_nginx-backend.1   dockerreg.internal/product-nginx-backend-prod:40102   docker3036          Running             Starting 14 minutes ago                          
obwmutwhn55n         \_ product-api_nginx-intapi.1   dockerreg.internal/product-nginx-intapi-prod:40102   docker3036          Running             Starting 14 minutes ago                       
njb68j5h54p8         \_ product-api_php-api.1   dockerreg.internal/product-php-api-prod:40102   docker3036          Running             Starting 14 minutes ago                       
bxsc24irxs76         \_ product-api_php-backend.1   dockerreg.internal/product-php-backend-prod:40102   docker3036          Running             Preparing 14 minutes ago            

We noticed that the issues most often occur on the same handful of nodes, notably docker3036 and docker3037, but the only thing 'special' about these is that they're also running redis containers. When we restart dockerd, things always get back to normal, though we occasionally need to restart it twice.
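
For anyone trying to spot the same symptoms, here is a rough Python sketch that flags suspicious tasks in "docker service ps" output. It is only an illustration, not something from our deployment: the names FORMAT, service_ps and find_suspect_tasks are made up, it assumes the output was produced with the tab-separated --format string shown in the code, and it assumes every image reference ends in a ":tag" as ours do.

```python
import subprocess

# Assumed tab-separated layout; pass this to `docker service ps --format`.
FORMAT = "{{.Name}}\t{{.Image}}\t{{.Node}}\t{{.DesiredState}}\t{{.CurrentState}}"


def service_ps(service):
    """Run `docker service ps` for one service and return its raw output."""
    result = subprocess.run(
        ["docker", "service", "ps", "--format", FORMAT, service],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_suspect_tasks(ps_output, expected_tag):
    """Return (task name, node, reason) tuples for tasks that look stuck."""
    suspects = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, image, node, desired, current = line.split("\t")
        # Older tasks in a slot are printed with a "\_" prefix; drop it.
        name = name.replace("\\_", "").strip()
        # Ignore any "@sha256:..." digest, then take the text after the last ":".
        tag = image.split("@", 1)[0].rsplit(":", 1)[-1]
        # CURRENT STATE looks like "Running 3 minutes ago"; keep the first word.
        state = current.split()[0]
        if desired == "Shutdown" and state == "Running":
            suspects.append((name, node, "old task still running"))
        elif desired == "Running" and state != "Running":
            suspects.append((name, node, "not running: " + state))
        elif desired == "Running" and tag != expected_tag:
            suspects.append((name, node, "running old tag " + tag))
    return suspects
```

Calling find_suspect_tasks(service_ps("product-api_php-api"), "40102") on a manager node would list the tasks that are supposed to be shut down but still report Running, the ones sitting in Preparing or Starting, and the ones still on an older tag.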

tafelpootje commented 5 years ago

It seems the process running in the container (the php process) did get killed, but the container itself was not cleaned up correctly. The only workaround we have found so far is to restart dockerd on the host, which cleans up the zombie containers and returns the host to a somewhat stable state.