Unhealthy containers cant be stopped

SeaLife commented 4 years ago

[x] This is a bug report
[ ] This is a feature request
[x] I searched existing issues before opening this one

Expected behavior

The Healthcheck of my containers are working fine and containers with healthcheck in the state of "unhealthy" will stop properly when stopped.

Actual behavior

The healthchecks are working fine for some time but will start to timeout later on. A container in a unhealthy state can't be stopped by docker stop. Forcefully removing the container works but the container is not fully removed from the system (the container remains member of assigned networks, ...)

Steps to reproduce the behavior

We'r running on arch-linux with the latest docker version (see below) and the latest docker-compose version (see below). I attached a docker-compose stack we'r encountering this issue with (especially the clamav service, as this service got a healthcheck, we removed all other custom healthchecks due to this issue)

Output of docker version:

Client:
 Version:           19.03.12-ce
 API version:       1.40
 Go version:        go1.14.5
 Git commit:        48a66213fe
 Built:             Sat Jul 18 01:33:21 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.12-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.14.5
  Git commit:       48a66213fe
  Built:            Sat Jul 18 01:32:59 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.0.m
  GitCommit:        09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 11
  Running: 11
  Paused: 0
  Stopped: 0
 Images: 34
 Server Version: 19.03.12-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.8.5-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 62.6GiB
 Name: seoka.dev
 ID: O6GQ:737Q:5V2Y:DRKN:2Q2N:F25C:353H:RNF2:LUCV:U7Q3:6XZM:BTZV
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: seoka
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

What	Version
OS	Arch Linux
Kernel	`5.8.5-arch1-1 #1 SMP PREEMPT Thu, 27 Aug 2020 18:53:02 +0000 x86_64 GNU/Linux`
runc	1.0.0-rc92
containerd	v1.4.0.m 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
docker-compose	1.26.2

The Docker-Compose Stack we used:

version: '3.4'
services:
  nextcloud:
    image: nextcloud
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.nextcloud.rule=Host(`cloud.${HOSTNAME}`)"
      - "traefik.http.routers.nextcloud.entrypoints=https,http"
      - "traefik.http.routers.nextcloud.tls.certresolver=letsencrypt"
    networks:
      - web
      - internal
      - db
    tmpfs:
      - /tmp
    volumes:
      - "/srv/docker/nextcloud/data:/var/www/html/data"
      - "/srv/docker/nextcloud/config:/var/www/html/config"
      - "/srv/docker/nextcloud/custom_apps:/var/www/html/custom_apps"
      - "/srv/docker/nextcloud/themes:/var/www/html/themes"
      - "/dev/urandom:/dev/urandom:ro"
      - "/etc/localtime:/etc/localtime:ro"
    restart: unless-stopped

  clamav:
    image: mkodockx/docker-clamav
    labels:
      - "traefik.enable=false"
    networks:
      - internal
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
    restart: unless-stopped

  draw_io:
    image: jonberenguer/draw.io
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.drawio.rule=Host(`drawio.backend.${HOSTNAME}`)"
      - "traefik.http.routers.drawio.entrypoints=https,http"
      - "traefik.http.routers.drawio.tls.certresolver=letsencrypt"
      - "traefik.http.routers.drawio.middlewares=drawio"
      - "traefik.http.middlewares.drawio.addprefix.prefix=/draw.io/war"
      - "traefik.http.services.drawio.loadbalancer.server.port=80"
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
    hostname: drawio.backend.${HOSTNAME}
    networks:
      - web
      - internal
    restart: unless-stopped

  onlyoffice:
    image: onlyoffice/documentserver
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.onlyoffice.rule=Host(`onlyoffice.backend.${HOSTNAME}`)"
      - "traefik.http.routers.onlyoffice.entrypoints=https,http"
      - "traefik.http.routers.onlyoffice.tls.certresolver=letsencrypt"
      - "traefik.http.routers.onlyoffice.middlewares=sslheader"
      - "traefik.http.services.onlyoffice.loadbalancer.server.port=80"
      - "traefik.http.middlewares.sslheader.headers.customrequestheaders.X-Forwarded-Proto=https"
    networks:
      - web
      - internal
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
    restart: unless-stopped

  collabora:
    image: collabora/code
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.collabora.rule=Host(`collabora.backend.${HOSTNAME}`)"
      - "traefik.http.routers.collabora.entrypoints=https,http"
      - "traefik.http.routers.collabora.tls.certresolver=letsencrypt"
      - "traefik.http.routers.collabora.middlewares=sslheader"
      - "traefik.http.services.collabora.loadbalancer.server.port=9980"
      - "traefik.http.services.collabora.loadbalancer.sticky=true"
      - "traefik.http.services.collabora.loadbalancer.sticky.cookie.secure=true"
    networks:
      - web
      - internal
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
    restart: unless-stopped
    cap_add:
      - MKNOD
    environment:
      - DONT_GEN_SSL_CERT=true
      - domain=cloud\\.${HOSTNAME}
      - username=test
      - "password=XXXXXX"
      - server_name=collabora.backend.${HOSTNAME}
      - dictionaries=de_DE en_GB en_US
      - "extra_params=--o:ssl.enable=false --o:ssl.termination=true"

networks:
  internal:
    driver: bridge
  db:
    external:
      name: db
  web:
    external:
      name: web

HINT: The clamav service comes with a built-in healthcheck, which fails after some hours on this system.

docker inspect nextcloud_clamav_1:

https://pastebin.com/Qd0qRPFz (its quite big)

TheTaz25 commented 4 years ago

Having the same issue on docker with the exact same version:

Output of docker version is identical with that of @SeaLife

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 42
  Running: 28
  Paused: 0
  Stopped: 14
 Images: 79
 Server Version: 19.03.12-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: la1w82p1e7x0bizt1so45eqkf
  Is Manager: true
  ClusterID: 8b8sa0fabj54tym6i9c9f2mo9
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.100.68
  Manager Addresses:
   192.168.100.68:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.8.5-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 38.85GiB
 Name: cln738
 ID: B537:FFGY:5SKJ:WUJB:TY6K:DFF6:FRZQ:A3N4:6YJ7:M7NI:BKSS:GNGQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  XXXXX
  XXXXX
  XXXXX
 Live Restore Enabled: false

We are using docker swarm to orchestrate our stack.

I've took a look into journalctl. One error-batch that constantly reoccurs looks like the following: In the beginning of the day there were some different errors:

Sep 01 08:53:07 cln738 dockerd[1836]: time="2020-09-01T08:53:07.900748348+02:00" level=warning msg="Health check for container 7d47ebc8d12edc790ae7c5487e805f381b27c7bf220bc25f4ce89a872fba693c error: context deadline exceeded"

After that, the errors defaulted to something like this:

Sep 01 10:53:53 cln738 dockerd[1836]: time="2020-09-01T10:53:53.496805141+02:00" lev
el=error msg="stream copy error: reading from a closed fifo"
Sep 01 10:53:53 cln738 dockerd[1836]: time="2020-09-01T10:53:53.496802896+02:00" lev
el=error msg="stream copy error: reading from a closed fifo"
Sep 01 10:53:53 cln738 dockerd[1836]: time="2020-09-01T10:53:53.501591839+02:00" lev
el=warning msg="Health check for container 4ecddc486708fa3f5397606b8ba5f56173a24db95
791abb41c556cff79546c67 error: OCI runtime exec failed: exec failed: cannot exec a c
ontainer that has stopped: unknown"

Those errors did not occur before the update (At least not so many. Before the update I had just a couple of those per day, when services were really failing, now im in the area of 1000+ of those messages in a matter of hours).

kevin-hanselman commented 4 years ago

Seeing this exact same thing as well. Same exact docker version as OP. docker rm -f is the only thing that'll bring down the unhealthy containers.

kevin-hanselman commented 4 years ago

I'd add that "Unhealthy containers can't be stopped" is only a symptom of the greater issue: All my containers with health checks eventually go unhealthy, all around the same time. It seems that the Docker daemon loses the ability to exec into containers, or container networking otherwise drops (see @TheTaz25's note from the daemon logs).

razielgn commented 4 years ago

Probably caused by this issue https://github.com/containerd/containerd/issues/4509

alexanderschnitzler commented 4 years ago

Just leaving a comment here that I am facing those issues as well since the latest updates of manjaro (stable). Containers become unhealthy after some time and can no longer be stopped except with docker rm -f as described by others.

Just for the record. Output of docker version:

Client:
 Version:           19.03.12-ce
 API version:       1.40
 Go version:        go1.14.5
 Git commit:        48a66213fe
 Built:             Sat Jul 18 01:33:21 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.12-ce
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.14.5
  Git commit:       48a66213fe
  Built:            Sat Jul 18 01:32:59 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.0.m
  GitCommit:        09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

And a short extract from my logs:

Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.636985545+02:00" level=error msg="Error replicating health state for container 655dba7b94be5ba431a3cf6689bf1d0ce205e46dda26495631a4b0b065011015: open /var/lib/docker/containers/655dba7b94be5ba431a3cf6689bf1d0ce205e46dda26495631a4b0b065011015/.tmp-config.v2.json211938083: no such file or directory"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.704373858+02:00" level=warning msg="Health check for container f6473db4fc19198885df2db0d07d7ea81ab1baa0db9833501f819248886d8bd0 error: No such exec instance: 7a5e30513e76cc8b82cd7b2f4db1a8e11f59f2526dce68e41a1fd37909a3ae4a"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.704427249+02:00" level=error msg="Error replicating health state for container f6473db4fc19198885df2db0d07d7ea81ab1baa0db9833501f819248886d8bd0: open /var/lib/docker/containers/f6473db4fc19198885df2db0d07d7ea81ab1baa0db9833501f819248886d8bd0/.tmp-config.v2.json731252774: no such file or directory"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.747741474+02:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.747764768+02:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.748730884+02:00" level=warning msg="Health check for container 53146d94797d66792cc4b406ad67dfb9f9c831fb51362f2c320fdacb1e84569c error: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown"

infabo commented 4 years ago

May not be related to docker-ce itself.

My latest update of docker-ce I received at [2020-08-17T08:51:01+0200] [ALPM] upgraded docker (1:19.03.12-1 -> 1:19.03.12-2)

But the unhealty issues just arrived like about yesterday.

infabo commented 4 years ago

[2020-08-28T11:54:59+0200] [ALPM] upgraded containerd (1.3.4-2 -> 1.4.0-2) could it be?

SeaLife commented 4 years ago

@dequis You may read the existing comments first. See: https://github.com/docker/for-linux/issues/1097#issuecomment-685419370

kevin-hanselman commented 4 years ago

Thanks to @razielgn for pointing out the containerd issue. I downgraded containerd to 1.3.4 (specifically 1.3.4-2 on Arch Linux) and confirmed that I no longer see this issue (~1 day since downgrading, anyways).

cpuguy83 commented 4 years ago

This is specifically related to how Docker is built on Arch. The container is becoming unhealthy because the healthchecks are timing out because the containerd shim (responsible for executing the healthcheck command) is deadlocked.

Please follow the downstream bug: https://bugs.archlinux.org/task/67755?dev=25983

Thank you!

docker / for-linux