Having the same issue on Docker with the exact same version: the output of docker version is identical to that of @SeaLife.

Output of docker info:
Client:
 Debug Mode: false

Server:
 Containers: 42
  Running: 28
  Paused: 0
  Stopped: 14
 Images: 79
 Server Version: 19.03.12-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: la1w82p1e7x0bizt1so45eqkf
  Is Manager: true
  ClusterID: 8b8sa0fabj54tym6i9c9f2mo9
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 192.168.100.68
  Manager Addresses:
   192.168.100.68:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.8.5-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 38.85GiB
 Name: cln738
 ID: B537:FFGY:5SKJ:WUJB:TY6K:DFF6:FRZQ:A3N4:6YJ7:M7NI:BKSS:GNGQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  XXXXX
  XXXXX
  XXXXX
 Live Restore Enabled: false
We are using docker swarm to orchestrate our stack.
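For context, the deployment shape is an ordinary swarm stack, roughly like this (stack and file names are placeholders):

```sh
# Services are deployed as a swarm stack rather than via plain docker-compose
docker stack deploy --compose-file docker-compose.yml mystack
# List the stack's tasks with their current/desired state
docker stack ps mystack
```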
I've taken a look at journalctl. One batch of errors constantly recurs, shown below. At the beginning of the day there were some different errors:
Sep 01 08:53:07 cln738 dockerd[1836]: time="2020-09-01T08:53:07.900748348+02:00" level=warning msg="Health check for container 7d47ebc8d12edc790ae7c5487e805f381b27c7bf220bc25f4ce89a872fba693c error: context deadline exceeded"
After that, the errors settled into something like this:
Sep 01 10:53:53 cln738 dockerd[1836]: time="2020-09-01T10:53:53.496805141+02:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 01 10:53:53 cln738 dockerd[1836]: time="2020-09-01T10:53:53.496802896+02:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 01 10:53:53 cln738 dockerd[1836]: time="2020-09-01T10:53:53.501591839+02:00" level=warning msg="Health check for container 4ecddc486708fa3f5397606b8ba5f56173a24db95791abb41c556cff79546c67 error: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown"
Those errors did not occur before the update (at least not in such numbers: before the update I saw just a couple of them per day, when services were really failing; now I'm in the area of 1000+ of those messages in a matter of hours).
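For anyone wanting to gauge the rate on their own machine, a quick count via journalctl (assuming the standard docker.service unit name):

```sh
# Rough count of today's occurrences of the two recurring errors
journalctl -u docker.service --since today | grep -c "stream copy error"
journalctl -u docker.service --since today | grep -c "context deadline exceeded"
```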
Seeing this exact same thing as well, with the same exact Docker version as OP. docker rm -f is the only thing that'll bring down the unhealthy containers.
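A small sketch of that workaround, using standard docker CLI filters:

```sh
# List all containers whose health status has flipped to unhealthy
docker ps --filter health=unhealthy --format '{{.ID}} {{.Names}}'
# docker stop hangs on these, so force removal is the only way to bring them down
docker ps -q --filter health=unhealthy | xargs -r docker rm -f
```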
I'd add that "Unhealthy containers can't be stopped" is only a symptom of the greater issue: All my containers with health checks eventually go unhealthy, all around the same time. It seems that the Docker daemon loses the ability to exec into containers, or container networking otherwise drops (see @TheTaz25's note from the daemon logs).
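A simple probe for the exec symptom (container name is a placeholder; on a healthy daemon this returns immediately):

```sh
# If the shim handling this container is deadlocked, the exec never returns
# and the timeout fires instead
timeout 10 docker exec mycontainer true && echo "exec ok" || echo "exec hung or failed"
```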
Probably caused by this issue https://github.com/containerd/containerd/issues/4509
Just leaving a comment here that I am facing these issues as well since the latest updates of Manjaro (stable). Containers become unhealthy after some time and can no longer be stopped except with docker rm -f, as described by others.
Just for the record. Output of docker version:
Client:
 Version: 19.03.12-ce
 API version: 1.40
 Go version: go1.14.5
 Git commit: 48a66213fe
 Built: Sat Jul 18 01:33:21 2020
 OS/Arch: linux/amd64
 Experimental: false

Server:
 Engine:
  Version: 19.03.12-ce
  API version: 1.40 (minimum version 1.12)
  Go version: go1.14.5
  Git commit: 48a66213fe
  Built: Sat Jul 18 01:32:59 2020
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: v1.4.0.m
  GitCommit: 09814d48d50816305a8e6c1a4ae3e2bcc4ba725a.m
 runc:
  Version: 1.0.0-rc92
  GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683
And a short extract from my logs:
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.636985545+02:00" level=error msg="Error replicating health state for container 655dba7b94be5ba431a3cf6689bf1d0ce205e46dda26495631a4b0b065011015: open /var/lib/docker/containers/655dba7b94be5ba431a3cf6689bf1d0ce205e46dda26495631a4b0b065011015/.tmp-config.v2.json211938083: no such file or directory"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.704373858+02:00" level=warning msg="Health check for container f6473db4fc19198885df2db0d07d7ea81ab1baa0db9833501f819248886d8bd0 error: No such exec instance: 7a5e30513e76cc8b82cd7b2f4db1a8e11f59f2526dce68e41a1fd37909a3ae4a"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.704427249+02:00" level=error msg="Error replicating health state for container f6473db4fc19198885df2db0d07d7ea81ab1baa0db9833501f819248886d8bd0: open /var/lib/docker/containers/f6473db4fc19198885df2db0d07d7ea81ab1baa0db9833501f819248886d8bd0/.tmp-config.v2.json731252774: no such file or directory"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.747741474+02:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.747764768+02:00" level=error msg="stream copy error: reading from a closed fifo"
Sep 02 15:27:56 thanatos dockerd[1984]: time="2020-09-02T15:27:56.748730884+02:00" level=warning msg="Health check for container 53146d94797d66792cc4b406ad67dfb9f9c831fb51362f2c320fdacb1e84569c error: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown"
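The daemon's own record of the failing checks can be pulled from inspect; for instance, for the last container in the log above:

```sh
# Dump the health history (status, failing streak, and the last check results)
docker inspect --format '{{json .State.Health}}' 53146d94797d
```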
This may not be related to docker-ce itself. I received my latest docker-ce update at [2020-08-17T08:51:01+0200] [ALPM] upgraded docker (1:19.03.12-1 -> 1:19.03.12-2), but the unhealthy issues only showed up about yesterday.
See also https://forums.docker.com/t/docker-exec-gets-stuck-on-exit/98279
Could it be this one?
[2020-08-28T11:54:59+0200] [ALPM] upgraded containerd (1.3.4-2 -> 1.4.0-2)
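The pacman log makes it easy to line up the upgrade dates with when the symptoms started:

```sh
# Cross-check when docker and containerd were last upgraded
grep -E '\[ALPM\] upgraded (docker|containerd)' /var/log/pacman.log | tail -n 5
```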
@dequis You may want to read the existing comments first. See: https://github.com/docker/for-linux/issues/1097#issuecomment-685419370
Thanks to @razielgn for pointing out the containerd issue. I downgraded containerd to 1.3.4 (specifically 1.3.4-2 on Arch Linux) and confirmed that I no longer see this issue (about one day since downgrading, anyway).
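For anyone wanting to do the same, a sketch of the downgrade on Arch; the cached package filename is an assumption, so check your own cache directory:

```sh
# Reinstall containerd 1.3.4 from the local package cache
# (exact filename/extension may differ; see /var/cache/pacman/pkg)
sudo pacman -U /var/cache/pacman/pkg/containerd-1.3.4-2-x86_64.pkg.tar.zst
sudo systemctl restart containerd.service docker.service
# Optionally hold the package until a fixed build lands:
# add "IgnorePkg = containerd" to /etc/pacman.conf
```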
This is specifically related to how Docker is built on Arch. The containers become unhealthy because the healthchecks time out: the containerd shim (which is responsible for executing the healthcheck command) is deadlocked.
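A quick way to confirm which runtime builds are in play and to spot the stuck shims (standard CLI calls, nothing Arch-specific):

```sh
# Versions of the components involved
containerd --version
runc --version
# A deadlocked shim lingers as a containerd-shim process whose exec never exits
ps -ef | grep '[c]ontainerd-shim'
```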
Please follow the downstream bug: https://bugs.archlinux.org/task/67755?dev=25983
Thank you!
Expected behavior
The healthchecks of my containers keep working, and containers with a healthcheck that are in the "unhealthy" state stop properly when stopped.
Actual behavior
The healthchecks work fine for some time but start to time out later on. A container in an unhealthy state can't be stopped by docker stop. Forcefully removing the container works, but the container is not fully removed from the system (the container remains a member of its assigned networks, ...); a sketch for clearing such a stale endpoint follows below.
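A sketch of clearing such a leftover endpoint (network and container names are placeholders):

```sh
# Show which containers a network still believes are attached
docker network inspect my_net --format '{{range .Containers}}{{.Name}} {{end}}'
# Force-disconnect the stale member left behind by docker rm -f
docker network disconnect --force my_net dead_container
```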
Steps to reproduce the behavior
We're running on Arch Linux with the latest Docker version (see below) and the latest docker-compose version (see below). I attached a docker-compose stack we're encountering this issue with (especially the clamav service, as this service has a healthcheck; we removed all other custom healthchecks because of this issue).
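A minimal stand-in with the same shape as the attached stack (placeholder image and trivial check command; on an affected host any healthcheck eventually times out):

```sh
# Run a throwaway container with a trivial healthcheck
docker run -d --name hc-repro \
  --health-cmd 'true' \
  --health-interval 30s --health-timeout 10s --health-retries 3 \
  busybox sleep 3600
# Poll the status; after some hours it flips to "unhealthy" on affected hosts
docker inspect --format '{{.State.Health.Status}}' hc-repro
```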
Output of docker version:

Output of docker info:

Kernel: 5.8.5-arch1-1 #1 SMP PREEMPT Thu, 27 Aug 2020 18:53:02 +0000 x86_64 GNU/Linux
The docker-compose stack we used:

HINT: The clamav service comes with a built-in healthcheck, which fails after some hours on this system.

docker inspect nextcloud_clamav_1: https://pastebin.com/Qd0qRPFz (it's quite big)