Open honkiko opened 5 years ago
```
root@VM-0-241-ubuntu:/home/ubuntu# docker info
Containers: 36
 Running: 32
 Paused: 0
 Stopped: 4
Images: 80
Server Version: 18.06.3-ce
Storage Driver: aufs
 Root Dir: /opt/docker/aufs
 Backing Filesystem: extfs
 Dirs: 541
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e (expected: 468a545b9edcd5932818eb9de8e72413e616e86e)
runc version: f56b4cbeadc407e715d9b2ba49e62185bd81cef4 (expected: a592beb5bc4c4092b1b1bac971afed27687340c5)
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-104-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.29GiB
Name: VM-0-241-ubuntu
ID: FK7Y:WWLP:PQXJ:UAN7:BJV3:5CU6:D3T3:5B4G:BE7C:27NY:53JC:DPMM
Docker Root Dir: /opt/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://mirror.ccs.tencentyun.com/
Live Restore Enabled: true
```
Are you able to reproduce on a current version of Docker? (19.03 or 18.09)
I can't reproduce it on demand; I've just hit it several times with Docker 17.x and 18.06. We haven't moved to 18.09 or 19.03 yet.
I have seen this occur in 19.03.4
I have the same issue with docker 18.04.
Here is the docker info, hope it helps.
```
Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 21
Server Version: 18.04
Storage Driver: aufs
 Root Dir: /app/docker/aufs
 Backing Filesystem: extfs
 Dirs: 52
 Dirperm1 Supported: true
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c (expected: 773c489c9c1b21a6d78b5c538cd395416ec50f88)
runc version: 6c55f98695e902427906eed2c799e566e3d3dfb5 (expected: )
init version: 949e6fa
Kernel Version: 3.10.33
Operating System: Buildroot 2017.02
OSType: linux
Architecture: armv7l
CPUs: 4
Total Memory: 978.6MiB
Name: -
ID: FSE3:VRVS:ZVCN:V2WV:FY4V:654A:GV7N:BYPY:EWZO:PGGX:HAX5:F6UN
Docker Root Dir: /app/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```
Sorry for my poor English; please tell me if any wording in this comment is ambiguous.
Hundreds of nodes in our production environment have this issue. I checked some nodes, reviewed the source code, and reached a tentative conclusion:
- containerd-shim detected and handled the container's exit event (SIGCHLD);
- the container's pid1 process died while running, not during startup;
- the likely point of failure is the shim forwarding the exit event through the events service provided by containerd.

Details:
When using the `containerd-ctr shim state` command to investigate the container's pid1 state, I got status=3, exit_status=137 for the pid1 process. By reviewing the source code, I found that containerd-shim sets the process state only after it gets a SIGCHLD signal. Since the process state did change, we know the exit event (SIGCHLD) was detected and properly handled by containerd-shim.
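As a side note on those numbers: an exit status above 128 encodes death by signal (signal = status - 128), so 137 corresponds to SIGKILL (9), which is exactly what the kernel OOM killer delivers. A tiny sketch of that decoding (the helper name is mine, not part of containerd):

```go
package main

import "fmt"

// decodeExitStatus interprets an exit status the way the shell does:
// values above 128 mean "killed by signal (status - 128)".
// SIGKILL (9) is the signal the kernel OOM killer sends.
func decodeExitStatus(status int) (signal int, oomKillLikely bool) {
	if status > 128 {
		sig := status - 128
		return sig, sig == 9
	}
	return 0, false
}

func main() {
	sig, oom := decodeExitStatus(137) // the exit_status seen via `shim state`
	fmt.Printf("signal=%d oomKillLikely=%v\n", sig, oom)
	// prints: signal=9 oomKillLikely=true
}
```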
The exited_at timestamp reported by `containerd-ctr` is later than the StartedAt timestamp reported by `docker inspect`. This proves that the container's pid1 process died unexpectedly while running, rather than during startup.
By reviewing the code, I found that when the shim forwards events to containerd, it forks a `containerd publish` process to do so. That fork may fail when the system is out of memory, and the shim does not try again. Meanwhile, `docker ps` builds its container list purely from the containers recorded in dockerd's container.memoryStore. Apart from the nonexistent pid1 process, everything else in dockerd looks running, as if dockerd never received the exit event.
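If the root cause really is a failed, unretried `containerd publish`, the obvious mitigation is to retry the publish instead of dropping the event. A minimal sketch of that channel-draining/retry pattern; the `Publisher` interface, `forwardWithRetry`, and `flakyPublisher` names are my illustrative stand-ins, not containerd's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher abstracts the `containerd publish` step the shim forks for
// every event. The real shim execs a helper binary, and that fork can
// fail under memory pressure.
type Publisher interface {
	Publish(event string) error
}

// forwardWithRetry drains the shim's event channel and retries a failed
// publish a few times instead of silently dropping the event, which is
// the behaviour the analysis above suggests is missing.
func forwardWithRetry(events <-chan string, p Publisher, maxRetries int) (dropped []string) {
	for ev := range events {
		var err error
		for attempt := 0; attempt <= maxRetries; attempt++ {
			if err = p.Publish(ev); err == nil {
				break
			}
		}
		if err != nil {
			dropped = append(dropped, ev) // still lost after all retries
		}
	}
	return dropped
}

// flakyPublisher fails its first n calls, mimicking transient
// "cannot allocate memory" fork failures during an OOM episode.
type flakyPublisher struct{ failures int }

func (f *flakyPublisher) Publish(event string) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("fork/exec failed: cannot allocate memory")
	}
	return nil
}

func main() {
	events := make(chan string, 1)
	events <- "task-exit:pid1"
	close(events)
	dropped := forwardWithRetry(events, &flakyPublisher{failures: 2}, 3)
	fmt.Println("dropped events:", len(dropped))
	// prints: dropped events: 0
}
```

With maxRetries=0 the same scenario loses the event, which matches the observed "dockerd never got the exit event" symptom.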
Chinese | 中文 (translated):
Hundreds of machines in our production environment show this problem. Over the last two days I spot-checked some nodes and went through the source code, and reached a few tentative conclusions.
The detailed reasoning is as follows:
Using the `docker-containerd-ctr shim state` command to query the shim directly for the state of the pid1 process, ctr returns status=3, exit_status=137 for the container. This shows that containerd-shim has already detected and handled the death of the container's pid1 process.
The exit time exited_at returned by ctr is clearly later than the container start time StartedAt listed by `docker inspect`, which shows the container died unexpectedly while running.
From the containerd/linux/shim/service.go source, we can see that after the shim's Service calls p.SetExited(), it pushes the event into a channel, while the shim's forward goroutine pulls events from that channel in a for loop and Publishes them to containerd. If we trust the Go runtime, we can consider the shim's job done: the event was handled and forwarded correctly.
One point worth noting, though: the shim sends events to containerd via the `containerd publish` command. If the container's pid1 process died from OOM, the system very likely lacks enough free memory at that moment, so the command can fail to execute. Since the `containerd publish` command is not retried, this is very likely where the event gets lost.
Looking through docker/daemon.Daemon.ProcessEvent(), I found no obvious logic that could abort event handling. Moreover, ProcessEvent() calls SetStopped() on the container right at the start to update its state; since `docker ps` still shows the container as running, it is far more likely that dockerd never received the event at all than that dockerd failed while processing it.
https://github.com/containerd/cri/issues/1132
events might not be reliable.
Looks like you're running very old (and EOL) versions of both Docker (18.04) and containerd (1.0.2); neither is maintained any more. Also, kernel 3.10.x is only supported/tested for the Red Hat kernel (which has a great number of patches from newer kernels back-ported).
I think I ran into this issue with docker 19.03.12-ce + docker-compose 1.27.0-rc2. I have a container stuck in status unhealthy.
```
CONTAINER ID   IMAGE                    COMMAND                  CREATED      STATUS                   PORTS                      NAMES
194c9be890cb   postgres:9.6.13-alpine   "docker-entrypoint.s…"   6 days ago   Up 3 hours (unhealthy)   127.0.0.1:5432->5432/tcp   workspace_postgres_1
```
Its process (postgresql) is not running any more (not shown in `ps`). `docker stop` and `docker kill` do not terminate it:
```
level=info msg="Container 194c9be890cbcf8a58d0fa8843ddace2d3fd0a3c9410d61f5933cb4261d40fbf failed to exit within 10 seconds of signal 15 - using the force"
level=info msg="Container 194c9be890cb failed to exit within 10 seconds of kill - trying direct SIGKILL"
```
The containerd-shim process is still running:
```
root 12217 6.5 0.0 712432 13968 ? Sl 12:48 10:54 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/194c9be890cbcf8a58d0fa8843ddace2d3fd0a3c9410d61f5933cb4261d40fbf -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
```
any updates?
The same problem with docker-ce 18.09.9 and containerd 1.2.6: the Docker container shows as running and the containerd-shim process is alive, but the containerd task is stopped. In this state, restarting the Docker service (`systemctl restart docker`) gets stuck.
```
$ ctr -n moby t ls
TASK                                                                PID        STATUS
12a1338eefa4aa9b785289e4d044e7a0ab563f9b682b87e332a39400dbadc572    1784720    STOPPED

$ ps -ef | grep 12a1338eefa4aa9b785289e4d044e7a0ab563f9b682b87e332a39400dbadc572
root 1784702 1489 0 15:29 ? 00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/12a1338eefa4aa9b785289e4d044e7a0ab563f9b682b87e332a39400dbadc572 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
```
```
$ docker ps
12a1338eefa4   minio:RELEASE.2021-08-20T18-32-01Z   "/entry.sh /bin/sh -…"   About an hour ago   Up About an hour   miniocluster
```
Everything goes back to normal after killing the containerd-shim process (sometimes we also need to restart the Docker service):

```
kill -9 1784702
```
kubelet was trying to kill the container every minute:

```
Sep 09 19:07:47 VM-0-241-ubuntu kubelet[2765]: I0909 19:07:47.989154 2765 kuberuntime_container.go:555] Killing container "docker://a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0" with 30 second grace period
```
- `docker ps` and `docker inspect` show this container as running
- Linux `ps` shows the process is gone
- containerd shows the task is stopped
```
root@VM-0-241-ubuntu:/home/ubuntu# docker-containerd-ctr --namespace moby --address /var/run/docker/containerd/docker-containerd.sock task ls | grep a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0
a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0    30639    STOPPED
```
Linux `ps` shows the docker-containerd-shim process is alive:
```
root@VM-0-241-ubuntu:/home/ubuntu# ps -elf | grep a9a1785b813 | grep -v grep
4 S root 30621 1 0 80 0 - 2686 futex_ Sep04 ? 00:01:36 docker-containerd-shim -namespace moby -workdir /opt/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
```
Expected behavior
docker reports this container exited/stopped
Actual behavior
docker reports this container running
Steps to reproduce the behavior
We have met this problem several times with Docker 17.12 and 18.06, but I haven't found a way to reproduce it.
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.)