docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

process exited, containerd task is stopped, but docker container is running #779

Open honkiko opened 5 years ago

honkiko commented 5 years ago
  1. kubelet was trying to kill the container every minute:

Sep 09 19:07:47 VM-0-241-ubuntu kubelet[2765]: I0909 19:07:47.989154 2765 kuberuntime_container.go:555] Killing container "docker://a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0" with 30 second grace period

  2. docker ps and docker inspect show this docker container as running

  3. linux ps shows the process is gone

  4. containerd shows the task is stopped:

root@VM-0-241-ubuntu:/home/ubuntu# docker-container-ctr --namespace moby --address /var/run/docker/containerd/docker-containerd.sock task ls |grep a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0
a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0 30639 STOPPED

  5. linux ps shows the docker-containerd-shim process is alive (checks 2-5 are combined into a single script below):

root@VM-0-241-ubuntu:/home/ubuntu# ps -elf|grep a9a1785b813 |grep -v grep
4 S root 30621 1 0 80 0 - 2686 futex_ Sep04 ? 00:01:36 docker-containerd-shim -namespace moby -workdir /opt/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0 -address /var/run/docker/containerd/docker-containerd.sock -containerd-binary /usr/bin/docker-containerd -runtime-root /var/run/docker/runtime-runc
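
For reference, checks 2-5 can be run in one go with a small script like this (the container ID, ctr binary name, and socket path are the ones from this report and will differ on other setups):

CID=a9a1785b81343c3ad2093ad973f4f8e52dbf54823b8bb089886c8356d4036fe0
# what dockerd believes
docker inspect -f 'dockerd: {{.State.Status}}' "$CID"
# what containerd believes about the task
docker-container-ctr --namespace moby --address /var/run/docker/containerd/docker-containerd.sock task ls | grep "$CID"
# what the kernel shows: only the containerd-shim process should be left
ps -elf | grep "$CID" | grep -v grep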

Expected behavior

docker reports this container exited/stopped

Actual behavior

docker reports this container running

Steps to reproduce the behavior

We have hit this problem several times with Docker 17.12 and Docker 18.06, but I haven't found a way to reproduce it.

Output of docker version:

(paste your output here)

Output of docker info:

(paste your output here)

Additional environment details (AWS, VirtualBox, physical, etc.)

honkiko commented 5 years ago

root@VM-0-241-ubuntu:/home/ubuntu# docker info
Containers: 36
 Running: 32
 Paused: 0
 Stopped: 4
Images: 80
Server Version: 18.06.3-ce
Storage Driver: aufs
 Root Dir: /opt/docker/aufs
 Backing Filesystem: extfs
 Dirs: 541
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e (expected: 468a545b9edcd5932818eb9de8e72413e616e86e)
runc version: f56b4cbeadc407e715d9b2ba49e62185bd81cef4 (expected: a592beb5bc4c4092b1b1bac971afed27687340c5)
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-104-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.29GiB
Name: VM-0-241-ubuntu
ID: FK7Y:WWLP:PQXJ:UAN7:BJV3:5CU6:D3T3:5B4G:BE7C:27NY:53JC:DPMM
Docker Root Dir: /opt/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://mirror.ccs.tencentyun.com/
Live Restore Enabled: true

thaJeztah commented 5 years ago

Are you able to reproduce on a current version of Docker? (19.03 or 18.09)

honkiko commented 5 years ago

I can't reproduce it on demand; I've just hit it several times with Docker 17.x and 18.06. We haven't moved to Docker 18.09 or 19.03 yet.

josh-ferrell-sas commented 4 years ago

I have seen this occur in 19.03.4

rtdarwin commented 4 years ago

I have the same issue with docker 18.04.

Here is the docker info, hope it helps.

Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 21
Server Version: 18.04
Storage Driver: aufs
 Root Dir: /app/docker/aufs
 Backing Filesystem: extfs
 Dirs: 52
 Dirperm1 Supported: true
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c (expected: 773c489c9c1b21a6d78b5c538cd395416ec50f88)
runc version: 6c55f98695e902427906eed2c799e566e3d3dfb5 (expected: )
init version: 949e6fa
Kernel Version: 3.10.33
Operating System: Buildroot 2017.02
OSType: linux
Architecture: armv7l
CPUs: 4
Total Memory: 978.6MiB
Name: -
ID: FSE3:VRVS:ZVCN:V2WV:FY4V:654A:GV7N:BYPY:EWZO:PGGX:HAX5:F6UN
Docker Root Dir: /app/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

rtdarwin commented 4 years ago

(Sorry for my poor English. Please tell me if anything in this comment is ambiguous.)

Hundreds of nodes in our production environment have this issue. I checked some of the nodes and went through the source code, and reached a few tentative conclusions:

  1. The pid1 process in the container died unexpectedly while it was running. This exit event (SIGCHLD) was detected and handled correctly by containerd-shim.
  2. Something probably went wrong when containerd-shim forwarded the exit event through the events service provided by containerd.
  3. dockerd still thinks the container is running, as if it never received the event.

Details:

Using the containerd-ctr shim state command to inspect the state of the container's pid1 process, I got status=3, exit_status=137. From the source code, containerd-shim only sets the process state when it receives a SIGCHLD signal. Since the process state did change, we know the exit event (SIGCHLD) was detected and handled correctly by containerd-shim.

The exited_at time reported by containerd-ctr is later than the StartedAt time reported by docker inspect. This shows that the pid1 process died unexpectedly while the container was running, not while it was starting.

By reviewing the code, I found that when the shim forwards events to containerd, it forks a containerd publish process to do so. That fork can fail when the system is under OOM pressure, and the shim does not try again.
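
As a rough sketch of that forwarding step (the topic name and exact flags are my assumption, not copied from the containerd source), the shim essentially forks the containerd binary once with a publish subcommand and gives up if that single attempt fails:

# rough sketch only: flags and topic are assumptions; /tmp/exit-event.pb stands in for
# the protobuf-encoded event the shim would normally write to the child's stdin
/usr/bin/docker-containerd --address /var/run/docker/containerd/docker-containerd.sock publish --namespace moby --topic /tasks/exit < /tmp/exit-event.pb
# if the fork/exec above fails (e.g. the node is out of memory), there is no retry,
# containerd never sees the exit event, and dockerd keeps the container marked as running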

docker ps builds its list only from the containers recorded in its in-memory store (container.memoryStore). Apart from the missing pid1 process, everything else in dockerd looks as if the container were still running, i.e. as if dockerd never received the exit event.


Chinese | 中文 (translated):

Several hundred machines in our production environment show this problem. Over the past couple of days I spot-checked some nodes and went through the source code, and reached a few tentative conclusions:

  1. The container's pid1 process died unexpectedly while running; containerd-shim detected this and handled it.
  2. My guess is that something went wrong when the shim forwarded the event through containerd's events service.
  3. dockerd still considers the container to be running.

Details:

Using the docker-containerd-ctr shim state command to talk to the shim directly and check the state of the pid1 process, ctr returns status=3, exit_status=137 for the container. This shows that containerd-shim already detected and handled the death of the container's pid1 process.

The exited_at time returned by ctr is clearly later than the StartedAt time listed by docker inspect, which shows the container died unexpectedly while running.

From the containerd/linux/shim/service.go source, the shim's Service pushes the event onto a channel after calling p.SetExited(), and the shim's forward goroutine pulls events off that channel in a for loop and Publishes them to containerd. If we trust the Go runtime, the shim has done its part: the event was handled and forwarded correctly.

One thing worth noting, though: the shim sends the event to containerd by running a containerd publish command. If the container's pid1 process was killed by the OOM killer, the system probably had very little memory available at that moment, and the command execution may have failed. The containerd publish invocation is not retried, so this is very likely where the event gets lost.

Looking at docker/daemon.Daemon.ProcessEvent(), I don't see any obvious logic that could abort event handling. Moreover, ProcessEvent() calls SetStopped() on the container right at the start to update its state, yet docker ps still shows the container as running. So it is most likely not that dockerd failed while handling the event, but that dockerd never received it at all.
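
Purely as a hypothetical illustration of the weak point described above (this is not what the shim actually does), a retried publish would look roughly like this:

# hypothetical mitigation sketch, not actual shim behaviour: retry the publish a few
# times with a short backoff instead of dropping the event after one failed fork/exec
publish_exit_event() {
    # stand-in for the single "containerd publish" invocation sketched earlier
    /usr/bin/docker-containerd --address /var/run/docker/containerd/docker-containerd.sock publish --namespace moby --topic /tasks/exit < /tmp/exit-event.pb
}
for attempt in 1 2 3 4 5; do
    publish_exit_event && break
    sleep "$attempt"
done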

rtdarwin commented 4 years ago

https://github.com/containerd/cri/issues/1132

events might not be reliable.

thaJeztah commented 4 years ago

Looks like you're running very old (and EOL) versions of both Docker (18.04) and containerd (1.0.2); neither is maintained any longer. Also, kernel 3.10.x is only supported / tested for the Red Hat kernel (which has a great number of patches backported from newer kernels).

fho commented 4 years ago

I think I've run into this issue with Docker 19.03.12-ce + docker-compose 1.27.0-rc2. I have a Docker container stuck in status unhealthy.

CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS                   PORTS                      NAMES
194c9be890cb        postgres:9.6.13-alpine   "docker-entrypoint.s…"   6 days ago          Up 3 hours (unhealthy)   127.0.0.1:5432->5432/tcp   workspace_postgres_1

Its process (postgresql) is not running anymore (not shown in ps).

Neither docker stop nor docker kill terminates it:

level=info msg="Container 194c9be890cbcf8a58d0fa8843ddace2d3fd0a3c9410d61f5933cb4261d40fbf failed to exit within 10 seconds of signal 15 - using the force"
level=info msg="Container 194c9be890cb failed to exit within 10 seconds of kill - trying direct SIGKILL"

The containerd-shim process is still running:

root       12217  6.5  0.0 712432 13968 ?        Sl   12:48  10:54 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/194c9be890cbcf8a58d0fa8843ddace2d3fd0a3c9410d61f5933cb4261d40fbf -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc

yxxhero commented 2 years ago

any updates?

OpenSourceAries commented 9 months ago

The same problem with docker-ce 18.09.9 and containerd 1.2.6

The docker container shows as running and the containerd-shim process is alive, but the containerd task is stopped. At this point, restarting the docker service (systemctl restart docker) gets stuck.

ctr -n moby t ls
TASK                                                                PID        STATUS
12a1338eefa4aa9b785289e4d044e7a0ab563f9b682b87e332a39400dbadc572    1784720    STOPPED

ps -ef | grep 12a1338eefa4aa9b785289e4d044e7a0ab563f9b682b87e332a39400dbadc572
root     1784702    1489  0 15:29 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/12a1338eefa4aa9b785289e4d044e7a0ab563f9b682b87e332a39400dbadc572 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc

docker ps
12a1338eefa4        minio:RELEASE.2021-08-20T18-32-01Z   "/entry.sh /bin/sh -…"   About an hour ago   Up About an hour                            miniocluster

Everything goes back to normal after killing the containerd-shim process (sometimes we also need to restart the docker service):

kill -9 1784702
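
Based on the checks above, a rough helper to find every container in this mismatched state (and the shim PID that would need to be killed) could look like this, purely as a sketch, assuming the moby namespace and full container IDs as shown above:

# list moby tasks that containerd reports as STOPPED while dockerd still reports the
# container as running, and show the matching containerd-shim process for each
ctr -n moby t ls | awk '$3 == "STOPPED" {print $1}' | while read -r id; do
    if [ "$(docker inspect -f '{{.State.Running}}' "$id" 2>/dev/null)" = "true" ]; then
        echo "mismatch: $id"
        ps -eo pid,args | grep "[c]ontainerd-shim" | grep "$id"
    fi
done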