dragonflyoss / Dragonfly

This repository has been archived and moved to the new repository https://github.com/dragonflyoss/Dragonfly2.
https://d7y.io
Apache License 2.0

Dfdaemon calls dfget and zombie processes occur #997

Open ftlynx opened 5 years ago

ftlynx commented 5 years ago

Question

After docker pull, dfget zombie processes appear. Deployed with the dragonflyoss/dfclient:0.4.3 image.

[root@host-192-168-55-118 logs]# ps -ef | grep dfget
root 8717 18498 0 13:58 ? 00:00:00 [dfget]
root 8725 18498 0 13:58 ? 00:00:00 [dfget]
root 8727 18498 0 13:58 ? 00:00:00 [dfget]
root 8738 18498 0 13:58 ? 00:00:00 [dfget]
root 18198 22070 0 14:20 pts/1 00:00:00 grep --color=auto dfget
root 18910 18498 0 13:02 ? 00:00:00 [dfget]
root 18917 18498 0 13:02 ? 00:00:00 [dfget]
root 18926 18498 0 13:02 ? 00:00:00 [dfget]
root 18933 18498 0 13:02 ? 00:00:00 [dfget]
root 21832 18498 0 13:10 ? 00:00:00 [dfget]

zhouhaibing089 commented 5 years ago

https://github.com/dragonflyoss/Dragonfly/blob/e51ade6ca36f097018bf7a035aa2f7a233d65e50/dfdaemon/downloader/dfget/dfget.go#L110

This command does not have any timeout configured. We probably should call exec.CommandContext instead.

zcc35357949 commented 5 years ago

The dfget server process is created by the dfget process through StartPeerServerProcess. At the beginning and end of this function, checkPeerServerExist is invoked to check whether the dfget HTTP server is ready. Under concurrency, however, the dfget server may be started several times, producing several dfget server processes. Only one of them can listen on the peer port; the others exit with "address already in use" before their parent dfget process finishes. Those processes become defunct (zombie) processes.

When no dfget server process exists, pulling a multi-layer image reproduces the problem.

root     14626 14599  2 21:00 pts/1    00:00:00 /data/app/src/github.com/dragonflyoss/Dragonfly/cmd/dfdaemon/dfget -u https://hub.bilibili.co/v2/zhouchencheng/airflow/blobs/sha256:8f601293b2d86141c418eae05f224e0188ed5cb39336d52ac09cfd5556f076cc -o /data/docker/.small-dragonfly/dfdaemon/data/a6c3be83-9dc6-41e9-80a7-a547d91c2677 --home /data/docker/.small-dragonfly --dfdaemon -s 200MB --totallimit 200MB --node 172.16.38.93 --header User-Agent:docker/18.06.3-ce go/go1.10.3 git-commit/d7080c1 kernel/4.9.0-0.bpo.5-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.06.3-ce \(linux\)) --header Authorization:Bearer U27TkYtfTiSiMajMSG25zul11PLoCVrxGcjaUwGLrmFjIyGEescHRxg2oDL1zRh4dhlysIWZBXF_Mk-e_0lmy3Y1YFhi0OmMqO2TVSmULY_M50q3vRClbkpLKRCokNESUewj7TyOEGiXBaUuCWuI --header X-Forwarded-For:127.0.0.1 --insecure --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt
root     14628 14599  2 21:00 pts/1    00:00:00 /data/app/src/github.com/dragonflyoss/Dragonfly/cmd/dfdaemon/dfget -u https://hub.bilibili.co/v2/zhouchencheng/airflow/blobs/sha256:b4aa2612cd306f180973ea3b6e0c151d6e7c3b0f45b0bcb7dcc8f705fdd1ec6f -o /data/docker/.small-dragonfly/dfdaemon/data/2708adfc-2606-4fe5-bdb4-c16ce19c0269 --home /data/docker/.small-dragonfly --dfdaemon -s 200MB --totallimit 200MB --node 172.16.38.93 --header Authorization:Bearer 1PLoCVrxGcjKIP6EaUwGLrmFjIyGEescHRxg2oDL1zRh4dhlysIWZBXF_Mk-e_0lmy3Y1YFhi0OmMwytMHDuo6AQkXzN6MuvIjYqO2TVSmULY_M50q3vRClbkpLKRCokNESUewj7TyOEGiXBaUuCWuI --header X-Forwarded-For:127.0.0.1 --header User-Agent:docker/18.06.3-ce go/go1.10.3 git-commit/d7080c1 kernel/4.9.0-0.bpo.5-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.06.3-ce \(linux\)) --insecure --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt
root     14633 14599  2 21:00 pts/1    00:00:00 /data/app/src/github.com/dragonflyoss/Dragonfly/cmd/dfdaemon/dfget -u https://hub.bilibili.co/v2/zhouchencheng/airflow/blobs/sha256:df634dfeea0efe695a3fc05109e0e9c2b9d2296560cf2e442e77114300cd2cab -o /data/docker/.small-dragonfly/dfdaemon/data/6041561f-348c-48aa-a024-e8708d3bca5e --home /data/docker/.small-dragonfly --dfdaemon -s 200MB --totallimit 200MB --node 172.16.38.93 --header X-Forwarded-For:127.0.0.1 --header User-Agent:docker/18.06.3-ce go/go1.10.3 git-commit/d7080c1 kernel/4.9.0-0.bpo.5-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.06.3-ce \(linux\)) --header Authorization:Bearer DL1zRh4dhlysIWZBXF_Mk-e_0lmy3Y1YFhi0OmMwytMHDuo6AQkXzN6MuvIjYqO2TVSmULY_M50q3vRClbkpLKRCokNESUewj7TyOEGiXBaUuCWuI --insecure --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt
root     14644 14599 16 21:00 pts/1    00:00:00 /data/app/src/github.com/dragonflyoss/Dragonfly/cmd/dfdaemon/dfget -u https://hub.bilibili.co/v2/zhouchencheng/airflow/blobs/sha256:cc1a78bfd46becbfc3abb8a74d9a70a0e0dc7a5809bbd12e814f9382db003707 -o /data/docker/.small-dragonfly/dfdaemon/data/532aa339-914c-4395-99c7-9da77481f2ec --home /data/docker/.small-dragonfly --dfdaemon -s 200MB --totallimit 200MB --node 172.16.38.93 --header X-Forwarded-For:127.0.0.1 --header User-Agent:docker/18.06.3-ce go/go1.10.3 git-commit/d7080c1 kernel/4.9.0-0.bpo.5-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.06.3-ce \(linux\)) --header Authorization:Bearer o6AQkXzbkpLKRCokNESUewj7TyOEGiXBaUuCWuI --insecure --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt --cacerts /etc/docker/certs.d/hub.bilibili.co/ca.crt
root     14659 14628  2 21:00 pts/1    00:00:00 /data/app/src/github.com/dragonflyoss/Dragonfly/cmd/dfdaemon/dfget server --ip 172.16.38.93 --port 0 --meta /data/docker/.small-dragonfly/meta/host.meta --data /data/docker/.small-dragonfly/data --home /data/docker/.small-dragonfly --expiretime 3m0s --alivetime 5m0s
root     14665 14626  2 21:00 pts/1    00:00:00 [dfget] <defunct>
root     14667 14633  2 21:00 pts/1    00:00:00 [dfget] <defunct>
root     14685 14644  2 21:00 pts/1    00:00:00 [dfget] <defunct>
root     14721  5908  0 21:00 pts/3    00:00:00 /bin/grep --color=auto dfget
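One way to close a check-then-start race like the one described above is to hold an exclusive file lock around the check and the start, so that only one caller can be in that section at a time. The sketch below assumes Linux and uses flock; the names withStartupLock, countStarts, and demoLockPath and the lock file name are hypothetical, and the "peer server" is simulated with a flag rather than a real process.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"syscall"
)

// withStartupLock runs fn while holding an exclusive flock on lockPath.
// Each caller opens its own descriptor, so callers block one another.
func withStartupLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close() // closing the descriptor releases the lock

	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	return fn()
}

// countStarts simulates n concurrent callers that each check whether the
// server is running and start it if not. It returns the number of starts.
func countStarts(lockPath string, n int) int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards the variables only, not the check-then-start
		running bool
		starts  int
	)
	isRunning := func() bool { mu.Lock(); defer mu.Unlock(); return running }
	markStarted := func() { mu.Lock(); defer mu.Unlock(); running = true; starts++ }

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = withStartupLock(lockPath, func() error {
				if isRunning() {
					return nil // another caller already started it
				}
				markStarted() // stands in for spawning the server
				return nil
			})
		}()
	}
	wg.Wait()
	return starts
}

// demoLockPath returns a hypothetical lock file location.
func demoLockPath() string {
	return filepath.Join(os.TempDir(), "dfget-peer-server-demo.lock")
}

func main() {
	fmt.Println("starts:", countStarts(demoLockPath(), 8))
}
```

Because flock works across processes, the same pattern would apply to separate dfget invocations sharing one lock file, which an in-process mutex could not cover.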

@zhouhaibing089 @Starnop

zhouhaibing089 commented 5 years ago

Nice findings!

zhouhaibing089 commented 5 years ago

I'm also curious whether dfget should wait for the exit status of the spawned dfget server processes. (If not, how could such defunct processes occur?) I assume those child processes always live longer than their parent.

zcc35357949 commented 5 years ago

> I'm also curious whether dfget should wait for the exit status of the spawned dfget server processes. (If not, how could such defunct processes occur?) I assume those child processes always live longer than their parent.

If the dfget server process does not exist, the dfget process creates it, and it stays alive for longer than alivetime. That period is usually much longer than the lifetime of the dfget process. Once the dfget process exits, the dfget server process is reparented to PID 1.
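For the other case raised in this exchange, where the parent is still alive when a spawned child exits, the child stays defunct until the parent collects its exit status. A minimal sketch of starting a child and reaping it in the background follows; startAndReap and exitCode are hypothetical names, and `sh -c "exit 3"` stands in for a dfget server that exits early.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// startAndReap starts a child without blocking and waits for it in a
// goroutine. Wait collects the exit status, so the child does not linger
// as a defunct process while the parent keeps running.
func startAndReap(name string, args ...string) (<-chan error, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	done := make(chan error, 1) // buffered so the goroutine never blocks
	go func() { done <- cmd.Wait() }()
	return done, nil
}

// exitCode extracts the exit status from the error returned by Wait.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

func main() {
	done, err := startAndReap("sh", "-c", "exit 3")
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("child exit code:", exitCode(<-done))
}
```

If the parent exits first, the child is reparented and reaping becomes the job of its new parent, so this sketch does not cover that case.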

zhouhaibing089 commented 5 years ago

/reopen

leopoldxx commented 4 years ago

I encountered the same issue with the 1.0.0 release.


JiaDong007 commented 4 years ago

I used the 1.0.0 Docker image and encountered the same problem. Is this related to the size of the image? Can I change any parameters to work around it?

AmazingSkyLine commented 3 years ago

I encountered a similar problem with the 1.0.6 Docker image on k8s. When I use the dfget command to download files, the dfget server exits after the configured alivetime, but it becomes a zombie process.

You can reproduce the problem with these steps:

  1. dfget -u <some-file> --node <supernode-ip>:8002 -p p2p --totallimit 10G --locallimit 10G --alivetime 3s --expiretime 1s
  2. Wait 3s for the dfget server to exit.
  3. Run top; a zombie [dfget] process appears.
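As an alternative to checking top by hand in step 3, zombie processes can be listed by reading the state field from /proc/<pid>/stat, where Z marks a zombie. This is a rough Linux-only sketch; parseStat is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseStat extracts the command name and state from a /proc/<pid>/stat
// line of the form "pid (comm) state ...". The command name may itself
// contain spaces or parentheses, so the last ')' is used as the delimiter.
func parseStat(line string) (comm string, state byte, ok bool) {
	open := strings.IndexByte(line, '(')
	cls := strings.LastIndexByte(line, ')')
	if open < 0 || cls < open || cls+2 >= len(line) {
		return "", 0, false
	}
	return line[open+1 : cls], line[cls+2], true
}

func main() {
	paths, _ := filepath.Glob("/proc/[0-9]*/stat")
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process exited between Glob and ReadFile
		}
		if comm, state, ok := parseStat(string(data)); ok && state == 'Z' {
			pid := filepath.Base(filepath.Dir(p))
			fmt.Printf("zombie: pid=%s comm=%s\n", pid, comm)
		}
	}
}
```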

The following are the dfserver logs with the default 5m alivetime and 3m expiretime:

cat /root/.small-dragonfly/logs/dfserver.log

2021-04-19 03:06:06.184 INFO sign:35-1618801566.184 : ********************
2021-04-19 03:06:06.184 INFO sign:35-1618801566.184 : start peer server...
2021-04-19 03:06:06.190 INFO sign:35-1618801566.184 : start peer server success, host:<ip>, port:61005
2021-04-19 03:06:06.190 INFO sign:35-1618801566.184 : monitor peer server whether is alive, aliveTime:5m0s
2021-04-19 03:06:06.190 INFO sign:35-1618801566.184 : start server gc, expireTime:3m0s
2021-04-19 03:06:06.191 INFO sign:35-1618801566.184 : update total limit to 8589934592
2021-04-19 03:06:49.576 INFO sign:35-1618801566.184 : update total limit to 8589934592
2021-04-19 03:09:51.203 INFO sign:35-1618801566.184 : server gc, delete file:/root/.small-dragonfly/data/<some-file>.service
2021-04-19 03:09:51.204 INFO sign:35-1618801566.184 : server gc, delete file:/root/.small-dragonfly/data/<some-file>.service
2021-04-19 03:11:49.651 INFO sign:35-1618801566.184 : no more task, peer server will stop...
2021-04-19 03:11:49.651 INFO sign:35-1618801566.184 : peer server is shutdown.