Bregor opened this issue 7 years ago
I'm able to see this as well with v2.12.0 after running for a few minutes:
><> kd get po | grep router
deis-router-1001573613-2rc22 1/1 Running 0 16m
><> kd exec deis-router-1001573613-2rc22 ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
router 25 0.0 0.1 34428 2952 ? Rs 16:01 0:00 ps faux
router 1 0.0 1.1 97376 23152 ? Ssl 15:46 0:00 /opt/router/sbin/router
router 7 0.0 0.0 4540 632 ? S 15:46 0:00 cat
router 14 0.0 0.1 28332 3868 ? S 15:46 0:00 nginx: master process /opt/router/sbin/nginx
router 23 0.0 0.2 28332 4156 ? S 15:48 0:00 \_ nginx: worker process
router 24 0.0 0.1 28332 2480 ? S 15:48 0:00 \_ nginx: worker process
router 16 0.0 0.0 0 0 ? Z 15:46 0:00 [nginx] <defunct>
router 19 0.0 0.0 0 0 ? Z 15:47 0:00 [nginx] <defunct>
router 22 0.0 0.0 0 0 ? Z 15:48 0:00 [nginx] <defunct>
Going to try downgrading and see if I can diagnose when this started to occur.
I was able to reproduce this using router versions v2.9.0, v2.10.0, v2.11.0, and the canary release. All of them showed zombie processes.
><> kd get po deis-router-1097387089-f6r4n -o yaml | grep canary | head -n 1
image: quay.io/deisci/router:canary
><> kd exec deis-router-1097387089-f6r4n ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
router 31 0.0 0.1 34428 2892 ? Rs 17:02 0:00 ps faux
router 1 0.1 1.4 162912 29220 ? Ssl 17:00 0:00 /opt/router/sbin/router
router 8 0.0 0.0 4540 656 ? S 17:00 0:00 cat
router 12 0.0 0.3 28312 6192 ? S 17:00 0:00 nginx: master process /opt/router/sbin/nginx
router 22 0.0 0.1 28312 2592 ? S 17:01 0:00 \_ nginx: worker process
router 23 0.0 0.1 28312 2592 ? S 17:01 0:00 \_ nginx: worker process
router 15 0.0 0.0 0 0 ? Z 17:00 0:00 [nginx] <defunct>
router 18 0.0 0.0 0 0 ? Z 17:00 0:00 [nginx] <defunct>
router 21 0.0 0.0 0 0 ? Z 17:01 0:00 [nginx] <defunct>
I wonder if this has to do with https://github.com/kubernetes/kubernetes/issues/39334, as @vdice mentioned; if that's the case, it should be resolved by upgrading to k8s v1.6.
@Bregor are you still seeing behavior like this on k8s clusters >= 1.5?
@vdice
$ kubectl version --short
Client Version: v1.6.2
Server Version: v1.5.7
...
_apt 1215 0.0 0.0 0 0 ? Z Apr14 0:00 [nginx] <defunct>
_apt 1241 0.0 0.0 0 0 ? Z Apr22 0:00 [nginx] <defunct>
_apt 2170 0.0 0.0 0 0 ? Z Apr14 0:00 [nginx] <defunct>
_apt 2550 0.0 0.0 0 0 ? Z Apr27 0:00 [nginx] <defunct>
_apt 3355 0.0 0.0 0 0 ? Z Apr28 0:00 [nginx] <defunct>
...
$ helm list
NAME REVISION UPDATED STATUS CHART NAMESPACE
deis-workflow 7 Fri Apr 7 18:59:53 2017 DEPLOYED workflow-v2.13.0 deis
@vdice same here with kubernetes-1.6.2 (both client and server)
I am also seeing this, with around 1879 nginx zombies in the router pod. I also grepped the logs for "Router configuration has changed in k8s" and it was logged 1879 times, so the zombie processes are produced during config reloads.
If you look at nginx/commands.go, the nginx server is reloaded by running "nginx -s reload" via the os/exec package with cmd.Start(), but there is no corresponding call to cmd.Wait(). So when the "nginx -s reload" process finishes, it is never reaped and becomes a zombie.
This is literally a one-line fix, so I'll create a PR; maybe it still gets merged, even though Deis Workflow is EOL.
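For illustration, here is a minimal Go sketch of the pattern described above and the corresponding fix. This is not the actual nginx/commands.go code; the function name reloadNginx is just an example:

package main

import (
	"log"
	"os/exec"
)

// reloadNginx sketches the reload path discussed above (hypothetical name,
// not the real router code). Calling only cmd.Start() spawns "nginx -s reload"
// and returns immediately; once that child exits, nothing reaps it, so it
// lingers as a <defunct> zombie. Adding cmd.Wait() reaps the child.
func reloadNginx() error {
	cmd := exec.Command("nginx", "-s", "reload")
	if err := cmd.Start(); err != nil {
		return err
	}
	return cmd.Wait() // the missing one-line fix
}

func main() {
	if err := reloadNginx(); err != nil {
		log.Printf("nginx reload failed: %v", err)
	}
}

cmd.Run() would be an equivalent single call here; the key point is that the parent must wait on the child so the kernel can release its process-table entry.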
This issue was resolved in teamhephy/router#6
Deis team: we had someone find this issue while searching, and it turned out to be their problem. They are still using Deis Workflow. Our advice to them was to upgrade to the latest Hephy Workflow; there is guidance on how to do that at www.teamhephy.com.
Do you think we should get someone to go through all of the open issues and mark them as closed (perhaps with a note to check github.com/teamhephy/workflow for follow-up if help is needed)?
I don't want to make extra work for anyone, but maybe there is a script for cleaning up EOL repos out there somewhere already...
Kubernetes:
Deis:
Router:
Zombies (from ps auxffww):