deis / router

Edge router for Deis Workflow
https://deis.com
MIT License

Router makes zombies permanently #331

Open Bregor opened 7 years ago

Bregor commented 7 years ago

Kubernetes:

$ kubectl version --short
Client Version: v1.5.4
Server Version: v1.5.4

Deis:

$ helm list
NAME                REVISION    UPDATED                     STATUS      CHART                       NAMESPACE
deis-workflow       6           Thu Mar  9 12:41:00 2017    DEPLOYED    workflow-v2.12.0            deis

Router:

$ kubectl get deployment -n deis deis-router -o jsonpath='{.spec.template.spec.containers[0].image}'
quay.io/deis/router:v2.11.0

Zombies (from ps auxffww):

_apt     30939  0.1  0.0 566916 18104 ?        Ssl  Mar09   2:39      |   \_ /opt/router/sbin/router
_apt     30959  0.0  0.0   4540    80 ?        S    Mar09   0:00      |       \_ cat
_apt     30972  0.0  0.0  28580  5128 ?        S    Mar09   0:00      |       \_ nginx: master process /opt/router/sbin/nginx
_apt     21911  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21912  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21913  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21914  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21915  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21916  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21917  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21918  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21919  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21920  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21921  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     21922  0.0  0.0  28580  3136 ?        S    14:14   0:00      |       |   \_ nginx: worker process
_apt     30986  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt       557  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      3734  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      7037  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      9379  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt      5466  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt     12015  0.0  0.0      0     0 ?        Z    Mar09   0:00      |       \_ [nginx] <defunct>
_apt     18298  0.0  0.0      0     0 ?        Z    10:46   0:00      |       \_ [nginx] <defunct>
_apt     19393  0.0  0.0      0     0 ?        Z    10:47   0:00      |       \_ [nginx] <defunct>
_apt     22430  0.0  0.0      0     0 ?        Z    14:02   0:00      |       \_ [nginx] <defunct>
_apt     24104  0.0  0.0      0     0 ?        Z    14:03   0:00      |       \_ [nginx] <defunct>
_apt     24564  0.0  0.0      0     0 ?        Z    14:03   0:00      |       \_ [nginx] <defunct>
_apt     25887  0.0  0.0      0     0 ?        Z    14:04   0:00      |       \_ [nginx] <defunct>
_apt     21910  0.0  0.0      0     0 ?        Z    14:14   0:00      |       \_ [nginx] <defunct>

vdice commented 7 years ago

Related: https://github.com/kubernetes/kubernetes/issues/39334 and https://github.com/weaveworks/weave/issues/2836

bacongobbler commented 7 years ago

I'm able to see this as well with v2.12.0 after running for a few minutes:

><> kd get po | grep router
deis-router-1001573613-2rc22             1/1       Running   0          16m
><> kd exec deis-router-1001573613-2rc22 ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
router      25  0.0  0.1  34428  2952 ?        Rs   16:01   0:00 ps faux
router       1  0.0  1.1  97376 23152 ?        Ssl  15:46   0:00 /opt/router/sbin/router
router       7  0.0  0.0   4540   632 ?        S    15:46   0:00 cat
router      14  0.0  0.1  28332  3868 ?        S    15:46   0:00 nginx: master process /opt/router/sbin/nginx
router      23  0.0  0.2  28332  4156 ?        S    15:48   0:00  \_ nginx: worker process
router      24  0.0  0.1  28332  2480 ?        S    15:48   0:00  \_ nginx: worker process
router      16  0.0  0.0      0     0 ?        Z    15:46   0:00 [nginx] <defunct>
router      19  0.0  0.0      0     0 ?        Z    15:47   0:00 [nginx] <defunct>
router      22  0.0  0.0      0     0 ?        Z    15:48   0:00 [nginx] <defunct>

Going to try downgrading and see if I can diagnose when this started to occur.

bacongobbler commented 7 years ago

I was able to reproduce this using router versions v2.9.0, v2.10.0, v2.11.0, and the canary release. All of them showed zombie processes.

><> kd get po deis-router-1097387089-f6r4n -o yaml | grep canary | head -n 1
    image: quay.io/deisci/router:canary
><> kd exec deis-router-1097387089-f6r4n ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
router      31  0.0  0.1  34428  2892 ?        Rs   17:02   0:00 ps faux
router       1  0.1  1.4 162912 29220 ?        Ssl  17:00   0:00 /opt/router/sbin/router
router       8  0.0  0.0   4540   656 ?        S    17:00   0:00 cat
router      12  0.0  0.3  28312  6192 ?        S    17:00   0:00 nginx: master process /opt/router/sbin/nginx
router      22  0.0  0.1  28312  2592 ?        S    17:01   0:00  \_ nginx: worker process
router      23  0.0  0.1  28312  2592 ?        S    17:01   0:00  \_ nginx: worker process
router      15  0.0  0.0      0     0 ?        Z    17:00   0:00 [nginx] <defunct>
router      18  0.0  0.0      0     0 ?        Z    17:00   0:00 [nginx] <defunct>
router      21  0.0  0.0      0     0 ?        Z    17:01   0:00 [nginx] <defunct>

I wonder if this is related to https://github.com/kubernetes/kubernetes/issues/39334, as @vdice mentioned. If so, upgrading to k8s v1.6 should resolve it.

vdice commented 7 years ago

@Bregor are you still seeing behavior like this on k8s clusters >= 1.5?

Bregor commented 7 years ago

@vdice

$ kubectl version --short
Client Version: v1.6.2
Server Version: v1.5.7

...
_apt      1215  0.0  0.0      0     0 ?        Z    Apr14   0:00 [nginx] <defunct>
_apt      1241  0.0  0.0      0     0 ?        Z    Apr22   0:00 [nginx] <defunct>
_apt      2170  0.0  0.0      0     0 ?        Z    Apr14   0:00 [nginx] <defunct>
_apt      2550  0.0  0.0      0     0 ?        Z    Apr27   0:00 [nginx] <defunct>
_apt      3355  0.0  0.0      0     0 ?        Z    Apr28   0:00 [nginx] <defunct>
...

Bregor commented 7 years ago

$ helm list
NAME                REVISION    UPDATED                     STATUS      CHART                       NAMESPACE
deis-workflow       7           Fri Apr  7 18:59:53 2017    DEPLOYED    workflow-v2.13.0            deis

Bregor commented 7 years ago

@vdice same here with kubernetes-1.6.2 (both client and server)

felixbuenemann commented 6 years ago

I am also seeing this, with around 1879 nginx zombies in the router pod. I also grepped the logs for "Router configuration has changed in k8s" and it was logged 1879 times, so a zombie process is produced on every config reload.
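The comparison above (zombie count versus reload log count) can be sketched in Go roughly as follows. This is only an illustration: `countZombies` and `countReloads` are hypothetical helpers, the sample `ps` rows are taken from earlier in this thread, and the log line format is an assumption apart from the quoted message itself.

```go
package main

import (
	"fmt"
	"strings"
)

// reloadMessage is the log message quoted in the comment above.
const reloadMessage = "Router configuration has changed in k8s"

// countZombies counts rows in `ps aux`-style output whose STAT column
// (the 8th whitespace-separated field) starts with "Z".
func countZombies(psOutput string) int {
	n := 0
	for _, line := range strings.Split(psOutput, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 8 && strings.HasPrefix(fields[7], "Z") {
			n++
		}
	}
	return n
}

// countReloads counts log lines that contain reloadMessage.
func countReloads(logs string) int {
	n := 0
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, reloadMessage) {
			n++
		}
	}
	return n
}

func main() {
	// Rows copied from the `ps faux` output earlier in this thread.
	ps := `USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
router      14  0.0  0.1  28332  3868 ?        S    15:46   0:00 nginx: master process /opt/router/sbin/nginx
router      16  0.0  0.0      0     0 ?        Z    15:46   0:00 [nginx] <defunct>
router      19  0.0  0.0      0     0 ?        Z    15:47   0:00 [nginx] <defunct>`

	// Assumed log format; only the message text comes from the thread.
	logs := "INFO: Router configuration has changed in k8s\n" +
		"INFO: Router configuration has changed in k8s"

	fmt.Println("zombies:", countZombies(ps), "reloads:", countReloads(logs))
}
```

If the two counts track each other over time, that supports the reload path as the source of the zombies.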

If you look at nginx/commands.go, the nginx server is reloaded by running "nginx -s reload" via exec.Command() and cmd.Start(), but cmd.Wait() is never called. When the "nginx -s reload" command exits, the router never reaps it, so it stays in the process table as a zombie.

This is literally a one-line fix, so I'll create a PR. Maybe it will still get merged, even though Deis Workflow is EOL.
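For readers unfamiliar with the pattern, here is a minimal Go sketch of the kind of fix described above. It is not the actual code from nginx/commands.go or from the eventual PR: `startAndReap` is a hypothetical helper, and `true` stands in for "nginx -s reload" so the example runs on any Unix-like system.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startAndReap is a hypothetical helper: it starts a command and then waits
// for it to exit, so the kernel can release the child's process-table entry.
func startAndReap(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("starting %s: %w", name, err)
	}
	// Without this Wait, the exited child remains a zombie (<defunct>)
	// for as long as the parent process keeps running.
	return cmd.Wait()
}

func main() {
	// "true" is a stand-in for "nginx -s reload" (an assumption for the demo).
	if err := startAndReap("true"); err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Println("command exited and was reaped")
}
```

Calling cmd.Run() is equivalent to Start() followed by Wait(), so either form avoids leaving a defunct child behind.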

kingdonb commented 5 years ago

This issue was resolved in teamhephy/router#6

Deis team: someone found this issue while searching, and it turned out to be their problem; they were still using Deis Workflow. We advised them to upgrade to the latest Hephy Workflow. Guidance on how to do that is available at www.teamhephy.com.

Do you think we should get someone to go through all of the open issues and mark them as closed, perhaps with a note to check github.com/teamhephy/workflow for follow-up if help is needed?

I don't want to make extra work for anyone, but maybe there is a script for cleaning up EOL repos out there somewhere already...