fluent / fluent-operator

Operate Fluent Bit and Fluentd in the Kubernetes way - Previously known as FluentBit Operator
Apache License 2.0

fluent-operator process zombie #183

Open KevinLiangX opened 2 years ago

KevinLiangX commented 2 years ago

If the fluent-operator process gets into a zombie state, it cannot recover by itself. Could a liveness probe or some kind of heartbeat check be added?

wenchajun commented 2 years ago

This problem doesn't usually occur. Can you show me the logs?

KevinLiangX commented 2 years ago

Hello, this is just one of our DFX test cases.

1. Find the fluent-operator Docker container:

    [root@k8s-4 ~]# docker ps |grep fluentbit-operator
    892bf193483f rvm:5100/kubesphere/fluentbit-operator "/manager" 2 days ago Up 2 days k8s_fluentbit-operator_fluentbit-operator-85855568c6-6ng9f_kubesphere-logging-system_89dfe4d6-a84a-463e-980d-48c88801fe37_0
    5c72dc88ab71 rvm:5100/fitcontainer/pause:3.2 "/pause" 2 days ago Up 2 days k8s_POD_fluentbit-operator-85855568c6-6ng9f_kubesphere-logging-system_89dfe4d6-a84a-463e-980d-48c88801fe37_0

2. Check the process running in the container:

    [root@k8s-4 ~]# docker top 892bf193483f
    UID PID PPID C STIME TTY TIME CMD
    65532 14995 14978 0 Dec13 ? 00:04:25 /manager
    [root@k8s-4 ~]# ps aux |grep 14995
    65532 14995 0.1 0.0 743776 52756 ? Ssl Dec13 4:25 /manager
    root 36638 0.0 0.0 112716 960 pts/0 S+ 01:31 0:00 grep --color=auto 14995

3. Simulate the zombie scenario by stopping the process:

    [root@k8s-4 ~]# kill -STOP 14995
    [root@k8s-4 ~]# ps aux |grep 14995
    65532 14995 0.1 0.0 743776 52756 ? Tsl Dec13 4:25 /manager
    root 38195 0.0 0.0 112716 960 pts/0 S+ 01:32 0:00 grep --color=auto 14995

4. Our recovery target is under 10 minutes. After 10 minutes, the process was still in the Tsl state.
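
The state check from step 3 can be scripted. What follows is only a minimal sketch, assuming a Linux host with /proc mounted. The function names `proc_state` and `is_stopped` are made up for illustration and are not part of fluent-operator or Docker. The sketch reads the one-letter state from /proc/<pid>/stat, where T means the process was stopped by a signal (as after `kill -STOP`) and Z would indicate a true zombie.

```shell
#!/bin/sh
# Print the one-letter kernel state of a process (R, S, D, T, Z, ...).
# Assumes Linux with /proc mounted; returns non-zero if the PID is gone.
proc_state() {
    pid=$1
    [ -r "/proc/$pid/stat" ] || return 1
    # The command name is wrapped in parentheses and may contain spaces,
    # so strip everything up to the last ") " before taking the first field.
    sed 's/^.*) //' "/proc/$pid/stat" | cut -d' ' -f1
}

# Succeed only when the process is stopped (state T), e.g. after kill -STOP.
is_stopped() {
    [ "$(proc_state "$1")" = "T" ]
}
```

With the PID from the session above, `is_stopped 14995 && echo "manager is stopped"` would report the condition that `ps aux` shows as Tsl.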

wenchajun commented 2 years ago

@519859716 Currently, no liveness probe is defined in the Deployment's YAML.
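
For context on why a probe would help here: a stopped process never answers, so any health check that enforces a timeout will fail against it. The shell sketch below illustrates only that idea. `probe_with_timeout` is a hypothetical helper, it assumes the coreutils `timeout` command is available, and it is not how the kubelet implements probes.

```shell
#!/bin/sh
# Run a health check command and treat a hang as a failure.
# Usage: probe_with_timeout SECONDS COMMAND [ARGS...]
probe_with_timeout() {
    limit=$1
    shift
    # timeout(1) kills the command once the limit passes and then
    # exits non-zero, so a check that never returns counts as failed.
    timeout "$limit" "$@" >/dev/null 2>&1
}
```

A check that hangs is then reported as failed, for example `probe_with_timeout 1 sleep 5 || echo "probe failed"`, while a check that returns promptly with success passes.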

Are you interested in collaborating on this?

KevinLiangX commented 2 years ago

It's a pleasure to get involved. We have added this to our development plan. If it works well, I will update it in our project. @wenchajun