Closed trynocoding closed 2 days ago
There are a lot of zombie processes in zookeeper pods; pulsar-broker also has a zombie process
Please share some examples of the process command lines, as text please. Do zombie processes have any information, such as the command line?
Desktop (please complete the following information):
[root@master ~]# cat /etc/redhat-release
CentOS Stream release 9
[root@master ~]# uname -r
5.14.0-410.el9.x86_64
[root@master ~]#
@trynocoding which k8s implementation are you using?
k8s version:
[root@master ~]# kubectl get no
NAME STATUS ROLES AGE VERSION
master Ready control-plane 38d v1.27.7
[root@master ~]#
UID PID PPID C STIME TTY TIME CMD
pulsar 1 0 0 Nov26 ? 00:08:28 /opt/jvm/bin/java -Dzookeeper.4lw.commands.whitelist= -Dzookeeper.snapshot.trust.empty=true -Dzookeeper.tcpKeepAlive=true -cp /pulsar/conf:::/pulsar/lib/: -Dlog4j2.formatMsgNoLookups=true -Dorg.xerial.snappy.use.systemlib=true -Dlog4j.configurationFile=log4j2.yaml -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -Dzookeeper.clientTcpKeepAlive=true --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.util.zip=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dio.netty.tryReflectionSetAccessible=true -Dorg.apache.pulsar.shade.io.netty.tryReflectionSetAccessible=true --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-opens java.base/jdk.internal.platform=ALL-UNNAMED -Xms64m -Xmx128m -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -Dcom.sun.management.jmxremote -Djute.maxbuffer=10485760 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+DoEscapeAnalysis -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem -Xlog:async -Xlog:gc,safepoint:/pulsar/logs/pulsargc%p.log:time,uptime,tags:filecount=10,filesize=20M -Dpulsar.allocator.exit_on_oom=true -Dio.netty.recycler.maxCapacityPerThread=4096 -Dpulsar.log.appender=RoutingAppender -Dpulsar.log.dir=/pulsar/logs -Dpulsar.log.level=info -Dpulsar.log.root.level=info -Dpulsar.log.immediateFlush=false -Dpulsar.routing.appender.default=Console -Dlog4j2.is.webapp=false -Dpulsar.functions.process.container.log.dir=/pulsar/logs -Dpulsar.functions.java.instance.jar=/pulsar/instances/java-instance.jar -Dpulsar.functions.python.instance.file=/pulsar/instances/python-instance/python_instance_main.py -Dpulsar.functions.extra.dependencies.dir=/pulsar/instances/deps -Dpulsar.functions.instance.classpath=/pulsar/conf:::/pulsar/lib/: -Dpulsar.functions.log.conf=/pulsar/conf/functions_log4j2.xml -Dbookkeeper.metadata.bookie.drivers=org.apache.pulsar.metadata.bookkeeper.PulsarMetadataBookieDriver -Dbookkeeper.metadata.client.drivers=org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver -Dpulsar.log.file=zookeeper.log org.apache.zookeeper.server.quorum.QuorumPeerMain /pulsar/conf/zookeeper.conf
pulsar 172 1 0 Nov26 ? 00:00:00 [timeout]
pulsar-zookeeper-2:/pulsar$ cat /proc/172/status
Name: timeout
State: Z (zombie)
Tgid: 172
Ngid: 0
Pid: 172
PPid: 1
TracerPid: 0
Uid: 10000 10000 10000 10000
Gid: 0 0 0 0
FDSize: 0
Groups: 0
NStgid: 172
NSpid: 172
NSpgid: 172
NSsid: 172
Threads: 1
SigQ: 0/126951
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Seccomp_filters: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
Cpus_allowed: ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-95
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 3
nonvoluntary_ctxt_switches: 0
pulsar-zookeeper-2:/pulsar$
[root@master ~]# kubectl get sts pulsar-zookeeper -oyaml|grep livenessProbe -A12 livenessProbe: exec: command:
As far as I can see, about 2 zombie processes are spawned every 30s, which matches the livenessProbe period, so my guess is that the livenessProbe is causing them.
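One way to check the 30s correlation is to sample the zombie count inside the pod at intervals. The following is a minimal sketch, assuming a Linux /proc filesystem; `count_zombies` is a hypothetical helper name, and the optional argument (an alternative proc root) exists only to make the function easy to try against sample data.

```shell
# Count processes in zombie state by reading <proc_root>/<pid>/status.
# Defaults to /proc; a different root can be passed for testing.
count_zombies() {
    proc_root="${1:-/proc}"
    count=0
    for status in "$proc_root"/[0-9]*/status; do
        # Skip entries that vanished or cannot be read.
        [ -r "$status" ] || continue
        # The status file contains a line such as "State:  Z (zombie)".
        if grep -q '^State:[[:space:]]*Z' "$status" 2>/dev/null; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running `count_zombies; sleep 30; count_zombies` in the zookeeper pod should show whether the count grows by roughly two per probe period, as observed above.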
@trynocoding which type of k8s implementation is this? How did you install it? Is it minikube, Kind, microk8s, k3s, Rancher Desktop, or any of the typical k8s envs used for k8s development?
I haven't yet tried to reproduce this issue so I haven't checked if this reproduces in my environment.
@lhotari Hi, I'm using sealos to deploy a k8s environment in a VM.
[root@master images]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master Ready control-plane 41d v1.27.7 192.66.111.120
I uninstalled Pulsar 4.0.0, installed Pulsar 3.0.7, ran it for a while, and found no zombie processes.
[root@master images]# helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
pulsar default 1 2024-12-01 15:47:31.223038065 +0800 CST deployed pulsar-3.6.0 3.0.7
Then I uninstalled Pulsar 3.0.7, installed Pulsar 4.0.0 again, ran it for a while, and found zombie processes.
Thanks for doing the experiment with 3.0.7, @trynocoding. I found a blog post explaining the possible issue: https://engineeringblog.yelp.com/2016/01/dumb-init-an-init-for-docker.html . I also found https://github.com/kubernetes/kubernetes/issues/84210 .
I'll try reproducing this in my k8s test environments to see if this is specific to the k8s environment.
One significant difference between Pulsar 3.0.x and Pulsar 4.0.x is that Pulsar 3.0.x uses Ubuntu base image and Pulsar 4.0.x uses Alpine base image. That might be contributing to this issue.
I can reproduce the same issue.
pulsar-zookeeper-0:/pulsar$ ps -ef
UID PID PPID C STIME TTY TIME CMD
pulsar 1 0 4 05:51 ? 00:00:07 /opt/jvm/bin/java -Dzookeeper.4lw.commands.whitelist=* -Dzookeeper.snapshot.trus
pulsar 225 1 0 05:51 ? 00:00:00 [timeout] <defunct>
pulsar 237 1 0 05:51 ? 00:00:00 [timeout] <defunct>
pulsar 248 0 0 05:52 pts/0 00:00:00 bash
pulsar 269 1 0 05:52 ? 00:00:00 [timeout] <defunct>
pulsar 281 1 0 05:52 ? 00:00:00 [timeout] <defunct>
pulsar 296 1 0 05:52 ? 00:00:00 [timeout] <defunct>
pulsar 308 1 0 05:52 ? 00:00:00 [timeout] <defunct>
pulsar 326 1 0 05:53 ? 00:00:00 [timeout] <defunct>
pulsar 338 1 0 05:53 ? 00:00:00 [timeout] <defunct>
pulsar 352 1 0 05:53 ? 00:00:00 [timeout] <defunct>
pulsar 364 1 0 05:53 ? 00:00:00 [timeout] <defunct>
pulsar 369 248 0 05:54 pts/0 00:00:00 ps -ef
The zombie processes don't get reaped.
This problem is related to the Alpine base image. I created a minideb-based variant of apachepulsar/pulsar-all:4.0.0 using this solution: https://gist.github.com/lhotari/3ffef8117743f7044e6bbdc3933bc029 . It is pushed to lhotari/pulsar-all:4.0.0-minideb.
The problem doesn't reproduce when installing with this image (--set defaultPulsarImageRepository=lhotari/pulsar-all,defaultPulsarImageTag=4.0.0-minideb):
helm install pulsar apache/pulsar --set defaultPulsarImageRepository=lhotari/pulsar-all,defaultPulsarImageTag=4.0.0-minideb --set volumes.persistence=false --set affinity.anti_affinity=false --version 3.7.0 --set kube-prometheus-stack.enabled=false
We'll have to find a solution that solves the problem for the Alpine base image. All the resources I found mention https://github.com/krallin/tini or https://github.com/Yelp/dumb-init as the solution. tini is available as an apk package for Alpine.
Thank you for providing useful conclusions and information. I'm not very familiar with operating systems. Could you help explain the differences between Alpine and Ubuntu in terms of signal handling and process management? In a container, the business process is PID 1, which implies that the business process lacks the ability to reap child processes, thus leading to the existence of zombie processes. Why, in the case of Ubuntu, can these zombie processes be reaped? This may be beyond the scope of this question
@trynocoding The timeout wrapper added in #214 (https://github.com/apache/pulsar-helm-chart/pull/214) seems to be problematic with Alpine. Reverting that change is a possible workaround.
In the case of Docker containers, the operating system itself doesn't play a major role; what matters is the set of libraries that the image takes from its base image. A major difference between Alpine and Ubuntu is that Alpine uses the musl C standard library and Ubuntu uses the glibc C standard library.
Yeah, it works.
If the k8s version is higher than 1.20, this can be a solution. The best way may be to add dumb-init or tini to the container to manage the business process, as you said. Thank you very much for your help!
It doesn't seem to be necessary to install dumb-init or tini in this case. I ran an experiment where I installed the coreutils package into the image. coreutils includes the timeout utility. By default, the Alpine image uses busybox to provide timeout.
I experimented with a Docker image built with this type of Dockerfile, which adds the coreutils package:
FROM apachepulsar/pulsar-all:4.0.0
USER 0
RUN apk add --no-cache coreutils
USER 10000
I created https://github.com/apache/pulsar/pull/23667 to address this problem.
I have also created #556 to address the issue since the timeout wrapper for the probes isn't needed.
pulsar-400-zookeeper-2:/pulsar$ apk info -W $(which timeout)
/usr/bin/timeout symlink target is owned by busybox-1.36.1-r29
pulsar-400-zookeeper-2:/pulsar$ ls -l $(which timeout)
lrwxrwxrwx 1 root root 12 Sep 6 11:34 /usr/bin/timeout -> /bin/busybox
pulsar-400-zookeeper-2:/pulsar$ ls -l /bin/busybox
-rwxr-xr-x 1 root root 808712 Jun 10 07:11 /bin/busybox
pulsar-400-zookeeper-2:/pulsar$
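The check above can be wrapped in a small helper for comparing images. This is a minimal sketch, not part of the chart or the image; `timeout_provider` is a hypothetical name, and it only distinguishes busybox from everything else by resolving the symlink with `readlink -f`.

```shell
# Report whether a `timeout` binary resolves to busybox.
# With no argument, the `timeout` found on PATH is inspected.
timeout_provider() {
    path="${1:-$(command -v timeout)}"
    # Follow symlinks to the real binary, e.g. /usr/bin/timeout -> /bin/busybox.
    target=$(readlink -f "$path")
    case "$(basename "$target")" in
        busybox) echo "busybox" ;;
        *) echo "other: $target" ;;
    esac
}
```

In the stock 4.0.0 image shown above this should report busybox; after installing coreutils it should report a different target.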
I've learned something, thank you very much. If https://github.com/apache/pulsar-helm-chart/pull/556 is reverted, it seems that no other pods are using timeouts. Using the timeout provided by coreutils is still a good approach.
Why revert? Does the problem that https://github.com/apache/pulsar-helm-chart/pull/214 was meant to address no longer exist?
Describe the bug
There are a lot of zombie processes in zookeeper pods; pulsar-broker also has a zombie process.
To Reproduce
Steps to reproduce the behavior:
helm install pulsar apache/pulsar --set volumes.persistence=false --set affinity.anti_affinity=false --version 3.7.0 --set kube-prometheus-stack.enabled=false --set components.pulsar_manager=true
[root@master ~]# helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
pulsar default 1 2024-11-25 22:35:04.273744005 -0500 -0500 deployed pulsar-3.7.0 4.0.0
[root@master ~]#
Expected behavior
No zombie processes