Yolean / kubernetes-kafka

Kafka cluster as Kubernetes StatefulSet, plain manifests and config
Apache License 2.0
1.83k stars 738 forks source link

Fix Zookeeper readiness probe timeouts #340

Closed solsson closed 3 years ago

solsson commented 3 years ago

kubectl describe on our zookeeper pods occasionally logged Warning Unhealthy 12s (x18 over 3m32s) kubelet Readiness probe errored: rpc error: code = DeadlineExceeded desc = failed to exec in container: timeout 1s exceeded: context deadline exceeded. The Kafka cluster was still healthy, but we noticed that on Containerd nodes every such timeout left a process like this one:

root     2973844   16185  0 Feb10 ?        00:00:00 [sh] <defunct>

(on newer kafka clusters the user is nonroot instead of root)

These processes stayed around forever, so that long lived nodes eventuall hit their thread limit. The reason we hadn't noticed the issue earlier was that, until recently, all our long lived nodes ran Dockerd.