Open kabakaev opened 4 years ago
@crosbymichael @AkihiroSuda We are facing a similar issue. We exec into a container and execute a `curl` command. Due to the network configuration, the `curl` command doesn't return any result. We then try to kill the `sh` and `curl` commands using the group id of the `sh` process. However, only the `sh` process is killed and `curl` becomes a defunct process. Do you have any workarounds we can use in this case?
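For reference, a minimal sketch of a client-side workaround, assuming you can run code next to the probe (the helper name `kill_probe_group` is mine, not a containerd API): signal the whole process group so `curl` receives the signal together with `sh`, and make sure PID 1 can reap the orphan, e.g. via Kubernetes' `shareProcessNamespace: true`, which makes the pause container PID 1 and lets it reap zombies.

```python
import os, signal

# A sketch, not a containerd API: given the sh PID, signal its entire
# process group, so curl gets SIGTERM along with sh. killpg(pgid, sig)
# is equivalent to kill(-pgid, sig).
def kill_probe_group(sh_pid: int) -> None:
    pgid = os.getpgid(sh_pid)
    os.killpg(pgid, signal.SIGTERM)
    # A defunct curl can only be reaped by its (new) parent; once sh is
    # gone, that is PID 1, so the entrypoint must be a reaping init.
```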
A Python version of the PoC, a bit simpler:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zombie
spec:
  containers:
  - name: zombie-test
    image: python:3.11
    command:
    - sleep
    - infinity
    readinessProbe:
      exec:
        command:
        - python3
        - -c
        - |
          import os, time
          p = os.fork()
          if p == 0:
              if os.fork() != 0:
                  exit(1)
          else:
              os.waitpid(p, 0)
              time.sleep(9999999)
      timeoutSeconds: 3
      initialDelaySeconds: 0
      periodSeconds: 5
  # kill the pod immediately when deleting
  terminationGracePeriodSeconds: 0
```
Notes:

- The entrypoint must not be `bash` or `sh`, because it will reap zombies; I use `sleep` here.
- The intermediate child calls `exit()` so that the Python parent cannot reap the grandchild.

Watch the processes:

```bash
kubectl exec -it zombie -- bash -c 'while true; do ps -xfo pid,ppid,stat,command; echo ---; sleep 5; done'
```
Output:

```
  PID  PPID STAT COMMAND
   23     0 Ss+  bash -c while true; do ps -xfo pid,ppid,stat,command; echo ---; sleep 5; done
   29    23 R+    \_ ps -xfo pid,ppid,stat,command
   15     0 Ss   python3 -c import os, time p = os.fork() if p == 0: if os.fork() != 0: exit(1) else: os.waitpid(p, 0) time.sleep(9999999
    1     0 Ss   sleep infinity
   14     1 Z    [python3] <defunct>
   22     1 Z    [python3] <defunct>
```
So, the root cause is that you called `exec sleep infinity` in your command, which replaced the default `bash` process as PID 1 (init), and it of course will not reap the zombies!
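For illustration, a minimal sketch of a reaping init in Python, as an alternative to `exec sleep infinity`; the structure is mine, not from this thread, and a real deployment would rather use something like tini or docker's `--init`:

```python
#!/usr/bin/env python3
# A sketch of a minimal reaping PID 1, assuming you control the entrypoint.
# Unlike `sleep infinity`, it calls waitpid() and so collects adopted orphans.
import os, signal, time

def reap(signum, frame):
    # Collect every exited child, including orphans re-parented to PID 1.
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap)
while True:
    time.sleep(3600)  # idle like `sleep infinity`, but reaping on SIGCHLD
```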
@fungaren, regarding your previous (deleted) comment about kubelet's responsibility to kill the liveness/readiness probe sub-processes: the corresponding Kubernetes issue was rejected and closed: https://github.com/kubernetes/kubernetes/issues/81042

I tend to agree with Kubernetes's @SergeyKanzhelev: kubelet cannot be responsible for cleanup of exec zombies, because it knows nothing about the actual processes. Kubelet simply calls a container runtime abstraction, which calls CRI `ExecSync(ctx, id.ID, cmd, timeout)` with a timeout option. It's the job of the container runtime to perform proper cleanup when the timeout is reached.
@survivant, is #5153 a duplicate of this bug report?
**TL/DR** Would it be possible to send `SIGTERM` to all leaf (grandchildren) subprocesses in an exec process group before `SIGKILL`ing them?

**What is the problem you're trying to solve**

There are many popular k8s helm charts which use bash scripts for readinessProbe and livenessProbe. Unfortunately, some of those scripts start a potentially long-running operation in a subshell. If such a subshell command takes longer than the exec `timeoutSeconds` setting, then the parent script gets terminated, leaving a zombie PID behind. A minimal example is given at the bottom of this request.
I've submitted a workaround for redis@github/bitnami/charts, but (thousands of) our developers use many other charts/containers with the very same issue. For example, the official Elasticsearch chart has the same buggy sub-shell health check. Fixing all those charts upstream is a tedious task...

I'm wondering whether it would be possible to solve such a common issue on the containerd side?
**Describe the solution you'd like**

First I thought that it may help to just enable `PR_SET_CHILD_SUBREAPER` on the exec parent.

@crosbymichael, do you know whether an exec's parent sets `PR_SET_CHILD_SUBREAPER` before running an exec or execSync command in a container? I see that the shim does that for both v1 and v2. But I'm not sure whether the `shim` is the actual parent process when `kubelet` is calling CRI's `execSync()`.

Also, it's not clear whether `PR_SET_CHILD_SUBREAPER` is helpful or not: if a bash script does a normal fork() (which I'm not sure about), then the subreaper will not help here.
Therefore, I would like to send `SIGTERM` or `SIGQUIT` to the leaf child processes in the process group of an execSync() and wait for a few milliseconds. This will allow the main script to `wait()` for the subshell exit code and finish normally.

To give an example, let's look at the processes in the pod created via the "steps to reproduce" given below. There, everything in process group 12325 corresponds to the readinessProbe. Containerd only knows about the parent PID 12325, so it `SIGKILL`s PID 12325 when the exec timeout is reached. But if we (manually) `SIGTERM` the leaf process in that process group (`kill 12332`), then the readinessProbe finishes immediately without leaving a zombie/defunct PID behind.

Dear maintainers, would you consider an option of iterating over leaf PIDs and `SIGTERM`ing them before `SIGKILL`ing the whole group? A sketch of this behavior is given after this paragraph.
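A sketch of the proposed behavior, assuming `/proc` is available (the helper names `leaf_pids` and `terminate_exec_group` are mine): find the leaf processes of the exec process group, `SIGTERM` them, give the parent script a moment to `wait()` for them, then `SIGKILL` whatever is left.

```python
import os, signal, time

def leaf_pids(pgid: int) -> list[int]:
    # Keep processes in the target group that no other group member lists
    # as its parent, i.e. the leaves of the process tree.
    procs = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                # /proc/<pid>/stat: "pid (comm) state ppid pgrp ..."
                fields = f.read().rsplit(")", 1)[1].split()
        except OSError:
            continue  # process vanished while we were scanning
        ppid, proc_pgid = int(fields[1]), int(fields[2])
        if proc_pgid == pgid:
            procs[int(entry)] = ppid
    parents = set(procs.values())
    return [pid for pid in procs if pid not in parents]

def terminate_exec_group(pgid: int, grace: float = 0.05) -> None:
    for pid in leaf_pids(pgid):
        try:
            os.kill(pid, signal.SIGTERM)   # like the manual `kill 12332`
        except ProcessLookupError:
            pass
    time.sleep(grace)                      # let the parent wait() and exit
    try:
        os.killpg(pgid, signal.SIGKILL)    # current behavior, as a fallback
    except ProcessLookupError:
        pass                               # group already gone: clean exit
```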
**Additional context**

Steps to reproduce a zombie generator:
- Optional: install a fresh [kind](https://github.com/kubernetes-sigs/kind) cluster

  ```bash
  kind create cluster --name zombies
  # Ensuring node image (kindest/node:v1.19.1)...
  kubectl config use-context kind-zombies
  ```

- Deploy a test container:

  ```bash
  cat <<'EOF' | kubectl apply -f -
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: zombie-test
    labels:
      app: zombie-test
  spec:
    replicas: 1
    strategy:
      type: Recreate
    selector:
      matchLabels:
        app: zombie-test
    template:
      metadata:
        labels:
          app: zombie-test
      spec:
        containers:
        - name: zombie-test
          image: debian:10-slim
          command:
          - bash
          - -c
          - |
            cat <<'SCRIPT' > /tmp/ping.sh
            #!/bin/bash -x
            long_running_health_check=$(
              sleep 15
              echo PONG
            )
            if [[ $long_running_health_check != "PONG" ]]; then
              echo "$long_running_health_check"
              exit 1
            fi
            SCRIPT
            chmod 755 /tmp/ping.sh
            apt update && apt -y install procps
            while true; do ps auxwf; echo '=====================' ; sleep 5; done 1>&2 &
            exec sleep infinity
          readinessProbe:
            exec:
              command:
              - sh
              - -c
              - /tmp/ping.sh
            timeoutSeconds: 10
            initialDelaySeconds: 3
            periodSeconds: 20
  EOF
  kubectl logs -f -l app=zombie-test
  ```