containerd / containerd

An open and reliable container runtime
https://containerd.io
Apache License 2.0

[feature] terminate a process group grandchildren on exec timeout #4594

Open kabakaev opened 4 years ago

kabakaev commented 4 years ago

TL/DR Would it be possible to send SIGTERM to all leaf (grandchildren) subprocesses in an exec process group before SIGKILLing them?

What is the problem you're trying to solve There are many popular k8s helm charts which use bash scripts for readinessProbe and livenessProbe. Unfortunately, some of those scripts start a potentially long-running operation in subshell. If such subshell command takes longer than the exec timeoutSeconds setting, then the parent script gets terminated, leaving a zombie PID behind.

A minimal example is given at the bottom of this request.

I've submitted a workaround for redis@github/bitnami/charts, but (thousands of) our developers use many other charts/containers with the very same issue. For example, the official Elasticsearch chart has the same buggy sub-shell health check. Fixing all those charts upstream is a tedious task...

I'm wondering whether it would be possible to solve such a common issue on the containerd side?

Describe the solution you'd like First I thought it might help to simply enable PR_SET_CHILD_SUBREAPER on the exec parent.

@crosbymichael, do you know whether an exec's parent sets PR_SET_CHILD_SUBREAPER before running an exec or execSync command in a container? I see that the shim does that for both v1 and v2. But I'm not sure whether the shim is the actual parent process when kubelet calls CRI's execSync().

Also, it's not clear whether PR_SET_CHILD_SUBREAPER is helpful or not:

The setting of the "child subreaper" attribute is not inherited by children created by fork(2) and clone(2).

So, if a bash script does a normal fork() (which I'm not sure about), then the subreaper will not help here.

Therefore, I would like to send SIGTERM or SIGQUIT to the leaf child processes in the process group of an execSync() and wait for a few milliseconds. This would allow the main script to wait() for the subshell's exit code and finish normally.

To give an example, let's look at the processes in the pod created via "steps to reproduce" given below:

ps  fwxao pid,ppid,pgid,sid,comm | grep -v defunct
  PID  PPID  PGID   SID COMMAND
12325     0 12325 12325 sh
12330 12325 12325 12325  \_ ping.sh
12331 12330 12325 12325      \_ ping.sh
12332 12331 12325 12325          \_ sleep
    1     0     1     1 sleep
    8     1     1     1 bash
12324     8     1     1  \_ sleep

Here, everything in process group 12325 corresponds to the readinessProbe. Containerd only knows about the parent PID 12325, so it SIGKILLs PID 12325 when the exec timeout is reached. But if we (manually) SIGTERM the leaf process in that process group (kill 12332), then the readinessProbe finishes immediately without leaving a zombie/defunct PID behind.
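The leaf-selection step can be sketched as a pure function over a snapshot of pid → (ppid, pgid) pairs (a hypothetical illustration using the PIDs from the `ps` output above, not containerd code; a real implementation would build the snapshot from /proc/&lt;pid&gt;/stat):

```python
def leaf_pids(procs, pgid):
    """Return the leaf PIDs of process group pgid.

    procs maps pid -> (ppid, pgid). A leaf is a member of the group that
    no other group member names as its parent.
    """
    members = {pid for pid, (_, g) in procs.items() if g == pgid}
    parents = {ppid for ppid, g in procs.values() if g == pgid}
    return sorted(members - parents)

# Snapshot of the tree above: sh(12325) <- ping.sh(12330) <- ping.sh(12331) <- sleep(12332)
procs = {
    12325: (0, 12325),
    12330: (12325, 12325),
    12331: (12330, 12325),
    12332: (12331, 12325),
}
print(leaf_pids(procs, 12325))  # [12332] -- the `sleep` to SIGTERM first
```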

Dear maintainers, would you consider an option of iterating over leaf PIDs and SIGTERMing them before SIGKILLing the whole group?

Additional context

Steps to reproduce a zombie generator.

Click to expand!

- Optional: install a fresh [kind](https://github.com/kubernetes-sigs/kind) cluster

  ```bash
  kind create cluster --name zombies
  # Ensuring node image (kindest/node:v1.19.1)...
  kubectl config use-context kind-zombies
  ```

- Deploy a test container:

  ```bash
  cat <<'EOF' | kubectl apply -f -
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: zombie-test
    labels:
      app: zombie-test
  spec:
    replicas: 1
    strategy:
      type: Recreate
    selector:
      matchLabels:
        app: zombie-test
    template:
      metadata:
        labels:
          app: zombie-test
      spec:
        containers:
        - name: zombie-test
          image: debian:10-slim
          command:
          - bash
          - -c
          - |
            cat <<'SCRIPT' > /tmp/ping.sh
            #!/bin/bash -x
            long_running_health_check=$(
              sleep 15
              echo PONG
            )
            if [[ $long_running_health_check != "PONG" ]]; then
              echo "$long_running_health_check"
              exit 1
            fi
            SCRIPT
            chmod 755 /tmp/ping.sh
            apt update && apt -y install procps
            while true; do ps auxwf; echo '====================='; sleep 5; done 1>&2 &
            exec sleep infinity
          readinessProbe:
            exec:
              command:
              - sh
              - -c
              - /tmp/ping.sh
            timeoutSeconds: 10
            initialDelaySeconds: 3
            periodSeconds: 20
  EOF
  kubectl logs -f -l app=zombie-test
  ```
vivekanandpoojari commented 3 years ago

@kabakaev @crosbymichael @AkihiroSuda We are facing a similar issue. We exec into a container and execute a curl command. Due to network configuration, the curl command doesn't return any result. We then try to kill the sh and curl commands using the group ID of the sh process. However, only the sh process is killed and curl becomes a defunct process.

Do you have any workarounds we can use in this case?

fungaren commented 1 year ago

A Python PoC, a bit simpler:

apiVersion: v1
kind: Pod
metadata:
  name: zombie
spec:
  containers:
  - name: zombie-test
    image: python:3.11
    command:
    - sleep
    - infinity
    readinessProbe:
      exec:
        command:
        - python3
        - -c
        - |
          import os, time
          p = os.fork()
          if p == 0:
            if os.fork() != 0:
              exit(1)
          else:
            os.waitpid(p, 0)
            time.sleep(9999999)
      timeoutSeconds: 3
      initialDelaySeconds: 0
      periodSeconds: 5
  # kill the pod immediately when deleting
  terminationGracePeriodSeconds: 0
Notes:

1. The main process must NOT be bash or sh, because they reap zombies. I use sleep here.
2. exit() must be called so that the Python process cannot reap the child process.
3. See the zombies:
kubectl exec -it zombie -- bash -c 'while true; do ps -xfo pid,ppid,stat,command; echo ---; sleep 5; done'

Output:

    PID    PPID STAT COMMAND
     23       0 Ss+  bash -c while true; do ps -xfo pid,ppid,stat,command; echo ---; sleep 5; done
     29      23 R+    \_ ps -xfo pid,ppid,stat,command
     15       0 Ss   python3 -c import os, time p = os.fork() if p == 0:   if os.fork() != 0:     exit(1) else:   os.waitpid(p, 0)   time.sleep(9999999
      1       0 Ss   sleep infinity
     14       1 Z    [python3] <defunct>
     22       1 Z    [python3] <defunct>

So, the root cause is that you called `exec sleep infinity` in your command, which replaced the default bash process as PID 1 (init), and `sleep` of course will not reap the zombies!
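For completeness, here is a minimal sketch (my illustration, not from the thread) of the reaping a proper PID 1 performs: call waitpid(-1, WNOHANG) in a loop to collect any exited children, including orphans that were reparented to it.

```python
import os
import time

def reap_zombies():
    """Reap all exited children without blocking; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append(pid)
    return reaped

# Demo: fork a child that exits immediately, let it become a zombie, then reap it.
child = os.fork()
if child == 0:
    os._exit(0)
time.sleep(0.1)
reaped = reap_zombies()
print(child in reaped)  # True
```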

kabakaev commented 1 year ago

@fungaren, regarding your previous (deleted) comment about kubelet responsibility to kill the liveness/readiness probe sub-processes, the corresponding kubernetes issue was rejected and closed: https://github.com/kubernetes/kubernetes/issues/81042

I tend to agree with Kubernetes's @SergeyKanzhelev: kubelet cannot be responsible for cleanup of exec zombies, because it knows nothing about the actual processes. Kubelet simply calls a container runtime abstraction, which calls CRI ExecSync(ctx, id.ID, cmd, timeout) with a timeout option. It's the job of the container runtime to perform proper cleanup when the timeout is reached.
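What that runtime-side cleanup could look like, sketched in Python rather than containerd's Go (my illustration of the escalation idea, not containerd's actual code, and simplified in that it signals the whole group at once rather than leaves first): run the exec'd command in its own process group, and on timeout send SIGTERM to the group, falling back to SIGKILL after a short grace period.

```python
import os
import signal
import subprocess

def run_with_timeout(cmd, timeout, grace=0.5):
    """Run cmd in a new process group; on timeout, SIGTERM then SIGKILL the group."""
    proc = subprocess.Popen(cmd, start_new_session=True)  # new session => own pgid
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)      # polite: lets scripts wait() and exit
        try:
            return proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            os.killpg(pgid, signal.SIGKILL)  # last resort
            return proc.wait()

rc = run_with_timeout(["sleep", "60"], timeout=0.2)
print(rc)  # -15 on Linux: terminated by SIGTERM
```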

@survivant, is #5153 a duplicate of this bug report?