cri-o / cri-o

Open Container Initiative-based implementation of Kubernetes Container Runtime Interface
https://cri-o.io
Apache License 2.0

Pod cannot be deleted due to missing container startup command #8272

Open Bevisy opened 3 months ago

Bevisy commented 3 months ago

What happened?

Create a pod using pod-config.json and container-config-nginx.json:

# cat pod-config.json
{
    "metadata": {
        "name": "nginx-sandbox",
        "namespace": "default",
        "attempt": 1,
        "uid": "hdishd83djaidwnduwk28bcsb"
    },
    "log_directory": "/tmp",
    "linux": {
    }
}

# cat container-config-nginx.json
{
  "metadata": {
      "name": "nginx-0"
  },
  "image":{
      "image": "docker.io/library/nginx:latest"
  },
  "command": [
      "top"
  ],
  "linux": {
  }
}

Container creation then fails because `top` does not exist in the image:

# crictl run container-config-nginx.json pod-config.json
FATA[0012] running container: creating container failed: rpc error: code = Unknown desc = create container: create result: internal/proto/conmon.capnp:Conmon.createContainer: Failed: child command exited with: 1: executable file `top` not found in $PATH: No such file or directory

At this point, the container process on the node is left as a zombie (PID 15509 below, a child of crio-conmonrs), and the pod cannot be deleted.

      1   15487   15486    2552 pts/1      11037 Sl       0   0:00 /usr/bin/crio-conmonrs --runtime /usr/bin/crio-crun --runtime-dir /var/lib/containers/storage/overlay-containers/7d46c4f2908be02f02465923ca1aca87295e8872231dae236287fe69209fdec9/userdata --runtime-root /run/crun --log-level info --log-driver systemd --cgroup-manager systemd
  15487   15496   15496   15496 ?             -1 Ss       0   0:00  \_ /pause
  15487   15509   15486    2552 pts/1      11037 Z        0   0:00  \_ [3] <defunct>
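The zombie state shown in the listing above can also be checked directly in /proc. The following is a minimal Python sketch for doing that on a Linux node; it is not part of CRI-O or conmon-rs, and the helper names `parse_stat` and `zombie_children` are hypothetical, chosen only for this illustration.

```python
import os


def parse_stat(line):
    """Parse one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    fields = line[right + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def zombie_children(parent_pid, proc_root="/proc"):
    """Return (pid, comm) for every child of parent_pid in state 'Z'."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process went away or is not readable
        if state == "Z" and ppid == parent_pid:
            zombies.append((pid, comm))
    return zombies
```

On the node above, `zombie_children(15487)` should report the defunct child of crio-conmonrs, provided /proc agrees with the `ps` output.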

What did you expect to happen?

Expect the container process to exit normally instead of becoming a zombie process.

How can we reproduce it (as minimally and precisely as possible)?

See the steps under "What happened?" above.

Anything else we need to know?

No response

CRI-O and Kubernetes version

```console
$ crio --version
crio version 1.31.0
Version: 1.31.0
GitCommit: a51dfb336a1d3847415dfa871e81d003e4ef79ae
GitCommitDate: 2024-05-21T07:18:21Z
GitTreeState: dirty
GoVersion: go1.22.3
Compiler: gc
Platform: linux/amd64
Linkmode: dynamic
BuildTags: containers_image_ostree_stub libdm_no_deferred_remove seccomp selinux
LDFlags: unknown
SeccompEnabled: true
AppArmorEnabled: false
```

OS version

```console
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux lima-crio 6.1.0-21-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux
```

Additional environment details (AWS, VirtualBox, physical, etc.)

Nothing else.

Bevisy commented 3 months ago

I found that this might be caused by conmon-rs not correctly waiting for the process to exit: https://github.com/containers/conmon-rs/blob/02e270ff2a227562feadb9731dc4e9f840d7638c/conmon-rs/server/src/child_reaper.rs#L120
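For context on what "waiting for the process to exit" means here: on POSIX systems an exited child stays in the process table as `<defunct>` until its parent collects the exit status with `wait`/`waitpid`. The sketch below illustrates that general pattern in Python for a child whose exec fails. It is not conmon-rs code and does not claim to reproduce its logic; the function name `run_and_reap` and the exit code 127 (the shell convention for "command not found") are assumptions made for this example.

```python
import os


def run_and_reap(argv):
    """Fork, exec argv in the child, and always reap the child in the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the requested command.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed (e.g. executable not in $PATH).
            # 127 is an assumed convention for this sketch.
            os._exit(127)
    # Parent: waitpid collects the exit status. Without this call the
    # exited child would remain a zombie for as long as the parent lives.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

The point of the pattern is that the failure path (exec never succeeds) still ends in a `waitpid` call, so the child is reaped whether or not the command started.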

saschagrunert commented 3 months ago

Indeed this looks like an issue in conmon-rs :thinking:

github-actions[bot] commented 2 months ago

A friendly reminder that this issue had no activity for 30 days.
