
podman kube down kills pod immediately instead of performing a clean shutdown #19135

Closed. alecello closed this issue 1 year ago.

alecello commented 1 year ago

Issue Description

While playing around with Podman's systemd generator (quadlet) in a Fedora Server 38 VM, I noticed that when running systemctl stop <unit> on the service generated from my .kube file, the unit enters the failed state because the main process (the service container's conmon) exits with code 137, which suggests that conmon got SIGKILL'd. I then started playing with the bare podman kube play / podman kube down commands (without the generator and/or systemd in the mix) and noticed that, when running a service that takes some time to shut down after receiving the stop signal (in my tests I used the marctv/minecraft-papermc-server:latest Minecraft server image from Docker Hub), I get two different behaviors depending on which command I use to stop the service: podman pod stop waits for the configured stop timeout and lets the service shut down cleanly, while podman kube down kills the pod immediately.
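
For reference, the quadlet side of the test boiled down to a unit roughly like the following (file names and paths are illustrative rather than the exact ones used):

# /etc/containers/systemd/minecraft.kube
[Kube]
Yaml=/etc/containers/systemd/minecraft.yaml

[Install]
WantedBy=multi-user.target

After a systemctl daemon-reload, systemctl start minecraft.service brings the pod up, and systemctl stop minecraft.service is where the generated service ends up failed with the conmon exit code 137 mentioned above.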

I went on to replicate the issue on my Arch Linux main machine (both environments used Podman 4.5.1) and, sure enough, the same behavior could be observed.

Before opening this issue I tried removing my custom containers.conf as well as creating a new one that just sets the default container stop timeout to a high value (I tried both 600 and 6000 seconds). I also tried podman system prune and podman system reset, to no avail. All tests were run with SELinux in permissive mode (or no SELinux at all on Arch) on an otherwise minimally configured system.
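
For completeness, the stop-timeout override amounts to a containers.conf along these lines (600 shown; 6000 was tried as well):

# /etc/containers/containers.conf
[containers]
# Seconds to wait for a container to stop before it is killed (default: 10)
stop_timeout = 600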

I tried to craft a minimal example that triggers the issue on my end; here it is:

apiVersion: v1
kind: Pod
metadata:
  name: minecraft
  namespace: default
  labels:
    app: minecraft
spec:
  containers:
    - name: server
      image: marctv/minecraft-papermc-server:latest
      env:
        - name: MEMORYSIZE
          value: 1G

Steps to reproduce the issue

  1. Start a pod for a service that takes a while to quit with podman kube play <filename>
  2. In secondary terminals, run podman pod logs -f and/or a watch -n 0.1 that keeps the container process clearly visible
  3. Run podman pod stop <name>
  4. Take note of the time needed and/or the log lines printed by the pod during shutdown
  5. Restart the pod with podman pod start <name> and wait for it to initialize
  6. Run podman kube down <filename>
  7. Take note of the differences in output and/or timing (see the command sketch below)
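
Assuming the example manifest above is saved as minecraft.yaml (the pod name comes from its metadata), the two code paths compared in these steps look like this:

podman kube play minecraft.yaml      # step 1: start the pod
podman pod logs -f minecraft         # step 2: follow the logs in a second terminal
podman pod stop minecraft            # steps 3-4: clean path, honors the stop timeout
podman pod start minecraft           # step 5: bring the pod back up
podman kube down minecraft.yaml      # steps 6-7: problematic path, the pod is killed immediately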

Describe the results you received

The pod gets terminated uncleanly despite the stop timeout being set to a high value.

Describe the results you expected

Either one of two outcomes:

podman info output

host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 93.13
    systemPercent: 1.35
    userPercent: 5.52
  cpus: 2
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    version: "38"
  eventLogger: journald
  hostname: [redacted]
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.3.8-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 997011456
  memTotal: 2028412928
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.5-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.5
      commit: b6f80f766c9a89eb7b1440c0a70ab287434b17ed
      rundir: /run/user/0/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETPCAP
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 10144440320
  swapTotal: 10590609408
  uptime: 1h 58m 11.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 9
    paused: 0
    running: 6
    stopped: 3
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 12319719424
  graphRootUsed: 4797210624
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 6
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.5.1
  Built: 1685123928
  BuiltTime: Fri May 26 19:58:48 2023
  GitCommit: ""
  GoVersion: go1.20.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.1

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Default-settings QEMU virtual machine with a single NAT virtual network interface, run in a privileged session.

Additional information

None

alecello commented 1 year ago

NOTE: After performing some more tests, it looks like conmon's return code is a separate issue, but the instant-kill problem is nonetheless real.

ygalblum commented 1 year ago

Hi Alessandro, thanks for the detailed issue. My 2c: since Quadlet uses podman kube play to start K8s YAML files without parsing the file, it is not designed to know the name of the pod that will be created. As a result, it has to use podman kube down to stop the service. In addition, as you've shown, the issue happens regardless of Quadlet. Having said that, we should look into this issue at the podman kube level.
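
One way to see this from the generated unit (unit name taken from the example above; the exact command lines vary by Podman version) is:

# the stop path should reference podman kube down
systemctl cat minecraft.service | grep -E 'Exec(Start|Stop)'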

vrothberg commented 1 year ago

Thanks for reaching out, @alecello.

This issue has been fixed by commit 08b0d93ea35e. You can see in the tests that the service now transitions correctly to inactive and not failed anymore (https://github.com/containers/podman/commit/08b0d93ea35e59b388b7acf0bdc7464346a83c3a#diff-2a27b9858e6debd198c5d67a930d3dbe4ac2caa7d4bc2752daade3061bef17fcR462). We're close to releasing Podman 4.6 which will reach Fedora right after.