containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

System boot hangs indefinitely on unclean shutdown with transient mode #22984

Open lambinoo opened 3 weeks ago


Issue Description

I have a setup with multiple quadlet files that manage long-term services, one-shot jobs, pods and volumes. All of this runs on CentOS 9, with podman in transient mode and a separate filesystem for container storage.

Unclean shutdowns happen quite frequently on this system.

We recently encountered a bug where the system hangs indefinitely at boot, waiting forever on a quadlet-generated pod, volume, or one-shot container service. This seems to happen after an unclean shutdown. The current workaround is to set appropriate timeouts and have systemd restart the services when they expire.
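The timeout/restart workaround could look roughly like the sketch below. It assumes the service names quadlet generates for the three files in this issue (pod-pod.service, myvolume-volume.service, cntr.service); the helper name `write_dropins`, the drop-in file name, and the 120 s / 10 s values are hypothetical choices, not tested recommendations. One relevant systemd detail: for a `Type=oneshot` service, which is what quadlet generates for a `.volume` file, the start timeout is disabled by default, so without an explicit `TimeoutStartSec=` its start job can wait forever.

```shell
#!/bin/sh
set -eu

# write_dropins DIR: hypothetical helper that writes one systemd drop-in per
# quadlet-generated unit into DIR (normally /etc/systemd/system).
write_dropins() {
    dir="$1"
    # Assumed unit names quadlet generates for pod.pod, myvolume.volume
    # and cntr.container.
    for unit in pod-pod.service myvolume-volume.service cntr.service; do
        mkdir -p "$dir/$unit.d"
        # Bound the start job and let systemd retry the unit when the start
        # fails or times out. The values are illustrative, not tuned.
        cat > "$dir/$unit.d/10-boot-timeout.conf" <<'EOF'
[Service]
TimeoutStartSec=120
Restart=on-failure
RestartSec=10
EOF
    done
}

# Only write drop-ins when a target directory is passed explicitly.
if [ "$#" -gt 0 ]; then
    write_dropins "$1"
fi
```

As root, something like `sh boot-timeout.sh /etc/systemd/system && systemctl daemon-reload` would apply it (the script name is hypothetical). Adding the same keys in a `[Service]` section of the quadlet files themselves should have a similar effect, since quadlet copies that section into the generated unit.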

I have opened a PR that attempts to fix this issue: #22985

Steps to reproduce the issue

  1. Install the quadlet files from this issue in /etc/containers/systemd, reboot the system once, and wait for all the services to start.
  2. Hard-reboot the system (e.g. reboot -f).
  3. Log in and run systemctl list-jobs to observe that either the pod or the volume service is blocking the boot.
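Step 3 could be scripted roughly as follows, for example to check the outcome across many hard reboots. This is a sketch: the helper `filter_quadlet_jobs` and the `--check` switch are hypothetical, and it assumes the quadlet-generated unit names for the files in this issue and the usual JOB UNIT TYPE STATE columns of `systemctl list-jobs`.

```shell
#!/bin/sh
set -eu

# filter_quadlet_jobs: read `systemctl list-jobs --no-legend` output on stdin
# and print the unit name of every queued job that belongs to one of the
# (assumed) quadlet-generated units from this issue.
filter_quadlet_jobs() {
    awk '$2 ~ /^(pod-pod|myvolume-volume|cntr)\.service$/ { print $2 }'
}

# After the hard reboot, list any quadlet jobs that are still pending.
if [ "${1:-}" = "--check" ]; then
    systemctl list-jobs --no-legend | filter_quadlet_jobs
fi
```

An empty result after boot would mean none of these units still has a queued job.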

Quadlet files:

# pod.pod
[Pod]
PodName=mypod

# myvolume.volume
[Unit]
Description=Create volume

[Volume]
Copy=false
GlobalArgs=--log-level=debug

# cntr.container
[Container]
Image=docker.io/library/ubuntu:latest
Volume=myvolume.volume:/vol
Pod=pod.pod
Exec=sleep infinity

[Install]
WantedBy=multi-user.target
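For reference, quadlet derives each generated service name from the quadlet file name, and these are the names that show up in systemctl list-jobs. The helper below is a hypothetical sketch covering only the three file types used here; it does not account for any other quadlet options that might change the unit name.

```shell
#!/bin/sh

# quadlet_service_name FILE: print the systemd service name quadlet is
# expected to generate for a .container, .pod or .volume file.
quadlet_service_name() {
    base=$(basename "$1")
    name=${base%.*}
    case "$base" in
        *.container) echo "$name.service" ;;
        *.pod) echo "$name-pod.service" ;;
        *.volume) echo "$name-volume.service" ;;
        *) return 1 ;;
    esac
}
```

The generated units can then be inspected with, for example, `systemctl cat myvolume-volume.service`, or rendered without installing them via `/usr/libexec/podman/quadlet -dryrun`.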

Describe the results you received

System hangs forever during the boot phase

Describe the results you expected

Boot completes without hanging

podman info output

host:
  arch: amd64
  buildahVersion: 1.36.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.12-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: 7ba5bd6c81ff2c10e07aee8c4281d12a2878fa12'
  cpuUtilization:
    idlePercent: 75.44
    systemPercent: 5.62
    userPercent: 18.93
  cpus: 12
  databaseBackend: sqlite
  distribution:
    distribution: centos
    version: "9"
  eventLogger: journald
  freeLocks: 2031
  hostname: HOSTNAME
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-430.el9.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 6276349952
  memTotal: 16339382272
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.9.0-1.el9.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.9.0
    package: netavark-1.11.0-1.el9.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.11.0
  ociRuntime:
    name: crun
    package: crun-1.15-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/user/0/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20231204.gb86afe3-1.el9.x86_64
    version: |
      pasta 0^20231204.gb86afe3-1.el9.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.3.1-1.el9.x86_64
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 3h 34m 39.00s (Approximately 0.12 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 8
    paused: 0
    running: 8
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 554240225280
  graphRootUsed: 18993541120
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 15
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.1.0
  Built: 1717411100
  BuiltTime: Mon Jun  3 10:38:20 2024
  GitCommit: ""
  GoVersion: go1.22.3 (Red Hat 1.22.3-2.el9)
  Os: linux
  OsArch: linux/amd64
  Version: 5.1.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

No

Additional environment details

Podman 5.1.0 in transient mode on a CentOS 9-based system, with a separate filesystem for container storage mounted at /var/lib/containers. Unclean shutdowns are frequent.

Additional information

No response