containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Podman does not remove the overlay storage when the systemd service is restarted during reboot or shutdown #21093

Closed disi closed 10 months ago

disi commented 10 months ago

Issue Description

When the system reboots and systemd shuts down, Podman fails to remove the storage for every container on the system. Once the system comes back up and tries to start the service, it cannot create new storage because the old storage still exists.

Dec 27 11:08:07 dombox podman[1525]: time="2023-12-27T11:08:07Z" level=warning msg="Unmounting container \"frigate\" while attempting to delete storage: unmounting \"/var/lib/containers/storage/overlay/300f007294edabcaf4dff453cca24448d74d235076494e326f057262fed864f0/merged\": invalid argument"

And when the service tries to start after reboot:

dombox podman[1490890]: Error: reading CIDFile: open /run/container-frigate.service.ctr-id: no such file or directory

This worked before and stopped working about a week ago. It happens to all containers that run as podman systemd services. Once the containers are running, I can restart, stop, and start them via systemd, but they do not survive a reboot or a shutdown followed by a fresh boot of the operating system.

Steps to reproduce the issue


  1. podman run -d --name frigate <other options>
  2. podman generate systemd -f --new --name frigate
  3. cp container-frigate.service /usr/lib/systemd/system/
  4. systemctl daemon-reload
  5. systemctl enable --now container-frigate.service
  6. reboot

Describe the results you received

Dec 27 11:08:07 dombox podman[1525]: time="2023-12-27T11:08:07Z" level=warning msg="Unmounting container \"frigate\" while attempting to delete storage: unmounting \"/var/lib/containers/storage/overlay/300f007294edabcaf4dff453cca24448d74d235076494e326f057262fed864f0/merged\": invalid argument"

And during boot:

dombox podman[1490890]: Error: reading CIDFile: open /run/container-frigate.service.ctr-id: no such file or directory

Describe the results you expected

A clean shutdown of the container during reboot or shutdown.

podman info output

`podman-4.6.1-7.el9_3.x86_64`

Client:       Podman Engine
Version:      4.6.1
API Version:  4.6.1
Go Version:   go1.20.10
Built:        Tue Dec 12 22:13:14 2023
OS/Arch:      linux/amd64
host:
  arch: amd64
  buildahVersion: 1.31.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: 879ca989e09d731947cd8d9cbb41038549bf669d'
  cpuUtilization:
    idlePercent: 90.19
    systemPercent: 2.91
    userPercent: 6.91
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: '"almalinux"'
    version: "9.3"
  eventLogger: journald
  freeLocks: 2033
  hostname: dombox.dom
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-362.13.1.el9_3.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 30156939264
  memTotal: 33225449472
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.el9.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-2.el9_3.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.1-1.el9.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 511700992
  swapTotal: 511700992
  uptime: 0h 7m 53.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 8
    paused: 0
    running: 8
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 103028883456
  graphRootUsed: 18046713856
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 8
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1702419194
  BuiltTime: Tue Dec 12 22:13:14 2023
  GitCommit: ""
  GoVersion: go1.20.10
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

### Podman in a container

No

### Privileged Or Rootless

Privileged

### Upstream Latest Release

Yes

### Additional environment details


### Additional information

I am not a Podman expert. Everything ran fine for a year, and the problem only started about a week ago. I have since tried adding Before and After ordering between the container services, but I get the same result: the first container that systemd tries to start after reboot already fails.
I only noticed these issues now. I update the OS (AlmaLinux) regularly.

As a workaround, I wrote a small script that I now run after every reboot or shutdown, and then I reboot again. After the second reboot, the services/containers start as expected:

[root@dombox ~]# cat remove_podman_storage.sh

#!/bin/bash

yes | podman rm --storage homeassistant
yes | podman rm --storage mosquitto
yes | podman rm --storage zigbee2mqtt
yes | podman rm --storage wyoming-piper
yes | podman rm --storage wyoming-whisper
yes | podman rm --storage frigate
yes | podman rm --storage pihole
yes | podman rm --storage hass-configurator
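
For readers who want to adapt the workaround, the same cleanup can be sketched as a loop. This is only an illustration, not something from the original report: the function name `remove_leftover_storage` and the `PODMAN` override variable are made up here, and the sketch assumes `podman rm --storage` does not prompt (so the `yes |` pipe is dropped).

```shell
#!/bin/bash
# Hypothetical helper: remove leftover storage for each named container.
# PODMAN is an illustrative override (e.g. PODMAN=echo for a dry run);
# it defaults to the real podman binary.
remove_leftover_storage() {
    local name
    for name in "$@"; do
        # Same command as in the workaround script above, one container at a time.
        "${PODMAN:-podman}" rm --storage "$name" \
            || echo "could not remove storage for $name" >&2
    done
}

# Example call with the container names from this report (left commented out):
# remove_leftover_storage homeassistant mosquitto zigbee2mqtt wyoming-piper \
#     wyoming-whisper frigate pihole hass-configurator
```

Setting `PODMAN=echo` before calling the function prints the commands instead of running them, which makes it easy to check the list of names first.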

rhatdan commented 10 months ago

I am not sure what is going on, but it looks like the containers are killed before they can clean up. Have you thought about using quadlets to define your containers? Quadlet uses the latest podman commands for running containers under systemd, whereas podman generate systemd generates a unit file that is a snapshot in time and could carry older bugs.

disi commented 10 months ago

> I am not sure what is going on, but it looks like the containers are killed before they can clean up. Have you thought about using quadlets to define your containers? Quadlet uses the latest podman commands for running containers under systemd, whereas podman generate systemd generates a unit file that is a snapshot in time and could carry older bugs.

Thank you for the tip, I'll have a look. Not too many containers need to be moved, I'll try with one container for now.

P.S. Rewriting the services as quadlets is quite a task, but after reboot the only container that is up is the one I rewrote:

CONTAINER ID  IMAGE                                   COMMAND     CREATED         STATUS         PORTS       NAMES
71cbf8cb183f  ghcr.io/blakeblackshear/frigate:stable              20 seconds ago  Up 21 seconds              frigate

The disadvantage is that it still does not run correctly, because it cannot access the OpenVINO libraries for the GPU:

RuntimeError: Failed to create plugin /usr/local/lib/python3.9/dist-packages/openvino/libs/libopenvino_intel_gpu_plugin.so for device GPU

I guess it has to do with privileged mode in the normal podman service. I could not find a matching option here: https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html

Also, the dry run told me that ShmSize is an unsupported key in the "Container" section:

converting "container-frigate.container": unsupported key 'ShmSize' in group 'Container' in /etc/containers/systemd/container-frigate.container

I'll go back to the normal systemd service for now to get it up and running, even though rebooting still does not work.

pps. OK, Frigate is fixed and running via quadlet now :) It was missing:

AddDevice=/dev/dri/card0
AddDevice=/dev/dri/renderD128

It may be better to run these containers without privileged mode anyway.

disi commented 10 months ago

So, I have rewritten them all as quadlets, and now I have the same issue after reboot:

Dec 28 10:16:41 dombox container-frigate[1521]: time="2023-12-28T10:16:41Z" level=warning msg="Unmounting container \"frigate\" while attempting to delete storage: unmounting \"/var/lib/containers/storage/overlay/0c260cff1e6c0ab7b04bae2249df0d518e8a42b13a649701e55390a523e2cfcb/merged\": invalid argument"
Dec 28 10:16:41 dombox container-frigate[1521]: Error: removing storage for container "frigate": unmounting "/var/lib/containers/storage/overlay/0c260cff1e6c0ab7b04bae2249df0d518e8a42b13a649701e55390a523e2cfcb/merged": invalid argument
Dec 28 10:16:41 dombox podman[1521]: 2023-12-28 10:16:41.457749847 +0000 GMT m=+0.025109765 image pull dac652c4cf36785c91195cdba2d45ad2b320ae417ca221fcff91b4817c34d4dd ghcr.io/blakeblackshear/frigate:stable
Dec 28 10:16:41 dombox systemd[1]: container-frigate.service: Main process exited, code=exited, status=125/n/a
Dec 28 10:16:41 dombox systemd[1]: var-lib-containers-storage-overlay.mount: Deactivated successfully.
Dec 28 10:16:41 dombox systemd[1]: container-frigate.service: Failed with result 'exit-code'.
Dec 28 10:16:41 dombox systemd[1]: Failed to start Podman Quadlet container-frigate.
Dec 28 10:16:41 dombox systemd[1]: container-frigate.service: Scheduled restart job, restart counter is at 5.
Dec 28 10:16:41 dombox systemd[1]: Stopped Podman Quadlet container-frigate.
Dec 28 10:16:41 dombox systemd[1]: container-frigate.service: Start request repeated too quickly.
Dec 28 10:16:41 dombox systemd[1]: container-frigate.service: Failed with result 'exit-code'.
Dec 28 10:16:41 dombox systemd[1]: Failed to start Podman Quadlet container-frigate.

The system runs only 8 containers as services. After running my workaround script, I can start the services one by one.
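
One way to check whether the path named in the warning is actually mounted after the reboot is to look it up in the kernel mount table. `umount(2)` returns EINVAL ("invalid argument") when the target is not a mount point, which would fit the log above if the `merged` directory was left behind unmounted. That reading is an assumption and is not confirmed in this thread. The helper name `list_merged_mounts` is invented for this sketch, and the default storage root is the `graphRoot` from the `podman info` output.

```shell
#!/bin/bash
# Hypothetical helper: print overlay "merged" mount points below a storage root.
# Usage: list_merged_mounts [graph_root] [mounts_file]
list_merged_mounts() {
    local root="${1:-/var/lib/containers/storage}"
    local mounts="${2:-/proc/self/mounts}"
    # Field 2 of each line in the mounts file is the mount point.
    awk -v prefix="$root/overlay/" '
        index($2, prefix) == 1 && $2 ~ /\/merged$/ { print $2 }
    ' "$mounts"
}
```

If the directory from the warning is missing from this list, the container's overlay is not mounted, and only the on-disk storage entry remains.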

rhatdan commented 10 months ago

@vrothberg @giuseppe thoughts?

disi commented 10 months ago

Here is the container:

[root@dombox ~]# cat /etc/containers/systemd/container-frigate.container
[Unit]
Description=Podman Quadlet container-frigate

[Container]
ContainerName=frigate

Image=ghcr.io/blakeblackshear/frigate:stable

Network=host

Volume=/etc/localtime:/etc/localtime:ro
Volume=/stratis/frigate:/config
Volume=/frigate:/media/frigate

AddDevice=/dev/dri/card0
AddDevice=/dev/dri/renderD128

AutoUpdate=registry

Environment=FRIGATE_RTSP_PASSWORD=dom
Tmpfs=/tmp/cache:rw

[Service]
Restart=always

[Install]
WantedBy=default.target

I only show the logs for that frigate container to keep it consistent, but the others have the same problem.

vrothberg commented 10 months ago

> So, I have rewritten them all as quadlets, and now I have the same issue after reboot:

I don't see the reading CIDFile error in the logs with Quadlet. Any chance you can use a newer version of Podman?

Luap99 commented 10 months ago

I think this is a duplicate of https://github.com/containers/podman/issues/19913 and https://github.com/containers/podman/issues/19491. Basically, Podman 4.6 ships a c/storage version that reports a bunch of errors that we can't handle after an unclean shutdown. That is fixed in Podman 4.7.
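
To check whether a host already runs a Podman release with the fix, the installed version can be compared against 4.7. The following is a minimal sketch, assuming GNU `sort -V` is available (as it is on AlmaLinux). The function name `version_at_least` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
version_at_least() {
    local have="$1" want="$2"
    # sort -V orders version strings; if the wanted version sorts first
    # (or both are equal), the installed version is new enough.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Example use (left commented out so the sketch runs without podman installed):
# have="$(podman version --format '{{.Client.Version}}')"
# if version_at_least "$have" 4.7.0; then
#     echo "Podman $have should include the fix"
# else
#     echo "Podman $have predates 4.7"
# fi
```

With the 4.6.1 version reported in this issue, the check fails, which matches the behaviour described in the thread.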