NethServer / dev

NethServer issue tracker
https://github.com/NethServer/dev/issues

Podman "Invalid argument" error after node reboot #6916

Closed stephdl closed 1 month ago

stephdl commented 2 months ago

Steps to reproduce

On Rocky Linux, reboot the node. The issue is not always reproducible.

Expected behavior

I expect that the reboot starts all modules without issue

Actual behavior

It seems that when the server starts, before any user service is started, we try to unmount a merged volume. This looks like a random side effect: some user systemd services fail to unmount it, others succeed.

I have seen this failure for openldap, sambaAD and clamav, and others have experienced it for rspamd and postfix. All occurrences followed a reboot.

For instance, in my case the journal shows:

May 02 09:51:29 R1-pve.rocky9-pve.org systemd[2040]: Starting Samba AD Domain Controller...
May 02 09:51:29 R1-pve.rocky9-pve.org podman[6068]: time="2024-05-02T09:51:29+02:00" level=warning msg="Unmounting container \"samba-dc\" while attempting to delete storage: unmounting \"/home/samba1/.local/share/containers/storage/overlay/c07c970255101ffff8fb38162c32beb7ef7884f8d>
May 02 09:51:29 R1-pve.rocky9-pve.org podman[6068]: Error: removing storage for container "samba-dc": unmounting "/home/samba1/.local/share/containers/storage/overlay/c07c970255101ffff8fb38162c32beb7ef7884f8d1a12f4cdd0da18adc1c4873/merged": invalid argument
May 02 09:51:29 R1-pve.rocky9-pve.org systemd[2040]: samba-dc.service: Control process exited, code=exited, status=125/n/a
May 02 09:51:29 R1-pve.rocky9-pve.org systemd[2040]: samba-dc.service: Failed with result 'exit-code'.
May 02 09:51:29 R1-pve.rocky9-pve.org systemd[2040]: Failed to start Samba AD Domain Controller.

and

May 02 09:51:27 R1-pve.rocky9-pve.org systemd[2039]: Starting OpenLDAP directory server...
May 02 09:51:27 R1-pve.rocky9-pve.org podman[4026]: time="2024-05-02T09:51:27+02:00" level=warning msg="Unmounting container \"openldap\" while attempting to delete storage: unmounting \"/home/openldap1/.local/share/containers/storage/overlay/401b7a7b2369e7283b6ee84ceeb976bef569d7>
May 02 09:51:27 R1-pve.rocky9-pve.org podman[4026]: Error: removing storage for container "openldap": unmounting "/home/openldap1/.local/share/containers/storage/overlay/401b7a7b2369e7283b6ee84ceeb976bef569d7d221a3b57207ce9d063acc75a7/merged": invalid argument
May 02 09:51:27 R1-pve.rocky9-pve.org systemd[2039]: openldap.service: Control process exited, code=exited, status=125/n/a
May 02 09:51:27 R1-pve.rocky9-pve.org systemd[2039]: openldap.service: Failed with result 'exit-code'.
May 02 09:51:27 R1-pve.rocky9-pve.org systemd[2039]: Failed to start OpenLDAP directory server.

The server was rebooted at 09:48:14: https://gist.github.com/stephdl/697992bd89198a8bf8673593e0717582
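To see which modules were hit after a reboot, the journal can be filtered for the error line shown above. This is only a sketch: `find_unmount_failures` is a hypothetical helper name, and the pattern assumes the message format is exactly the one in the logs above.

```shell
# Hypothetical helper: read journal lines on stdin and print
# "<container> <merged path>" for each unmount failure.
# Assumes the error text matches the Podman 4.6.1 message quoted in this issue.
find_unmount_failures() {
    sed -n 's|.*Error: removing storage for container "\([^"]*\)": unmounting "\([^"]*/merged\)": invalid argument.*|\1 \2|p'
}
```

It could be used as `journalctl -b --no-pager | find_unmount_failures` to list the affected containers for the current boot.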

The immediate fix is to remove the merged directories:

runagent -m openldap1 rm -rf ../../.local/share/containers/storage/overlay/*/merged/

However, the issue is not reliably reproducible on the next reboot: it might occur again for the same module, or for another one.

[root@R1-pve ~]# podman info
host:
  arch: amd64
  buildahVersion: 1.31.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: cebaba63f66de0e92cdc7e2a59f39c9208281158'
  cpuUtilization:
    idlePercent: 99.75
    systemPercent: 0.07
    userPercent: 0.18
  cpus: 8
  databaseBackend: boltdb
  distribution:
    distribution: '"rocky"'
    version: "9.3"
  eventLogger: journald
  freeLocks: 2042
  hostname: R1-pve.rocky9-pve.org
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-362.24.1.el9_3.0.1.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 4394389504
  memTotal: 8057622528
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.el9.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-2.el9_3.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/user/0/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.el9.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 6874460160
  swapTotal: 6874460160
  uptime: 1h 33m 22.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 40811614208
  graphRootUsed: 2911166464
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 13
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1709719721
  BuiltTime: Wed Mar  6 11:08:41 2024
  GitCommit: ""
  GoVersion: go1.20.12
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

and for the user samba1

[root@R1-pve ~]# runagent -m samba1
runagent: [INFO] starting bash -l
runagent: [INFO] working directory: /home/samba1/.config/state
[samba1@R1-pve state]$ podman ps
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
[samba1@R1-pve state]$ podman info
host:
  arch: amd64
  buildahVersion: 1.31.3
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: cebaba63f66de0e92cdc7e2a59f39c9208281158'
  cpuUtilization:
    idlePercent: 94.48
    systemPercent: 1.59
    userPercent: 3.93
  cpus: 8
  databaseBackend: boltdb
  distribution:
    distribution: '"rocky"'
    version: "9.3"
  eventLogger: file
  freeLocks: 2044
  hostname: R1-pve.rocky9-pve.org
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1004
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1004
      size: 1
    - container_id: 1
      host_id: 362144
      size: 65536
  kernel: 5.14.0-362.24.1.el9_3.0.1.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 6383128576
  memTotal: 8057618432
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.el9.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-2.el9_3.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/user/1004/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    path: /run/user/1004/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.el9.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 6874460160
  swapTotal: 6874460160
  uptime: 0h 2m 48.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /home/samba1/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/samba1/.local/share/containers/storage
  graphRootAllocated: 19925041152
  graphRootUsed: 9575915520
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/user/1004/containers
  transientStore: false
  volumePath: /home/samba1/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1709719721
  BuiltTime: Wed Mar  6 11:08:41 2024
  GitCommit: ""
  GoVersion: go1.20.12
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

Components

podman 4.6.1

See also

https://community.nethserver.org/t/mail-cannot-retrieve-filter-configuration/23484
https://community.nethserver.org/t/ns8-merged-error/23386

DavidePrincipi commented 2 months ago

This issue should be fixed in newer Podman releases. Just for testing, we can install the latest one on Rocky Linux 9 with

dnf -y copr enable rhcontainerbot/podman-next
dnf update

The upcoming 9.4 will ship Podman 4.9.

Concerning Debian, we have to wait for Trixie (13).
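To check whether a node already runs a newer Podman, the installed version can be compared with a threshold. A minimal sketch follows; `version_at_least` is a hypothetical helper, and using 4.9 as the threshold is an assumption taken from the release mentioned above, not a confirmed "first fixed" version.

```shell
# Hypothetical helper: succeeds when version $1 is equal to or newer than $2.
version_at_least() {
    # sort -V (GNU coreutils) orders version strings; if the required
    # version sorts first, the installed one is equal or newer.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_at_least "$(podman --version | awk '{ print $3 }')" 4.9 && echo "Podman is 4.9 or newer"`.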


As a workaround, we could run a script early in the boot process to remove the empty directories that trigger the error:

( echo 'rmdir /var/lib/containers/storage/overlay/*/merged' ; getent passwd | awk -F : '$3 >= 1000 { print "rmdir " $6 "/.local/share/containers/storage/overlay/*/merged" }' ) | sh -s || :
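The one-liner above can also be written as functions, which makes the per-directory behaviour easier to follow. This is a sketch, not the script shipped in the core release; `clean_merged_dirs` and `clean_all_merged_dirs` are hypothetical names.

```shell
# Hypothetical helper: remove empty overlay "merged" directories
# under one container storage root (passed as $1).
clean_merged_dirs() {
    storage_root=$1
    [ -d "$storage_root/overlay" ] || return 0
    for dir in "$storage_root"/overlay/*/merged; do
        [ -d "$dir" ] || continue
        # rmdir only removes empty directories; a populated or mounted
        # one makes rmdir fail, and that failure is ignored.
        rmdir "$dir" 2>/dev/null || :
    done
}

# Rootful storage first, then the home of every account with UID >= 1000,
# mirroring the getent/awk selection of the one-liner.
clean_all_merged_dirs() {
    clean_merged_dirs /var/lib/containers/storage
    getent passwd | awk -F : '$3 >= 1000 { print $6 }' | while IFS= read -r home; do
        clean_merged_dirs "$home/.local/share/containers/storage"
    done
}
```

Calling `clean_all_merged_dirs` as root before the user services start would have the same effect as the one-liner. Unlike `rm -rf`, `rmdir` leaves any non-empty directory untouched.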

DavidePrincipi commented 1 month ago

Test case

Check that the bug is not reproducible with core 2.8.0-dev.5.

DavidePrincipi commented 1 month ago

Cannot reproduce the original bug, but at least the unit does not fail at boot.

VERIFIED

DavidePrincipi commented 1 month ago

Released Core https://github.com/NethServer/ns8-core/releases/tag/2.8.0