
Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

container crashes (`ERRO[51301] Failed to check`…`shouldrestart`); then new containers won't start (`Error: statfs`) #20749

Closed · infinitewarp closed this issue 4 months ago

infinitewarp commented 11 months ago

Issue Description

podman (on macOS) containers crash after running for a few days, and new containers that need host directory volume mounts refuse to start.

After some time running normally (it seems to be a few days, but I don't have enough data to establish a pattern), long-running containers crash, dumping an error like this to my terminal:

ERRO[51301] Failed to check if abea25c27b5b1f01e5e3c2b9f59db73ea7d0f9cdb95434e179976601ccd37897 should restart: Post "http://d/v4.7.2/libpod/containers/abea25c27b5b1f01e5e3c2b9f59db73ea7d0f9cdb95434e179976601ccd37897/shouldrestart": EOF
Error: Post "http://d/v4.7.2/libpod/containers/abea25c27b5b1f01e5e3c2b9f59db73ea7d0f9cdb95434e179976601ccd37897/wait": EOF

After encountering this error, when I try to run any container with a host directory volume mount defined, it fails to start like this:

❯ podman run -it --rm -e QPC_DBMS=sqlite -p 9999:443 -v "$HOME/.ssh/discovery-keys":/sshkeys registry.redhat.io/discovery/discovery-server-rhel9:1.4.4-1
Error: statfs /Users/brasmith/.ssh/discovery-keys: no such file or directory

The exact same command worked before the crash. The referenced directory definitely exists on the host, and my podman machine was created with podman machine init --volume /Users/brasmith/ to ensure that containers are allowed to mount any directory under my home.

It is not only that container image that fails; that just happened to be one I was running at the time of the latest crash, and it was in the foreground of my terminal so I saw the error message. The previous time I saw this crash, I was running a completely different container based on fedora (built from a local Dockerfile, not publicly available), and it produced virtually the same error message after running for a few days. So, this error appears to be coming from podman itself, not the contents of the running containers.

Also, any container (I've tried fedora, ubi, and alpine, some from Docker Hub, some from Quay, and some from Red Hat's registry) with a host directory volume mount appears to fail the same way after the crash:

❯ podman run -it --rm -v "$HOME/.ssh/discovery-keys":/sshkeys fedora:latest echo "hello world"
Error: statfs /Users/brasmith/.ssh/discovery-keys: no such file or directory

❯ stat "$HOME/.ssh/discovery-keys"
16777220 83935664 drwxr-xr-x 10 brasmith staff 0 320 "Sep  1 16:24:21 2023" "Sep  1 16:24:19 2023" "Sep  1 16:24:19 2023" "Jun 27 10:31:46 2023" 4096 0 0 /Users/brasmith/.ssh/discovery-keys

❯ podman run -it --rm fedora:latest echo "hello world"
hello world

I think I am able to consistently "recover" my system by stopping (podman machine stop) and restarting (podman machine start) the podman machine. After restart, I am able to start a container using that same aforementioned command with no error.

I also removed (podman machine rm) and recreated (podman machine init --volume /Users/brasmith/) my podman machine the previous time this happened, hoping that might resolve the issue, but that seems to have had no effect since, well, here I am.

So, it seems that something is breaking with podman's volume mounting code after "long" container runs of a few days. I've seen this happen at least three times on the latest stable release. Unfortunately, I don't know how to reproduce this failure on demand.
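For illustration only (this is not podman's actual source), the failure mode in the "Error: statfs …: no such file or directory" message can be mimicked with Python's os.statvfs: the call succeeds while the shared-directory mount inside the machine VM is healthy, and fails with ENOENT once the mount has vanished, even though the directory still exists on the macOS host.

```python
import os
import tempfile

def check_mount_source(path: str) -> None:
    """Mimic the pre-mount stat check: raise if the source path is not visible."""
    os.statvfs(path)  # same syscall family as the client's statfs call

with tempfile.TemporaryDirectory() as host_dir:
    check_mount_source(host_dir)  # healthy mount: no exception
    try:
        # after the crash, a path that has disappeared from the VM's view
        # behaves like this:
        check_mount_source(os.path.join(host_dir, "gone"))
    except FileNotFoundError as err:
        print(f"statfs failed: {err}")
```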

Steps to reproduce the issue


  1. Create podman machine with volume mount path defined. (podman machine init --volume "$HOME")
  2. Create some directory under that path. (mkdir -p "$HOME/.ssh/discovery-keys")
  3. Start a long-running container that uses that path as a volume mount (podman run -it --rm -e QPC_DBMS=sqlite -p 9999:443 -v "$HOME/.ssh/discovery-keys":/sshkeys registry.redhat.io/discovery/discovery-server-rhel9:1.4.4-1 but not necessarily this image and container; other unrelated images exhibit the same problem over time)
  4. Wait a few days…
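Collected into a script, the steps above look roughly like this (the image, environment variables, and paths are the ones from this report; any image with a host directory volume mount reproduces it):

```shell
#!/usr/bin/env sh
# Sketch of the reproduction steps from this report; assumes podman is
# installed on macOS and the referenced image is pullable.
repro() {
  podman machine init --volume "$HOME"     # 1. machine with volume mount path
  podman machine start
  mkdir -p "$HOME/.ssh/discovery-keys"     # 2. directory under that path
  podman run -it --rm -e QPC_DBMS=sqlite -p 9999:443 \
    -v "$HOME/.ssh/discovery-keys":/sshkeys \
    registry.redhat.io/discovery/discovery-server-rhel9:1.4.4-1  # 3. long-running container
  # 4. ...then wait a few days for the crash.
}
```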

Describe the results you received

The container crashes and podman prints an error to the terminal like this:

ERRO[51301] Failed to check if abea25c27b5b1f01e5e3c2b9f59db73ea7d0f9cdb95434e179976601ccd37897 should restart: Post "http://d/v4.7.2/libpod/containers/abea25c27b5b1f01e5e3c2b9f59db73ea7d0f9cdb95434e179976601ccd37897/shouldrestart": EOF
Error: Post "http://d/v4.7.2/libpod/containers/abea25c27b5b1f01e5e3c2b9f59db73ea7d0f9cdb95434e179976601ccd37897/wait": EOF

After encountering this error, when I try to run any container with a host directory volume mount defined, it fails to start like this:

❯ podman run -it --rm -e QPC_DBMS=sqlite -p 9999:443 -v "$HOME/.ssh/discovery-keys":/sshkeys registry.redhat.io/discovery/discovery-server-rhel9:1.4.4-1
Error: statfs /Users/brasmith/.ssh/discovery-keys: no such file or directory

Describe the results you expected

Containers do not crash after a few days. I can start new containers with host directory volume mounts without restarting or recreating the podman machine.

podman info output

❯ podman version
Client:       Podman Engine
Version:      4.7.2
API Version:  4.7.2
Go Version:   go1.21.3
Git Commit:   750b4c3a7c31f6573350f0b3f1b787f26e0fe1e3
Built:        Tue Oct 31 07:59:17 2023
OS/Arch:      darwin/amd64

Server:       Podman Engine
Version:      4.7.2
API Version:  4.7.2
Go Version:   go1.21.1
Built:        Tue Oct 31 10:32:01 2023
OS/Arch:      linux/amd64

❯ podman info
host:
  arch: amd64
  buildahVersion: 1.32.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.8-2.fc39.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: '
  cpuUtilization:
    idlePercent: 99.82
    systemPercent: 0.11
    userPercent: 0.08
  cpus: 1
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: coreos
    version: "39"
  eventLogger: journald
  freeLocks: 1999
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 501
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.5.11-300.fc39.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 1440722944
  memTotal: 2048516096
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.8.0-1.fc39.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.8.0
    package: netavark-1.8.0-2.fc39.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.8.0
  ociRuntime:
    name: crun
    package: crun-1.11.1-1.fc39.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.11.1
      commit: 1084f9527c143699b593b44c23555fb3cc4ff2f3
      rundir: /run/user/501/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20231004.gf851084-1.fc39.x86_64
    version: |
      pasta 0^20231004.gf851084-1.fc39.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/501/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-1.fc39.x86_64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 3h 15m 48.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 14
    paused: 0
    running: 0
    stopped: 14
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 106769133568
  graphRootUsed: 16698638336
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 51
  runRoot: /run/user/501/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 4.7.2
  Built: 1698762721
  BuiltTime: Tue Oct 31 10:32:01 2023
  GitCommit: ""
  GoVersion: go1.21.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.7.2

❯ brew info podman
==> podman: stable 4.7.2 (bottled), HEAD
Tool for managing OCI containers and pods
https://podman.io/
/usr/local/Cellar/podman/4.7.2 (191 files, 53.9MB) *
  Poured from bottle using the formulae.brew.sh API on 2023-11-02 at 11:33:11
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/p/podman.rb
License: Apache-2.0 and GPL-3.0-or-later
==> Dependencies
Build: go ✘, go-md2man ✘, make ✘
Required: qemu ✔
==> Options
--HEAD
    Install HEAD version
==> Caveats
        In order to run containers locally, podman depends on a Linux kernel.
        One can be started manually using `podman machine` from this package.
        To start a podman VM automatically at login, also install the cask
        "podman-desktop".

zsh completions have been installed to:
  /usr/local/share/zsh/site-functions
==> Analytics
install: 16,846 (30 days), 50,618 (90 days), 114,392 (365 days)
install-on-request: 14,177 (30 days), 43,546 (90 days), 103,120 (365 days)
build-error: 1 (30 days)

❯ brew info qemu
==> qemu: stable 8.1.2 (bottled), HEAD
Emulator for x86 and PowerPC
https://www.qemu.org/
/usr/local/Cellar/qemu/8.1.2 (162 files, 530.6MB) *
  Poured from bottle using the formulae.brew.sh API on 2023-11-02 at 11:33:03
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/q/qemu.rb
License: GPL-2.0-only
==> Dependencies
Build: libtool ✔, meson ✘, ninja ✘, pkg-config ✔, spice-protocol ✘
Required: capstone ✔, dtc ✔, glib ✔, gnutls ✔, jpeg-turbo ✔, libpng ✔, libslirp ✔, libssh ✔, libusb ✔, lzo ✔, ncurses ✔, nettle ✔, pixman ✔, snappy ✔, vde ✔, zstd ✔
==> Options
--HEAD
    Install HEAD version
==> Analytics
install: 38,020 (30 days), 153,395 (90 days), 343,410 (365 days)
install-on-request: 29,313 (30 days), 117,950 (90 days), 263,331 (365 days)
build-error: 195 (30 days)

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

My host is a 2019 MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB RAM) running macOS 14 (Sonoma) and the latest release version of podman (4.7.2). I am often but not always connected to the Red Hat corporate VPN (not sure if that's relevant, but VPNs can muck with networking in unexpected ways). I close/sleep my laptop at the end of the work day while leaving containers running, and I open/wake it back up in the morning; containers always come up normally outside of this issue. I mention that because at least twice (the most recent two incidents) I've seen this crash and error happen immediately after opening/waking my laptop.

Additional information

As described above, this seems to happen randomly after running a container for a few days, and maybe it is correlated with opening/waking my laptop from sleep or disconnecting/reconnecting to the Red Hat corporate VPN (since it terminates upon sleep and does not reestablish until I enter my credentials).

So far, it seems I am able to "recover" my system by stopping (podman machine stop) and restarting (podman machine start) the podman machine. After restart, I am able to start a container using that same aforementioned command with no error… at least until it mysteriously breaks again some days later.

infinitewarp commented 11 months ago

Also, if you can suggest any commands for me to run the next time this happens, I'm happy to try to collect and share any more information or logs that I can. I just restarted my podman machine (as described above), and my system is working again for now, but I'll keep an eye out for any comments here before the next time it happens.

baude commented 11 months ago

is it possible the container is crashing because the volume mount has become unusable or unstable? I'm kind of thinking that is the case here.

infinitewarp commented 11 months ago

That seems to be a reasonable explanation, but I don't see what could cause it. My home directory is on my host's local filesystem. Nothing in its path is network mounted or anything unusual like that.

infinitewarp commented 10 months ago

FWIW, this happens about once every day or two now. If there's anything I can do to collect some useful diagnostic data here, I'm happy to try.

ERRO[34447] Failed to write input to service: EOF
ERRO[34447] Failed to check if 950fbf85f1f0247881255f45f2f42e7d75dfe93845388870422b2c21b6567171 should restart: Post "http://d/v4.7.2/libpod/containers/950fbf85f1f0247881255f45f2f42e7d75dfe93845388870422b2c21b6567171/shouldrestart": EOF
Error: Post "http://d/v4.7.2/libpod/containers/950fbf85f1f0247881255f45f2f42e7d75dfe93845388870422b2c21b6567171/wait": EOF
[root@b8fa9dd10bb2 /]# ERRO[31310] Failed to check if b8fa9dd10bb2afb0e7f67fbe58b65fe4dd03e6a1ecb5c18e35d99028e5495c69 should restart: Post "http://d/v4.7.2/libpod/containers/b8fa9dd10bb2afb0e7f67fbe58b65fe4dd03e6a1ecb5c18e35d99028e5495c69/shouldrestart": EOF
Error: Post "http://d/v4.7.2/libpod/containers/b8fa9dd10bb2afb0e7f67fbe58b65fe4dd03e6a1ecb5c18e35d99028e5495c69/wait": EOF

…and similar messages most mornings when I wake my laptop and try to resume my work. I have to stop and start the podman machine before new containers consistently behave correctly again.

hughsw commented 10 months ago

I have a similar situation: M1 MacBook Pro, using brew, up to date. I run containers I build with podman from debian:12.2. My setup mounts my project's top-level directory into the container and runs tmux inside it so I can keep a few development shells open. I also have a podman machine ssh session in another terminal, typically running top -H to keep an eye on things.

As an aside: My experience (several months with this usage pattern) is that podman crashes with some regularity. Depending on whatever update brew has recently installed, it may be frequent (several times a day right now) or occasional (once every day or two). However, those crashes are usually "silent" (shells and emacs become unresponsive), and I just use podman machine stop/start to get things going again. There hasn't been enough of a pattern, or enough of a hassle, for me to file an issue, and I like podman enough to continue using it. I mention this only as an aside to the matter at hand.

Today, I got the following messages, and that has led me to this Issue: In the tmux window:

ERRO[8320] Failed to check if cd331699163960a6366ede36d3d62b28eeeb4feedaaf4818e94975753b0a8f2c should restart: Post "http://d/v4.8.0/libpod/containers/cd331699163960a6366ede36d3d62b28eeeb4feedaaf4818e94975753b0a8f2c/shouldrestart": EOF                                                                                                                                                                       
Error: Post "http://d/v4.8.0/libpod/containers/cd331699163960a6366ede36d3d62b28eeeb4feedaaf4818e94975753b0a8f2c/wait": EOF

And in the podman machine ssh window, which was running top:

Broadcast message from root@localhost (Wed 2023-12-06 14:23:23 EST):

The system will reboot now!

Connection to localhost closed by remote host.

When I try to restart the tmux session again, like the OP I get a statfs message:

Error: statfs /Users/hugh/work/afe/ca/wd: no such file or directory

Stopping and starting the podman machine lets me start the tmux session again.

I can provide more details if you think it would be helpful.

infinitewarp commented 10 months ago

It crashed again today, and I used podman machine ssh to snoop through sudo journalctl -b before restarting the machine VM, but I didn't see anything particularly interesting other than chrony complaining that the clock was wrong. I did close my laptop while I was away over the last few days, and these crashes do usually happen in the next day right after I wake my system from sleep. Could VM clock drift/sync issues be killing the VM's directory mount somehow? Anywhere else I could look to confirm or deny this?

Dec 11 10:22:23 localhost.localdomain systemd-resolved[835]: Clock change detected. Flushing caches.
Dec 07 21:58:44 localhost.localdomain chronyd[963]: System clock wrong by 303819.007901 seconds
Dec 11 10:22:23 localhost.localdomain chronyd[963]: System clock was stepped by 303819.007901 seconds

As usual, restarting the machine worked around the broken mount.
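As a sanity check on the clock-drift theory: the size of the chrony step matches the gap between the stale journal timestamp and the time the step was applied almost exactly, which is consistent with the VM clock simply standing still while the laptop slept.

```python
from datetime import datetime, timedelta

# Step reported by chronyd in the journal above.
step = timedelta(seconds=303819.007901)

# Gap between the stale entry (Dec 07 21:58:44) and the wall-clock time at
# which the step was applied (Dec 11 10:22:23).
gap = datetime(2023, 12, 11, 10, 22, 23) - datetime(2023, 12, 7, 21, 58, 44)

print(step)  # 3 days, 12:23:39.007901
print(gap)   # 3 days, 12:23:39 -- the VM clock stood still the whole time
```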

maspetsberger commented 10 months ago

I had similar problems. Basically, my containers would work for a few minutes, and then crash and only work again after podman machine stop && podman machine start.

From what I can tell, if the VM reboots itself, the mounts (e.g. /Users) are not reapplied (Podman Desktop v1.6.3 and Podman Machine 4.8.2). And then any container volume mounts won't work either.

In my case, the reason for the reboot was the Zincati upgrade service, which was stuck in a boot loop because it couldn't apply one of the rpm-ostree upgrades (out of disk space on /boot). So I had to manually clean up deployments and then run a fresh rpm-ostree upgrade.
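For anyone hitting the same Zincati boot loop, the cleanup described above can be sketched roughly as follows; the exact rpm-ostree flags are my guess at the described fix, not a command sequence taken from this report:

```shell
# Hypothetical sketch of the recovery: free the space held by stuck
# deployments inside the machine VM, then retry the upgrade cleanly.
fix_bootloop() {
  podman machine ssh <<'EOF'
sudo rpm-ostree cleanup --pending --rollback
sudo rpm-ostree upgrade
EOF
}
# fix_bootloop   # run from the macOS host while the machine is up
```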

infinitewarp commented 10 months ago

FWIW, I'm still occasionally seeing this original crashing issue with the latest version of podman (4.8.2 at the time of this writing). When I upgraded podman on my macOS host, I also destroyed and created a new podman machine VM. Oddly, the new VM still has podman 4.7.2; that's what init created despite 4.8.2 being available elsewhere. 🤷

ERRO[32993] Failed to write input to service: EOF
ERRO[32994] Failed to check if e179ccfcbde082a1475256a2e21900ea5d2eaba4b6fd6631485b700a61638dad should restart: Post "http://d/v4.8.2/libpod/containers/e179ccfcbde082a1475256a2e21900ea5d2eaba4b6fd6631485b700a61638dad/shouldrestart": EOF
Error: Post "http://d/v4.8.2/libpod/containers/e179ccfcbde082a1475256a2e21900ea5d2eaba4b6fd6631485b700a61638dad/wait": EOF

$ podman run --rm -it -v ./repos:/repos --env-file .env downstream-builder:latest
Error: statfs /Users/brasmith/projects/downstream-builder/repos: no such file or directory

$ stat /Users/brasmith/projects/downstream-builder/repos
16777220 75457729 drwxr-xr-x 12 brasmith staff 0 384 "Dec 20 11:50:43 2023" "Dec 20 11:50:15 2023" "Dec 20 11:50:15 2023" "Apr 26 16:50:49 2023" 4096 0 0 /Users/brasmith/projects/downstream-builder/repos

$ podman version
Client:       Podman Engine
Version:      4.8.2
API Version:  4.8.2
Go Version:   go1.21.5
Git Commit:   aa546902fa1a927b3d770528565627d1395b19f3
Built:        Mon Dec 11 05:38:03 2023
OS/Arch:      darwin/amd64

Server:       Podman Engine
Version:      4.7.2
API Version:  4.7.2
Go Version:   go1.21.1
Built:        Tue Oct 31 10:32:01 2023
OS/Arch:      linux/amd64

I continue watching for and trying new releases, and I will keep checking this issue to see if it ever fixes itself. So far, no luck.

rhatdan commented 10 months ago

does podman machine ssh rpm-ostree upgrade do anything?

infinitewarp commented 9 months ago

does podman machine ssh rpm-ostree upgrade do anything?

$ podman machine ssh rpm-ostree upgrade
error: rpmostreed OS operation Upgrade not allowed for user

Negative.

Just to be extra thorough, I destroyed the machine, created it again, opened a shell, tried the upgrade command, saw that it failed because I didn't have the default password, reset the password, and tried it again… 

❯ podman machine rm -f
Waiting for VM to exit...

❯ podman machine init --volume /Users/brasmith
Extracting compressed file: podman-machine-default_fedora-coreos-39.20231204.2.1-qemu.x86_64.qcow2: done
Image resized.
Machine init complete
To start your machine run:

    podman machine start

❯ podman machine start --no-info
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/brasmith:/Users/brasmith
Machine "podman-machine-default" started successfully

❯ podman machine ssh
Connecting to vm podman-machine-default. To close connection, use `~.` or `exit`
Fedora CoreOS 39.20231204.2.1
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos

core@localhost:~$ rpm-ostree upgrade
==== AUTHENTICATING FOR org.projectatomic.rpmostree1.upgrade ====
Authentication is required to update software
Authenticating as: CoreOS Admin (core)
Password:
polkit-agent-helper-1: pam_authenticate failed: Authentication failure
==== AUTHENTICATION FAILED ====
error: rpmostreed OS operation Upgrade not allowed for user

core@localhost:~$ passwd
Changing password for user core.
Current password:
passwd: Authentication token manipulation error

core@localhost:~$ sudo passwd core
Changing password for user core.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

core@localhost:~$ rpm-ostree upgrade
==== AUTHENTICATING FOR org.projectatomic.rpmostree1.upgrade ====
Authentication is required to update software
Authenticating as: CoreOS Admin (core)
Password:
==== AUTHENTICATION COMPLETE ====
2 metadata, 0 content objects fetched; 788 B transferred in 1 seconds; 0 bytes content written
Checking out tree 3798d8e... done
Enabled rpm-md repositories: fedora-cisco-openh264 updates fedora updates-archive
Updating metadata for 'fedora-cisco-openh264'... done
Updating metadata for 'updates'... done
Updating metadata for 'fedora'... done
Updating metadata for 'updates-archive'... done
Importing rpm-md... done
rpm-md repo 'fedora-cisco-openh264'; generated: 2023-03-14T10:57:01Z solvables: 4
rpm-md repo 'updates'; generated: 2024-01-03T02:15:52Z solvables: 17411
rpm-md repo 'fedora'; generated: 2023-11-01T00:12:39Z solvables: 70825
rpm-md repo 'updates-archive'; generated: 2024-01-03T02:36:24Z solvables: 20233
Resolving dependencies... done
No upgrade available.

core@localhost:~$ podman --version
podman version 4.7.2

…but after a minute or two of downloading metadata, no updates appear to be available. 🤷

The podman version mismatch may be a red herring, but I look forward to seeing if the behavior of this issue changes once the machine/VM actually gets the latest version of podman.

Since I have no idea what might break following that password change, after this exercise, I destroyed the machine and recreated it again to resume my regular work. I'm happy to try other commands if you have any additional suggestions.

Luap99 commented 6 months ago

Can you retest this with podman 5.0?