Closed: infinitewarp closed this issue 4 months ago.
Also, if you can suggest any commands for me to run the next time this happens, I'm happy to try to collect and share any more information or logs that I can. I just restarted my podman machine (as described above), and my system is working again for now, but I'll keep an eye out for any comments here before the next time it happens.
Is it possible the container is crashing because the volume mount has become unusable or unstable? I'm kind of thinking that is the case here.
That seems to be a reasonable explanation, but I don't see what could cause it. My home directory is on my host's local filesystem. Nothing in its path is network mounted or anything unusual like that.
FWIW, this happens about once every day or two now. If there's anything I can do to collect some useful diagnostic data here, I'm happy to try.
ERRO[34447] Failed to write input to service: EOF
ERRO[34447] Failed to check if 950fbf85f1f0247881255f45f2f42e7d75dfe93845388870422b2c21b6567171 should restart: Post "http://d/v4.7.2/libpod/containers/950fbf85f1f0247881255f45f2f42e7d75dfe93845388870422b2c21b6567171/shouldrestart": EOF
Error: Post "http://d/v4.7.2/libpod/containers/950fbf85f1f0247881255f45f2f42e7d75dfe93845388870422b2c21b6567171/wait": EOF
[root@b8fa9dd10bb2 /]# ERRO[31310] Failed to check if b8fa9dd10bb2afb0e7f67fbe58b65fe4dd03e6a1ecb5c18e35d99028e5495c69 should restart: Post "http://d/v4.7.2/libpod/containers/b8fa9dd10bb2afb0e7f67fbe58b65fe4dd03e6a1ecb5c18e35d99028e5495c69/shouldrestart": EOF
Error: Post "http://d/v4.7.2/libpod/containers/b8fa9dd10bb2afb0e7f67fbe58b65fe4dd03e6a1ecb5c18e35d99028e5495c69/wait": EOF
…and similar messages most mornings when I wake my laptop and try to resume my work. I have to stop and start the podman machine before new containers consistently behave correctly again.
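Since you've offered to collect diagnostics, one quick thing to check from inside the machine next time, before restarting it, is whether the host volume mount is still present at all. A minimal sketch, using a canned `mount` line so it can run anywhere; on a real machine you would pipe `podman machine ssh -- mount` into the same grep. The mount tag (`vol0`) and the filesystem type are assumptions about the QEMU backend, not details taken from this thread:

```shell
# Sample `mount` output standing in for the real thing; podman machine's
# host volume mounts show up as 9p (QEMU) or virtiofs depending on the
# VM provider (assumed here, not confirmed in this issue).
mounts='vol0 on /Users type 9p (rw,relatime,sync,dirsync,access=client,trans=virtio)'

# On a live machine: podman machine ssh -- mount | grep -E ' type (9p|virtiofs) '
if printf '%s\n' "$mounts" | grep -Eq ' type (9p|virtiofs) '; then
  echo "host volume mount still present"
else
  echo "host volume mount gone"
fi
```

If the mount is listed but `statfs` still fails, that would point at the mount being stale rather than missing, which is a different failure mode.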
I have a similar situation: M1 MacBook Pro, using brew, up to date. I run containers I build with podman from debian:12.2. My runtime situation is to have my project's top-level directory mounted into the container and to run `tmux` in the container so as to have a few development shells. I also have a `podman machine ssh` session in another terminal, typically running `top -H` to keep an eye on things.
As an aside: My experience (several months with this usage pattern) is that podman crashes with some regularity. Depending on whatever update brew has recently installed, it may be frequent (several times a day right now) or occasional (once every day or two or so). However, those crashes are usually "silent" (shells and emacs become unresponsive) and I just use `podman machine stop`/`start` to get things going again. There has not been enough of a pattern, or enough hassle, for me to file an issue, and I like podman enough to continue using it. This info is mentioned as an aside to the matter at hand.
Today, I got the following messages, and that has led me to this Issue: In the tmux window:
ERRO[8320] Failed to check if cd331699163960a6366ede36d3d62b28eeeb4feedaaf4818e94975753b0a8f2c should restart: Post "http://d/v4.8.0/libpod/containers/cd331699163960a6366ede36d3d62b28eeeb4feedaaf4818e94975753b0a8f2c/shouldrestart": EOF
Error: Post "http://d/v4.8.0/libpod/containers/cd331699163960a6366ede36d3d62b28eeeb4feedaaf4818e94975753b0a8f2c/wait": EOF
And in the `podman machine ssh` window, which was running top:
Broadcast message from root@localhost (Wed 2023-12-06 14:23:23 EST):
The system will reboot now!
Connection to localhost closed by remote host.
When I try to restart the tmux session again, like the OP, I get a statfs message:
Error: statfs /Users/hugh/work/afe/ca/wd: no such file or directory
Stopping and starting the podman machine lets me start the tmux session again.
I can provide more details if you think it would be helpful.
It crashed again today, and I used `podman machine ssh` to snoop through `sudo journalctl -b` before restarting the machine VM, but I didn't see anything particularly interesting other than chrony complaining that the clock was wrong. I did close my laptop while I was away over the last few days, and these crashes do usually happen right after I wake my system from sleep. Could VM clock drift/sync issues be killing the VM's directory mount somehow? Anywhere else I could look to confirm or deny this?
Dec 11 10:22:23 localhost.localdomain systemd-resolved[835]: Clock change detected. Flushing caches.
Dec 07 21:58:44 localhost.localdomain chronyd[963]: System clock wrong by 303819.007901 seconds
Dec 11 10:22:23 localhost.localdomain chronyd[963]: System clock was stepped by 303819.007901 seconds
As usual, restarting the machine worked around the broken mount.
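To put a number on the clock-drift theory, one could scan the machine's journal for large clock steps and see whether they line up with the crashes. A sketch over the chronyd lines quoted above; in practice you would feed it the output of `podman machine ssh -- sudo journalctl -b` instead of the canned sample:

```shell
# Sample journal lines (the ones quoted above); flag any chronyd clock
# step larger than an hour, which would correspond to waking the laptop
# after a long sleep.
journal='Dec 07 21:58:44 localhost.localdomain chronyd[963]: System clock wrong by 303819.007901 seconds
Dec 11 10:22:23 localhost.localdomain chronyd[963]: System clock was stepped by 303819.007901 seconds'

printf '%s\n' "$journal" | awk '/stepped by/ {
  for (i = 1; i < NF; i++)
    if ($i == "by" && $(i + 1) + 0 > 3600)
      print "large clock step: " $(i + 1) " s"
}'
```

Against the lines above this reports the 303819-second step, roughly 3.5 days, which matches a laptop left asleep over a long weekend.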
I had similar problems. Basically, my containers would work for a few minutes, and then crash and only work again after `podman machine stop && podman machine start`.
From what I can tell, if the VM reboots itself, the mounts (e.g. /Users) are not reapplied (Podman Desktop v1.6.3 and Podman Machine 4.8.2). And then any container volume mounts won't work either.
In my case the reason for the reboot was the Zincati upgrade service, which was stuck in a boot loop because it couldn't apply one of the rpm-ostree upgrades (out of disk on /boot). So I had to manually clean up deployments and then run a fresh `rpm-ostree upgrade`.
FWIW, I'm still occasionally seeing this original crashing issue with the latest version of podman (4.8.2 at the time of this writing). When I upgraded podman on my macOS host, I also destroyed and created a new podman machine VM. Yes, the new VM still has podman 4.7.2; that's what it created despite 4.8.2 being available elsewhere. 🤷
ERRO[32993] Failed to write input to service: EOF
ERRO[32994] Failed to check if e179ccfcbde082a1475256a2e21900ea5d2eaba4b6fd6631485b700a61638dad should restart: Post "http://d/v4.8.2/libpod/containers/e179ccfcbde082a1475256a2e21900ea5d2eaba4b6fd6631485b700a61638dad/shouldrestart": EOF
Error: Post "http://d/v4.8.2/libpod/containers/e179ccfcbde082a1475256a2e21900ea5d2eaba4b6fd6631485b700a61638dad/wait": EOF
$ podman run --rm -it -v ./repos:/repos --env-file .env downstream-builder:latest
Error: statfs /Users/brasmith/projects/downstream-builder/repos: no such file or directory
$ stat /Users/brasmith/projects/downstream-builder/repos
16777220 75457729 drwxr-xr-x 12 brasmith staff 0 384 "Dec 20 11:50:43 2023" "Dec 20 11:50:15 2023" "Dec 20 11:50:15 2023" "Apr 26 16:50:49 2023" 4096 0 0 /Users/brasmith/projects/downstream-builder/repos
$ podman version
Client: Podman Engine
Version: 4.8.2
API Version: 4.8.2
Go Version: go1.21.5
Git Commit: aa546902fa1a927b3d770528565627d1395b19f3
Built: Mon Dec 11 05:38:03 2023
OS/Arch: darwin/amd64
Server: Podman Engine
Version: 4.7.2
API Version: 4.7.2
Go Version: go1.21.1
Built: Tue Oct 31 10:32:01 2023
OS/Arch: linux/amd64
I continue watching for and trying new releases, and I will keep checking this issue to see if it ever fixes itself. So far, no luck.
Does `podman machine ssh rpm-ostree upgrade` do anything?
> Does `podman machine ssh rpm-ostree upgrade` do anything?
$ podman machine ssh rpm-ostree upgrade
error: rpmostreed OS operation Upgrade not allowed for user
Negative.
Just to be extra thorough, I destroyed the machine, created it again, opened a shell, tried the upgrade command, saw that it failed because I didn't have the default password, reset the password, and tried it again…
❯ podman machine rm -f
Waiting for VM to exit...
❯ podman machine init --volume /Users/brasmith
Extracting compressed file: podman-machine-default_fedora-coreos-39.20231204.2.1-qemu.x86_64.qcow2: done
Image resized.
Machine init complete
To start your machine run:
podman machine start
❯ podman machine start --no-info
Starting machine "podman-machine-default"
Waiting for VM ...
Mounting volume... /Users/brasmith:/Users/brasmith
Machine "podman-machine-default" started successfully
❯ podman machine ssh
Connecting to vm podman-machine-default. To close connection, use `~.` or `exit`
Fedora CoreOS 39.20231204.2.1
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos
core@localhost:~$ rpm-ostree upgrade
==== AUTHENTICATING FOR org.projectatomic.rpmostree1.upgrade ====
Authentication is required to update software
Authenticating as: CoreOS Admin (core)
Password:
polkit-agent-helper-1: pam_authenticate failed: Authentication failure
==== AUTHENTICATION FAILED ====
error: rpmostreed OS operation Upgrade not allowed for user
core@localhost:~$ passwd
Changing password for user core.
Current password:
passwd: Authentication token manipulation error
core@localhost:~$ sudo passwd core
Changing password for user core.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
core@localhost:~$ rpm-ostree upgrade
==== AUTHENTICATING FOR org.projectatomic.rpmostree1.upgrade ====
Authentication is required to update software
Authenticating as: CoreOS Admin (core)
Password:
==== AUTHENTICATION COMPLETE ====
2 metadata, 0 content objects fetched; 788 B transferred in 1 seconds; 0 bytes content written
Checking out tree 3798d8e... done
Enabled rpm-md repositories: fedora-cisco-openh264 updates fedora updates-archive
Updating metadata for 'fedora-cisco-openh264'... done
Updating metadata for 'updates'... done
Updating metadata for 'fedora'... done
Updating metadata for 'updates-archive'... done
Importing rpm-md... done
rpm-md repo 'fedora-cisco-openh264'; generated: 2023-03-14T10:57:01Z solvables: 4
rpm-md repo 'updates'; generated: 2024-01-03T02:15:52Z solvables: 17411
rpm-md repo 'fedora'; generated: 2023-11-01T00:12:39Z solvables: 70825
rpm-md repo 'updates-archive'; generated: 2024-01-03T02:36:24Z solvables: 20233
Resolving dependencies... done
No upgrade available.
core@localhost:~$ podman --version
podman version 4.7.2
…but after a minute or two of downloading metadata, no updates appear to be available. 🤷
The podman version mismatch may be a red herring, but I look forward to seeing if the behavior of this issue changes once the machine/VM actually gets the latest version of podman.
Since I have no idea what might break following that password change, after this exercise, I destroyed the machine and recreated it again to resume my regular work. I'm happy to try other commands if you have any additional suggestions.
Can you retest this with podman 5.0?
Issue Description
podman (on macOS) containers crash after running for a few days, and new containers that need host directory volume mounts refuse to start.
After some time running normally (it seems to be a few days, but I don't have enough data to establish a pattern), long-running containers crash with an error like this dumped to my terminal:
After encountering this error, when I try to run any container with a host directory volume mount defined, it fails to start like this:
The exact same command worked before the crash. The referenced directory definitely exists on the host, and my podman machine was created as `podman machine init --volume /Users/brasmith/` to ensure that containers should be allowed to mount any directory under my home. It is not only that container image that fails; that just happened to be the one I was running at the time of the latest crash, and it was in the foreground of my terminal, so I saw the error message. The previous time I saw this crash, I was running a completely different container based on fedora (built from a local Dockerfile, not publicly available), and it produced virtually the same error message after running for a few days. So, this error appears to be coming from podman itself, not the contents of the running containers.
Also, any container (I've tried fedora, ubi, and alpine; some from Docker Hub, some from Quay, and some from Red Hat's registry) with a host directory volume mount appears to fail the same way after the crash:
I think I am able to consistently "recover" my system by stopping (`podman machine stop`) and restarting (`podman machine start`) the podman machine. After restart, I am able to start a container using that same aforementioned command with no error.
I also removed (`podman machine rm`) and recreated (`podman machine init --volume /Users/brasmith/`) my podman machine the previous time this happened, hoping that might resolve the issue, but that seems to have had no effect since, well, here I am.
So, it seems that something is breaking in podman's volume mounting code after "long" container runs of a few days. I've seen this happen at least three times on the latest stable release. Unfortunately, I don't know how to reproduce this failure on demand.
Steps to reproduce the issue
1. Create a podman machine with the home directory mounted (`podman machine init --volume "$HOME"`)
2. Create a host directory to mount (`mkdir -p "$HOME/.ssh/discovery-keys"`)
3. Run a container with that host directory volume mount (`podman run -it --rm -e QPC_DBMS=sqlite -p 9999:443 -v "$HOME/.ssh/discovery-keys":/sshkeys registry.redhat.io/discovery/discovery-server-rhel9:1.4.4-1`), but not necessarily this image and container; other unrelated images exhibit the same problem over time
Describe the results you received
The container crashes and podman prints an error to the terminal like this:
After encountering this error, when I try to run any container with a host directory volume mount defined, it fails to start like this:
Describe the results you expected
Containers do not crash after a few days. I can start new containers with host directory volume mounts without restarting or recreating the podman machine.
podman info output
Podman in a container: No
Privileged Or Rootless: Rootless
Upstream Latest Release: Yes
Additional environment details
My host is a 2019 MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB RAM) running macOS 14 (Sonoma) and the latest release version of podman (4.7.2). I am often but not always connected to the Red Hat corporate VPN (not sure if that's relevant, but VPNs can muck with networking in unexpected ways). I close/sleep my laptop at the end of the work day while leaving containers running, I open/wake it back up in the morning, and containers always come up normally outside of this issue; I mention that because at least twice (the most recent two incidents) I've seen this crash+error happen immediately after opening/waking my laptop.
Additional information
As described above, this seems to happen randomly after running a container for a few days, and maybe it is correlated with opening/waking my laptop from sleep or disconnecting/reconnecting to the Red Hat corporate VPN (since it terminates upon sleep and does not reestablish until I enter my credentials).
So far, it seems I am able to "recover" my system by stopping (`podman machine stop`) and restarting (`podman machine start`) the podman machine. After restart, I am able to start a container using that same aforementioned command with no error… at least until it mysteriously breaks again some days later.