Closed lewo closed 3 years ago
A friendly reminder that this issue had no activity for 30 days.
This issue seems to have been lost. Sorry about that, are you still having issues with this?
Hi, yes, occasionally. It seems either the IP config file can be left dangling or a reference to the image is left behind.
I was under the impression the symlink issue with images was already resolved in c/storage, but that seems to be incorrect.
@nalind PTAL
When we get this error, I think that calling recreateSymlinks(), or a version of it that only cared about the link for the specific layer whose link we couldn't read, would work around this.
Is there an easy way to determine whether the error in question is one that would require recreateSymlinks()? Alternatively, how bad is this, performance-wise? I could force it to run every time Podman detects a reboot...
It looks like the error that's coming back from Readlink() in this case would cause os.IsNotExist() to return true.
@lewo @nalind @mheon What should we do with this issue?
I am seeing this issue too.
$ podman version
Version: 2.1.1
API Version: 2.0.0
Go Version: go1.13.15
Built: Fri Oct 2 07:30:39 2020
OS/Arch: linux/amd64
$ podman info --debug
host:
arch: amd64
buildahVersion: 1.16.1
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
package: conmon-2.0.21-1.el8.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.0.21, commit: fa5f92225c4c95759d10846106c1ebd325966f91-dirty'
cpus: 2
distribution:
distribution: '"centos"'
version: "8"
eventLogger: journald
hostname: littlesally
idMappings:
gidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
uidmap:
- container_id: 0
host_id: 1000
size: 1
- container_id: 1
host_id: 100000
size: 65536
kernel: 4.18.0-193.19.1.el8_2.x86_64
linkmode: dynamic
memFree: 1683558400
memTotal: 3798777856
ociRuntime:
name: runc
package: runc-1.0.0-65.rc10.module_el8.2.0+305+5e198a41.x86_64
path: /usr/bin/runc
version: 'runc version spec: 1.0.1-dev'
os: linux
remoteSocket:
path: /run/user/1000/podman/podman.sock
rootless: true
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-0.4.2-3.git21fdece.module_el8.2.0+305+5e198a41.x86_64
version: |-
slirp4netns version 0.4.2+dev
commit: 21fdece2737dc24ffa3f01a341b8a6854f8b13b4
swapFree: 4102025216
swapTotal: 4102025216
uptime: 7h 38m 38.17s (Approximately 0.29 days)
registries:
search:
- registry.access.redhat.com
- registry.redhat.io
- docker.io
store:
configFile: /home/daniel/.config/containers/storage.conf
containerStore:
number: 0
paused: 0
running: 0
stopped: 0
graphDriverName: overlay
graphOptions:
overlay.mount_program:
Executable: /usr/bin/fuse-overlayfs
Package: fuse-overlayfs-0.7.2-5.module_el8.2.0+305+5e198a41.x86_64
Version: |-
fuse-overlayfs: version 0.7.2
FUSE library version 3.2.1
using FUSE kernel interface version 7.26
graphRoot: /home/daniel/.local/share/containers/storage
graphStatus:
Backing Filesystem: xfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageStore:
number: 12
runRoot: /run/user/1000
volumePath: /home/daniel/.local/share/containers/storage/volumes
version:
APIVersion: 2.0.0
Built: 1601649039
BuiltTime: Fri Oct 2 07:30:39 2020
GitCommit: ""
GoVersion: go1.13.15
OsArch: linux/amd64
Version: 2.1.1
$ /usr/bin/podman run -a stdout -a stderr --cgroups no-conmon --conmon-pidfile /run/user/1000/team-heist-tactics.service-pid --cidfile /run/user/1000/team-heist-tactics.service-cid -v /var/www/team_heist_tactics_static:/bindmounted_static --publish 127.0.0.1:19996:19996 --name team-heist-tactics docker.pkg.github.com/banool/team_heist_tactics/team_heist_tactics:latest
Error: readlink /home/daniel/.local/share/containers/storage/overlay/l/YAUGQXTCBOZLL5DOMFTOX6KLBI: no such file or directory
The workaround for me was to delete the image:
podman image rm -f docker.pkg.github.com/banool/team_heist_tactics/team_heist_tactics:latest
Any idea how you got into this state? Do you have a reproducer?
@rhatdan This happened for me when I had an Orange Pi Zero, with its underpowered SD card, try pulling and launching around 7 containers simultaneously. Of course the little guy overheated and/or kernel panicked.
Whatever happened caused a forced shutdown during pulling/creation/startup of containers, possibly had containers in most of these stages, since 2 of those images were relatively small and were probably into later stages than the others.
So: high load, multiple concurrent operations, and an unclean shutdown. Maybe a long ext4 commit interval would make this issue reproduce more reliably.
I don't have a reliable reproduction @rhatdan, but I hit it frequently enough using containers running CouchDB during any shutdown/reboot. It seems to be more likely when the shutdown comes as a power-off of a VM.
@giuseppe Could this be fuse-overlay related, or is this just a partial removal from container storage that is causing this problem?
@rhatdan I don't think it is related to fuse-overlayfs. Generally the storage can get corrupted on a forced shutdown, and the missing symlinks are just one symptom. What I am worried about most is that images could be corrupted as well (e.g. missing or incomplete files), and this is difficult to detect.
When running in a cluster, CRI-O wipes out the entire storage on the next node boot if the node wasn't stopped cleanly. I think this is still the safest thing we can do for now, until we have something like "podman storage fsck" that can verify that each file in the images is not corrupted and, if needed, re-pull the image.
How difficult would it be to reassemble the storage with an fsck option? The difference between CRI-O and Podman is that blowing away containers could mean the loss of a serious amount of work. Think toolbox containers.
What I am worried the most about is that images could be corrupted as well (e.g. missing or incomplete files) and this is difficult to detect
I can confirm this is a thing that happens. I've seen some applications crashing for no apparent reason, which turned out to be fixed by removing and re-pulling the image (same digest). But again, no reproducer.
Does podman support any sort of read-only rootfs setup? Like storing images in a partition which gets mounted as ro? Or even the whole rootfs mounted as ro.
How difficult would it be to reassemble the storage with an fsck option? The difference between CRI-O and Podman is that blowing away containers could mean the loss of a serious amount of work. Think toolbox containers.
We would need to checksum each file in the image. That would get us closer to the OSTree storage model; OSTree has an fsck operation that works this way.
Alternatively, and more expensive in terms of I/O, we could record the image as pulled only after we do a syncfs().
Does podman support any sort of read-only rootfs setup? Like storing images in a partition which gets mounted as ro? Or even the whole rootfs mounted as ro.
You can use an additional store that works exactly as you described: keep the entire storage on a read-only partition and tell Podman to use it with:
additionalimagestores = [
"/path/to/the/storage"
]
in the storage.conf file.
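For context, that option lives under the [storage.options] table; a fuller storage.conf sketch (the store path is an example):

```toml
[storage]
driver = "overlay"

[storage.options]
# Read-only stores searched for images in addition to the
# regular (writable) graph root.
additionalimagestores = [
  "/mnt/ro-images"
]
```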
@nalind Any movement on this?
Sorry, been focused on other bugs.
I just faced this issue in a very strange situation. Background: I run podman 2.1.1 on an Ubuntu 20.04 WSL distro. Steps to reproduce:
This problem still exists.
A similar issue exists after a simple Linux sudo reboot: neither locks nor ports are released. But a proper systemctl reboot works OK. It looks like a systemd issue, not a Podman one. The entire discussion about power off is here: Shutdown after power off/on.
I too can confirm that this issue has just recently occurred and is still happening. It's kind of nasty. What makes this bad is the fact that the container is being started by systemd. People will scratch their heads for a LONG TIME before finding a solution.
There was a power surge in the area over the weekend, which resulted in a short loss of electricity. I had recently moved the Raspberry Pi I use for testing, and it had no UPS. Only one container seems to have been affected, a rootful haproxy. The other, rootless containers seem not to have malfunctioned.
The solution was indeed to pull haproxy again. Not really something I'd want to do in production considering that haproxy and podman are both key components.
$ sudo podman info --debug
host:
arch: arm
buildahVersion: 1.16.1
cgroupManager: systemd
cgroupVersion: v1
conmon:
package: 'conmon: /usr/libexec/podman/conmon'
path: /usr/libexec/podman/conmon
version: 'conmon version 2.0.20, commit: '
cpus: 4
distribution:
distribution: raspbian
version: "10"
eventLogger: journald
hostname: raspberrypi
idMappings:
gidmap: null
uidmap: null
kernel: 5.4.72-v7l+
linkmode: dynamic
memFree: 3287650304
memTotal: 4013862912
ociRuntime:
name: runc
package: 'runc: /usr/sbin/runc'
path: /usr/sbin/runc
version: |-
runc version 1.0.0~rc6+dfsg1
commit: 1.0.0~rc6+dfsg1-3
spec: 1.0.1
os: linux
remoteSocket:
path: /run/podman/podman.sock
rootless: false
slirp4netns:
executable: ""
package: ""
version: ""
swapFree: 104853504
swapTotal: 104853504
uptime: 1h 15m 47.34s (Approximately 0.04 days)
registries:
search:
Issue fixed in c/storage by https://github.com/containers/storage/pull/822
@umohnani8 Please backport containers/storage#822 to v1.26 so we can update the vendor in podman to fix this in podman 3.0.
@rhatdan the patch is already in c/storage v1.26
Did you open a PR to vendor in an update?
I don't think that'll pass CI due to the libcap farts, see https://github.com/containers/podman/pull/9462
I've also hit this on podman 2.2.1 on Fedora IoT. I want to emphasize that this is really problematic in low-bandwidth IoT use cases with an unstable power supply, as I'm facing right now: when operating Podman on devices at sites with slow connection speeds, data caps (around 500 MB to 1 GB per month), or per-KB/MB billing, a corrupted storage and re-downloading a 500 MB Node.js image is either impossible or at least lethal on site.
For this it's crucial to store all images on update in a dedicated, read-only storage like giuseppe suggested earlier.
Just mentioning it because it caused a lot of headache already in the past and maybe others hitting this with a similar use-case can benefit from the idea.
As a side-note on ostree based systems: I think it's a viable approach for those scenarios to commit the images into the ostree which also allows diff-based container updates, since ostree comes with the ability to do delta updates. It also ties the container image state / version to the OS version, which is also a nice property IMO.
The storage fix made it into podman 3.1.0-rc1. @rhatdan do we plan to backport this fix for 2.2.1 as well?
no.
Okay, this is fixed in podman 3.1.0-rc1 then; closing the issue now.
Just to clarify, I'm assuming that the ro-store workaround would not be sufficient for containers run as root, since it relies on filesystem permissions, right?
I'm getting hit with this fairly often using preemptible instances on Google Cloud. Since I have to expect random hard shutdowns, I'm already taking container checkpoints at regular intervals (which requires root). My fairly overkill workaround for the random corruption: if, after boot, any podman container inspect or podman image inspect returns a nonzero exit code, I dump a list of containers, run podman system reset, re-pull my images, and restore my container list from whatever checkpoints happen to be present.
My scripts seem to catch the corruption and allow recovery, but it's fairly aggressive.
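The detection heuristic described above (treat any nonzero exit code from an inspect command as storage corruption) can be sketched with an injectable command, so the pattern is visible without podman installed; healthy is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"os/exec"
)

// healthy runs "<tool> <args...>" and reports whether it exited zero.
// In the real workflow tool would be "podman" with args like
// ["container", "inspect", name]; any nonzero exit after boot would
// trigger "podman system reset" plus a re-pull and checkpoint restore.
func healthy(tool string, args ...string) bool {
	return exec.Command(tool, args...).Run() == nil
}

func main() {
	// Stand-ins for the inspect calls: "true" exits 0, "false" exits 1.
	fmt.Println(healthy("true"))  // true
	fmt.Println(healthy("false")) // false
}
```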
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
Steps to reproduce the issue:
sudo /usr/bin/podman run --rm -d --name atomix-1 -p 5679:5679 -it -v /opt/onos/config:/etc/atomix/conf -v /var/lib/atomix-1/data:/var/lib/atomix/data:Z atomix/atomix:3.1.5 --config /etc/atomix/conf/atomix-1.conf --ignore-resources --data-dir /var/lib/atomix/data --log-level WARN
sudo virsh destroy
Describe the results you received: the container failed to start with the readlink error below ("no such file or directory"). This occurs in approximately 1 in 5 forced shutdowns.
sudo /usr/bin/podman run --rm -d --name atomix-1 -p 5679:5679 -it -v /opt/onos/config:/etc/atomix/conf -v /var/lib/atomix-1/data:/var/lib/atomix/data:Z atomix/atomix:3.1.5 --config /etc/atomix/conf/atomix-1.conf --ignore-resources --data-dir /var/lib/atomix/data --log-level WARN
Error: readlink /var/lib/containers/storage/overlay/l/QRPHWAOMUOP7RQXQKPUY4Y7I3Z: no such file or directory
sudo podman inspect localhost/atomix/atomix:3.1.5
Error: error parsing image data "57ddcf43f4ac8f399810d4b44ded2c3a63e5abfb672bc447c3aa0f18e39a282c": readlink /var/lib/containers/storage/overlay/l/GMVU2BJI2CBP6Z2DFDEHCCZGTD: no such file or directory
Describe the results you expected: Container starts correctly
Additional information you deem important (e.g. issue happens only occasionally): the only workaround seems to be to delete the image and re-pull it:
sudo podman image rm -f atomix/atomix:3.1.5
sudo podman pull atomix/atomix:3.1.5
Output of podman version:
Output of podman info --debug:
Package info (e.g. output of rpm -q podman or apt list podman):
Additional environment details (AWS, VirtualBox, physical, etc.): KVM CentOS 8.1 guest VM running the latest stable podman.