Open grisu48 opened 1 year ago
The btrfs backend is not really supported by us; it has several limitations compared to overlay. Does the same issue happen with overlay?
Thank you for reaching out. The hypothesis that a process writing to the local container storage got killed sounds reasonable. The symptom very much suggests it.
Can you try running with the overlay storage driver? The btrfs one is not really supported.
Thank you for the reply. Yes, I can switch to the overlay driver and will keep an eye on the server. Since this only happens once every two months, I feel a reproducer is still needed; otherwise we won't be able to verify whether the switch is effective.
I will report back if I find something. If anyone else is able to find a reproducer, that would be great. I keep trying.
I might have found a way of reproducing something that looks very similar by manually deleting the btrfs subvolume of a container image:
# podman pull registry.opensuse.org/opensuse/tumbleweed
# btrfs subvolume list /var
...
ID 445 gen 10961 top level 257 path lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264
microos:~ # btrfs subvolume delete /var/lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264
Delete subvolume (no-commit): '/var/lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264'
microos:~ # podman run --rm -ti registry.opensuse.org/opensuse/tumbleweed
ERRO[0000] While recovering from a failure (creating a read-write layer), error deleting layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6": stat /var/lib/containers/storage/btrfs/subvolumes/83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6: no such file or directory
Error: creating container storage: creating read-write layer with ID "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6": stat /var/lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264: no such file or directory
microos:~ # podman run --rm -ti registry.opensuse.org/opensuse/tumbleweed
WARN[0000] Found incomplete layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6", deleting it
WARN[0000] Found incomplete layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6", deleting it
ERRO[0000] Image registry.opensuse.org/opensuse/tumbleweed exists in local storage but may be corrupted (remove the image to resolve the issue): stat /var/lib/containers/storage/btrfs/subvolumes/83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6: no such file or directory
Trying to pull registry.opensuse.org/opensuse/tumbleweed:latest...
Getting image source signatures
WARN[0000] Found incomplete layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6", deleting it
Error: copying system image from manifest list: trying to reuse blob sha256:17d9b03569f3b85e8d624ccd96d1ce96e521aa316064887ced8939778f0e0199 at destination: looking for layers with digest "sha256:17d9b03569f3b85e8d624ccd96d1ce96e521aa316064887ced8939778f0e0199": stat /var/lib/containers/storage/btrfs/subvolumes/83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6: no such file or directory
It's not a 100% match, but likely a good enough match to follow the broken code path. It looks to me like podman stumbles while trying to reuse a nonexistent layer and is not able to recover.
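The failure mode in the reproducer above — a layer record in layers.json pointing at a btrfs subvolume that no longer exists — can be checked for without running podman at all. The following is a hedged sketch, not podman tooling: the paths assume the default rootful btrfs layout under /var/lib/containers/storage, and `find_orphaned_layers` is an illustrative name. It only reports, it changes nothing.

```python
#!/usr/bin/env python3
"""Sketch: report layer records whose btrfs subvolume is missing.

Assumptions (not from podman itself): default rootful storage at
/var/lib/containers/storage, layer metadata in btrfs-layers/layers.json.
"""
import json
import os


def find_orphaned_layers(storage="/var/lib/containers/storage"):
    """Return layer IDs recorded in layers.json with no subvolume on disk."""
    with open(os.path.join(storage, "btrfs-layers", "layers.json")) as f:
        layers = json.load(f)
    return [layer["id"] for layer in layers
            if not os.path.isdir(
                os.path.join(storage, "btrfs", "subvolumes", layer["id"]))]


if __name__ == "__main__":
    default = "/var/lib/containers/storage/btrfs-layers/layers.json"
    if os.path.isfile(default):  # only meaningful on a btrfs-backed host
        for layer_id in find_orphaned_layers():
            print("missing subvolume for layer", layer_id)
```

If this prints anything while podman refuses to run, the metadata and the filesystem have diverged the same way as in the reproducer above.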
A friendly reminder that this issue had no activity for 30 days.
Same here (with overlay) didn't manage to reproduce it :/
Output of podman version:
Client: Podman Engine
Version: 4.4.1
API Version: 4.4.1
Go Version: go1.19.5
Built: Thu Feb 9 12:58:53 2023
OS/Arch: linux/amd64
Output of podman info:
host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.5-1.fc37.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: '
  cpuUtilization:
    idlePercent: 99.91
    systemPercent: 0.03
    userPercent: 0.06
  cpus: 32
  distribution:
    distribution: fedora
    variant: server
    version: "37"
  eventLogger: journald
  hostname: godzilla
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.1.11-200.fc37.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 201027629056
  memTotal: 202718363648
  networkBackend: cni
  ociRuntime:
    name: crun
    package: crun-1.8-1.fc37.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8
      commit: 0356bf4aff9a133d655dc13b1d9ac9424706cac4
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-8.fc37.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 0h 29m 12.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /jenkins/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /jenkins/.local/share/containers/storage
  graphRootAllocated: 1023708393472
  graphRootUsed: 10074681344
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /jenkins/.local/share/containers/storage/volumes
version:
  APIVersion: 4.4.1
  Built: 1675940333
  BuiltTime: Thu Feb 9 12:58:53 2023
  GitCommit: ""
  GoVersion: go1.19.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.1
If it helps, I was able to reproduce this error when I upgraded my hard drive. I unmounted it, took a disk image using GNOME disks, restored it to the new drive, used fdisk/btrfs to resize the new filesystem, and this happened.
Same here, following
Same here, and I fixed it by removing the reference to the layer (which doesn't exist) from the /var/lib/containers/storage/btrfs-layers/layers.json file.
I don't know if there is a better way to solve it, but now at least I can manage my containers without losing data.
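The manual edit described in that comment can be sketched as a script. This is a hedged sketch of the workaround, not an official tool: it assumes the same default rootful layout, `prune_dangling_layers` is a made-up name, and you should stop all podman processes first and keep the backup copy it writes.

```python
#!/usr/bin/env python3
"""Sketch of the manual workaround above: drop layer records from
btrfs-layers/layers.json whose subvolume directory no longer exists.

Assumption: default rootful layout under /var/lib/containers/storage.
Stop podman first; a .bak copy of the file is kept just in case.
"""
import json
import os
import shutil


def prune_dangling_layers(storage="/var/lib/containers/storage"):
    """Rewrite layers.json without dangling records; return count removed."""
    layers_json = os.path.join(storage, "btrfs-layers", "layers.json")
    shutil.copy2(layers_json, layers_json + ".bak")  # safety copy
    with open(layers_json) as f:
        layers = json.load(f)
    kept = [layer for layer in layers
            if os.path.isdir(os.path.join(
                storage, "btrfs", "subvolumes", layer["id"]))]
    with open(layers_json, "w") as f:
        json.dump(kept, f)
    return len(layers) - len(kept)
```

This mirrors the commenter's fix (removing the dead reference) while leaving the rest of the metadata untouched, which is why no data is lost.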
Is there something here that Podman or containers/storage needs to do, or are the workarounds good enough?
This happens to me a lot, but with ZFS, so maybe the problem is not in the storage driver but in Podman?
The ZFS storage driver, or the overlay driver with the filesystem on ZFS?
Yes, I meant the storage driver, and how Podman can get into an inconsistent state with the created ZFS filesystems, so sometimes I have to manually destroy them and restart the containers.
Sadly we have no expertise in the ZFS storage driver. We would recommend using overlay on top of a ZFS lower layer.
I'm running into a similar issue; no commands work. I can't even run podman info without it hanging on:
$ podman info
WARN[0000] Found incomplete layer "7b3d606e00285c5f48522a240c092f3881b4bb4f88577fcceaf841f5fa1ea51e", deleting it
I'm using the overlay driver and podman 4.9.3. I've also completely removed /var/lib/containers and am somehow still seeing this error.
This ("Sporadic Found incomplete layer error results in broken container engine") also happens with ext4 (using overlay); does a new issue need to be opened?
There were changes to c/storage in the last year that could have fixed this issue. Yours might be a different one. What version of Podman are you using?
Client: Podman Engine
Version: 4.9.3
API Version: 4.9.3
Go Version: go1.22.0
Git Commit: 8d2b55ddde1bc81f43d018dfc1ac027c06b26a7f-dirty
Built: Fri Feb 16 17:18:03 2024
OS/Arch: linux/amd64
First time it's happened to me in a long time of running podman on ZFS. Podman containers failed to run after a podman update this morning with this error:
WARN[0000] Found incomplete layer "df9aa3b5ac8a6adcfd7cf80962911b1576c2b41053960382fbad34565b275a08", deleting it
I had to figure out which container that referred to, delete the references to that container in /var/lib/containers/storage/zfs-layers/volatile-layers.json and /var/lib/containers/storage/zfs-containers/volatile-containers.json, and delete the container from /var/lib/containers/storage/zfs-containers/xxxx.
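The "figure out which container that referred to" step can be scripted. A hedged sketch under stated assumptions: the zfs-containers/containers.json path follows the layout named in this comment, the `layer` field linking a container record to its read-write layer is how c/storage records it on disk as far as I know, and `containers_using_layer` is an illustrative name.

```python
#!/usr/bin/env python3
"""Sketch: map an incomplete layer ID back to the containers using it.

Assumptions: rootful zfs-backed storage with container metadata in
zfs-containers/containers.json, where each record names the ID of its
read-write layer in a "layer" field.
"""
import json
import os


def containers_using_layer(layer_id, storage="/var/lib/containers/storage"):
    """Return (container_id, names) pairs whose record points at layer_id."""
    path = os.path.join(storage, "zfs-containers", "containers.json")
    with open(path) as f:
        containers = json.load(f)
    return [(c["id"], c.get("names")) for c in containers
            if c.get("layer") == layer_id]
```

With the offending container identified, the JSON references and the zfs-containers/xxxx directory can then be removed as described above.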
Podman info (zfs storage backend on zfs).
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 7
    stopped: 0
  graphDriverName: zfs
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 643926851584
  graphRootUsed: 44644958208
  graphStatus:
    Compression: zstd
    Parent Dataset: rpool/nixos/var/lib
    Parent Quota: "no"
    Space Available: "599281942528"
    Space Used By Parent: "143338541056"
    Zpool: rpool
    Zpool Health: ONLINE
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 40
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.0.3
  Built: 315532800
  BuiltTime: Mon Dec 31 16:00:00 1979
  GitCommit: ""
  GoVersion: go1.22.4
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.3
I got the same issue. Every time I try to delete the files manually it says "permission denied". Can anyone help me reproduce the issue?
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
A sporadically occurring "Found incomplete layer" error after the nightly automatic system updates on openSUSE MicroOS results in a broken podman container engine: once the error occurs, nothing works anymore. Even a podman image prune complains about the same error and fails. The only way to fix podman is to manually nuke the /var/lib/containers/storage/btrfs directory.
I'm having this issue on a MicroOS installation with the most recent podman version (4.3.1). I have a couple of containers running there, and this issue has now occurred for the second time in a month after the automatic nightly updates. A fellow redditor confirms the issue.
The issue arises after a round of automatic updates during the night. It is unclear whether the system update or a run of podman auto-update causes the issue; I have not been able to find a reproducer yet.
Steps to reproduce the issue:
A possible reproducer can be found below
Describe the results you received:
Describe the results you expected:
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
Output of podman info:
Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
btrfs
overlay
A working hypothesis is that the podman auto-update gets interrupted by a system reboot, resulting in dangling (corrupted) images. On MicroOS, the transactional-update (system update) and podman auto-update start times are randomized (i.e. systemd units with RandomizedDelaySec in place), so there is a chance that the podman auto-update service gets interrupted by a system reboot. I'm running about 8 containers on the host, so the vulnerable timeslot would not be negligible. This remains a hypothesis for the moment, as I have not yet been able to verify it.
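The hypothesis hinges on two randomized systemd timers overlapping. For illustration only (the calendar and delay values below are placeholders, not the real MicroOS defaults), such a timer unit looks like:

```ini
# Illustrative [Timer] section; OnCalendar and RandomizedDelaySec
# values are placeholders, not copied from MicroOS.
[Timer]
OnCalendar=daily
RandomizedDelaySec=1h
Persistent=true
```

With both transactional-update.timer and podman-auto-update.timer configured this way, their firing windows can overlap, so a post-update reboot could kill podman auto-update mid-pull and leave exactly the kind of half-written layer the error messages complain about.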