containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

[btrfs] Sporadic Found incomplete layer error results in broken container engine #16882

Open grisu48 opened 1 year ago

grisu48 commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

A sporadically occurring "Found incomplete layer" error after the nightly automatic system updates on openSUSE MicroOS results in a broken podman container engine:

WARN[0000] Found incomplete layer "236fcd368394d7094f40012a131c301d615722e60b25cb459efa229a7242041b", deleting it 
Error: stat /var/lib/containers/storage/btrfs/subvolumes/236fcd368394d7094f40012a131c301d615722e60b25cb459efa229a7242041b: no such file or directory

Once the error occurs, nothing works anymore. Even a podman image prune complains about the same error and fails. The only way to fix podman is to manually nuke the /var/lib/containers/storage/btrfs directory.
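
Roughly, that manual cleanup looks like the sketch below. Note that it destroys all local images and containers, and the unit name is an assumption for my setup:

# stop anything that may still be using the storage
systemctl stop podman.socket
# btrfs subvolumes cannot always be removed with plain rm, so delete them explicitly
for sv in /var/lib/containers/storage/btrfs/subvolumes/*; do
  btrfs subvolume delete "$sv"
done
# remove the remaining driver state; podman recreates it on next use
rm -rf /var/lib/containers/storage/btrfs /var/lib/containers/storage/btrfs-layers

All images have to be pulled again afterwards.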

I'm having this issue on a MicroOS installation with the most recent podman version (4.3.1). I have a couple of containers running there, and the issue has now occurred for the second time in a month after the automatic nightly updates. A fellow redditor confirms the issue.

The issue arises after a round of automatic updates during the night. It is unclear whether the system update or a run of podman auto-update causes it; I have not been able to find a reproducer yet.

Steps to reproduce the issue:

A possible reproducer can be found below

Describe the results you received:

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.17.13
Built:        Tue Nov 22 00:00:00 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.28.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.5-2.1.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: unknown'
  cpuUtilization:
    idlePercent: 98.92
    systemPercent: 0.36
    userPercent: 0.72
  cpus: 4
  distribution:
    distribution: '"opensuse-microos"'
    version: "20221217"
  eventLogger: journald
  hostname: starfury
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.0.12-1-default
  linkmode: dynamic
  logDriver: journald
  memFree: 309272576
  memTotal: 7366852608
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.4-2.1.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      commit: v1.1.4-0-ga916309fff0f
      spec: 1.0.2-dev
      go: go1.18.6
      libseccomp: 2.5.4
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-1.1.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: unknown
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 3h 47m 14.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.opensuse.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 8
    paused: 0
    running: 8
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 26834087936
  graphRootUsed: 9974857728
  graphStatus:
    Build Version: Btrfs v6.0.2
    Library Version: "102"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 8
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 1669075200
  BuiltTime: Tue Nov 22 00:00:00 2022
  GitCommit: ""
  GoVersion: go1.17.13
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman-4.3.1-1.1.x86_64

Have you tested with the latest version of Podman and have you checked Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

A working hypothesis is that podman auto-update gets interrupted by a system reboot, resulting in dangling (corrupted) images. On MicroOS, the start times of the transactional-update (system update) and podman auto-update services are randomized (i.e. systemd units with RandomizedDelaySec in place), so there is a chance that the podman auto-update service gets interrupted by a system reboot. I'm running about 8 containers on the host, so the vulnerable time slot would not be negligible. This remains a hypothesis for the moment, as I have not yet been able to verify it.
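
To see how the two timers line up, something like the following should work (the unit names are what I would expect on MicroOS and from the podman package, so treat them as assumptions):

# when do the timers fire next, and how large is the random window?
systemctl list-timers transactional-update.timer podman-auto-update.timer
systemctl cat transactional-update.timer | grep -i RandomizedDelaySec
systemctl cat podman-auto-update.timer | grep -i RandomizedDelaySec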

giuseppe commented 1 year ago

The btrfs backend is not really supported by us; it has several limitations compared to overlay.

Does the same issue happen with overlay?

vrothberg commented 1 year ago

Thank you for reaching out. The hypothesis that a process writing to the local container storage got killed sounds reasonable. The symptom very much suggests it.

Can you try running with the overlay storage driver? The btrfs one is not really supported.

grisu48 commented 1 year ago

Thank you for the reply. Yes, I can switch to the overlay driver and will keep an eye on the server. Since this only happens once every two months, I feel a reproducer is still needed; otherwise we won't be able to verify whether the switch is effective.
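
For the record, the change I have in mind is roughly the following sketch of /etc/containers/storage.conf. Switching the graph driver effectively starts from an empty storage, so the old btrfs state has to be cleared first (e.g. with podman system reset while the btrfs driver is still configured):

[storage]
driver = "overlay"
graphroot = "/var/lib/containers/storage"
runroot = "/run/containers/storage"

Afterwards all images need to be pulled again.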

I will report back if I find something. If anyone else is able to find a reproducer, that would be great. I'll keep trying.

grisu48 commented 1 year ago

I might have found a way of reproducing something that looks very similar by manually deleting the btrfs subvolume of a container image:

# podman pull registry.opensuse.org/opensuse/tumbleweed
# btrfs subvolume list /var
...
ID 445 gen 10961 top level 257 path lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264

microos:~ # btrfs subvolume delete /var/lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264
Delete subvolume (no-commit): '/var/lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264'

microos:~ # podman run --rm -ti registry.opensuse.org/opensuse/tumbleweed
ERRO[0000] While recovering from a failure (creating a read-write layer), error deleting layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6": stat /var/lib/containers/storage/btrfs/subvolumes/83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6: no such file or directory 
Error: creating container storage: creating read-write layer with ID "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6": stat /var/lib/containers/storage/btrfs/subvolumes/c41c3850f07ded41774e72f20cc6d1338736763922160d73a49156a8c52c9264: no such file or directory

microos:~ # podman run --rm -ti registry.opensuse.org/opensuse/tumbleweed
WARN[0000] Found incomplete layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6", deleting it 
WARN[0000] Found incomplete layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6", deleting it 
ERRO[0000] Image registry.opensuse.org/opensuse/tumbleweed exists in local storage but may be corrupted (remove the image to resolve the issue): stat /var/lib/containers/storage/btrfs/subvolumes/83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6: no such file or directory 
Trying to pull registry.opensuse.org/opensuse/tumbleweed:latest...
Getting image source signatures
WARN[0000] Found incomplete layer "83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6", deleting it 
Error: copying system image from manifest list: trying to reuse blob sha256:17d9b03569f3b85e8d624ccd96d1ce96e521aa316064887ced8939778f0e0199 at destination: looking for layers with digest "sha256:17d9b03569f3b85e8d624ccd96d1ce96e521aa316064887ced8939778f0e0199": stat /var/lib/containers/storage/btrfs/subvolumes/83d8939f7b403b37922161daa1f8df561eaa86deb6966e9ce72a46c22d6573d6: no such file or directory

It's not a 100% match, but likely close enough to follow the broken code path. It looks to me like podman stumbles while trying to reuse a nonexistent image and is not able to recover.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

benipeled commented 1 year ago

Same here (with overlay); I didn't manage to reproduce it :/

Output of podman version:

Client:       Podman Engine
Version:      4.4.1
API Version:  4.4.1
Go Version:   go1.19.5
Built:        Thu Feb  9 12:58:53 2023
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.5-1.fc37.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: '
  cpuUtilization:
    idlePercent: 99.91
    systemPercent: 0.03
    userPercent: 0.06
  cpus: 32
  distribution:
    distribution: fedora
    variant: server
    version: "37"
  eventLogger: journald
  hostname: godzilla
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.1.11-200.fc37.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 201027629056
  memTotal: 202718363648
  networkBackend: cni
  ociRuntime:
    name: crun
    package: crun-1.8-1.fc37.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8
      commit: 0356bf4aff9a133d655dc13b1d9ac9424706cac4
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-8.fc37.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 0h 29m 12.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /jenkins/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /jenkins/.local/share/containers/storage
  graphRootAllocated: 1023708393472
  graphRootUsed: 10074681344
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /jenkins/.local/share/containers/storage/volumes
version:
  APIVersion: 4.4.1
  Built: 1675940333
  BuiltTime: Thu Feb  9 12:58:53 2023
  GitCommit: ""
  GoVersion: go1.19.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.1

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

joemccall86 commented 1 year ago

If it helps, I was able to reproduce this error when I upgraded my hard drive: I unmounted it, took a disk image using GNOME Disks, restored it to the new drive, used fdisk/btrfs to resize the new filesystem, and this happened.

paolo-depa commented 1 year ago

Same here, following

thaycafe commented 1 year ago

Same here. I fixed it by removing the reference to the layer (which doesn't exist) from the /var/lib/containers/storage/btrfs-layers/layers.json file.

I don't know if there's a better way to solve it, but now at least I can manage my containers without losing data.
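
In case it helps others, a sketch of that edit (it assumes layers.json is a JSON array of objects with an "id" field; back the file up first):

layer=236fcd368394d7094f40012a131c301d615722e60b25cb459efa229a7242041b  # the missing layer from the error message
f=/var/lib/containers/storage/btrfs-layers/layers.json
cp "$f" "$f.bak"
# drop the entry for the missing layer
jq --arg id "$layer" 'map(select(.id != $id))' "$f.bak" > "$f"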

rhatdan commented 1 year ago

Is there something here that Podman or containers/storage needs to do, or are the workarounds good enough?

kissgyorgy commented 1 year ago

This happens to me a lot, but with ZFS, so maybe the problem is not in the storage driver but in Podman?

rhatdan commented 1 year ago

The ZFS storage driver, or the filesystem being on ZFS?

kissgyorgy commented 1 year ago

Yes, I meant the storage driver: Podman can get into an inconsistent state with the ZFS filesystems it created, so sometimes I have to manually destroy them and restart the containers.

rhatdan commented 1 year ago

Sadly, we have no expertise in the ZFS filesystem as a storage driver. We would recommend using overlay on top of a ZFS lower layer.

kousun12 commented 6 months ago

I'm running into a similar issue; no commands work. I can't even run podman info without it hanging on:

$ podman info
WARN[0000] Found incomplete layer "7b3d606e00285c5f48522a240c092f3881b4bb4f88577fcceaf841f5fa1ea51e", deleting it

I'm using the overlay driver and podman 4.9.3. I've also completely removed /var/lib/containers and am somehow still seeing this error.
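
One thing I still want to double-check: when running rootless, the graph root is under ~/.local/share/containers/storage rather than /var/lib/containers, so removing the latter may not touch the storage actually in use. A sketch to print the active graph root (the template field name is an assumption):

podman info --format '{{.Store.GraphRoot}}'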

tbertels commented 4 months ago

This (sporadic "Found incomplete layer" error resulting in a broken container engine) also happens with ext4 (using overlay). Does a new issue need to be opened?

giuseppe commented 4 months ago

There were changes to c/storage in the last year that could have fixed this issue.

Yours might be a different one. What version of Podman are you using?

tbertels commented 4 months ago

Client:       Podman Engine
Version:      4.9.3
API Version:  4.9.3
Go Version:   go1.22.0
Git Commit:   8d2b55ddde1bc81f43d018dfc1ac027c06b26a7f-dirty
Built:        Fri Feb 16 17:18:03 2024
OS/Arch:      linux/amd64

firecat53 commented 3 months ago

This is the first time it's happened to me in a long time of running podman on ZFS. Podman containers failed to run after a podman update this morning with this error: WARN[0000] Found incomplete layer "df9aa3b5ac8a6adcfd7cf80962911b1576c2b41053960382fbad34565b275a08", deleting it.

I had to figure out which container that layer referred to, then delete the references to that container in /var/lib/containers/storage/zfs-layers/volatile-layers.json and /var/lib/containers/storage/zfs-containers/volatile-containers.json, and delete the container directory from /var/lib/containers/storage/zfs-containers/xxxx.
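
A rough sketch of that cleanup (it assumes both JSON files are arrays of objects with an "id" field; <container-id> stands for the container found to reference the broken layer):

layer=df9aa3b5ac8a6adcfd7cf80962911b1576c2b41053960382fbad34565b275a08
cd /var/lib/containers/storage
# remove the broken layer record
cp zfs-layers/volatile-layers.json{,.bak}
jq --arg id "$layer" 'map(select(.id != $id))' zfs-layers/volatile-layers.json.bak > zfs-layers/volatile-layers.json
# remove the record of the container that used it, then its directory
cp zfs-containers/volatile-containers.json{,.bak}
jq 'map(select(.id != "<container-id>"))' zfs-containers/volatile-containers.json.bak > zfs-containers/volatile-containers.json
rm -rf "zfs-containers/<container-id>"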

Podman info (zfs storage backend on zfs).

store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 7
    stopped: 0
  graphDriverName: zfs
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 643926851584
  graphRootUsed: 44644958208
  graphStatus:
    Compression: zstd
    Parent Dataset: rpool/nixos/var/lib
    Parent Quota: "no"
    Space Available: "599281942528"
    Space Used By Parent: "143338541056"
    Zpool: rpool
    Zpool Health: ONLINE
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 40
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.0.3
  Built: 315532800
  BuiltTime: Mon Dec 31 16:00:00 1979
  GitCommit: ""
  GoVersion: go1.22.4
  Os: linux
  OsArch: linux/amd64
  Version: 5.0.3

ravurvi20 commented 1 month ago

I get the same issue. Every time I try to delete the files manually, it says "permission denied". Can anyone help me reproduce the issue?