cri-o / cri-o

Open Container Initiative-based implementation of Kubernetes Container Runtime Interface
https://cri-o.io
Apache License 2.0

Getting Intermittent ErrImagePull errors which don't go away once they appear on a given node #5793

Open · matthias50 opened this issue 2 years ago

matthias50 commented 2 years ago

What happened?

After updating nodes in our fleet of k8s clusters (a few thousand nodes in aggregate) from Kubernetes 1.20/crio v1.20.6 to Kubernetes 1.21/crio 1.21.6, we observe that nodes sometimes get into a state where a particular image can no longer be pulled on that node. We get error messages like:

kubelet[6135]: E0407 15:26:43.115431    6135 kuberuntime_manager.go:864] init container &Container{Name:chmod-volume,Image:redacted.com/redacted/redacted-image:redacted-tag,Command:[sh -c chmod 1777 /tmp],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{209715200 0} {<nil>}  BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{104857600 0} {<nil>} 100Mi BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:tmp-volume,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod redacted-pod_redacted_namespace(8cf5a89c-98a5-4b57-9802-cb72ddf25a4a): ErrImagePull: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory
kubelet[6135]: E0407 15:26:43.115502    6135 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"chmod-volume\" with ErrImagePull: \"rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob \\\"sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a\\\": error creating layer with ID \\\"62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f\\\": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory\"" pod="redacted-namespace/redacted_pod" podUID=8cf5a89c-98a5-4b57-9802-cb72ddf25a4a
kubelet[6135]: E0407 15:31:45.238422    6135 kuberuntime_manager.go:864] init container &Container{Name:chmod-volume,Image:redacted.com/redacted/redacted-image:redacted-tag,Command:[sh -c chmod 1777 /tmp],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{209715200 0} {<nil>}  BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{104857600 0} {<nil>} 100Mi BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:tmp-volume,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod redacted-pod_redacted_namespace(8cf5a89c-98a5-4b57-9802-cb72ddf25a4a): ErrImagePull: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory
kubelet[6135]: E0407 15:31:45.238483    6135 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"chmod-volume\" with ErrImagePull: \"rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob \\\"sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a\\\": error creating layer with ID \\\"62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f\\\": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory\"" pod="redacted-namespace/redacted_pod" podUID=8cf5a89c-98a5-4b57-9802-cb72ddf25a4a

If we attempt to manually pull the image using crictl, we get a similar error:

crictl pull redacted.com/redacted/redacted-image:redacted-tag
FATA[0001] pulling image: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory

The image in question doesn't show up in crictl images listings.

I can pull the same image from another node in the same cluster using crictl pull.
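
For completeness, those two checks look roughly like this (image name redacted as above):

```console
# on the affected node: the image is not in the local store
crictl images | grep redacted-image

# on another node in the same cluster: the same pull succeeds
crictl pull redacted.com/redacted/redacted-image:redacted-tag
```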

What did you expect to happen?

Image pulls for a pod eventually succeed.

How can we reproduce it (as minimally and precisely as possible)?

Unsure. Some of the worst impacted workloads on these nodes provision quite a few pods very quickly which all reference the same image, so this may be an ingredient. However, we also intermittently see daemonset pods get hit by this same error.

Anything else we need to know?

No response

CRI-O and Kubernetes version

```console
$ crio --version
crio version 1.21.6
Version:       1.21.6
GitCommit:     c03e400ec69fb22256a1b9c8f3ae8011a3539582
GitTreeState:  clean
BuildDate:     2022-03-16T17:45:34Z
GoVersion:     go1.16.12
Compiler:      gc
Platform:      linux/amd64
Linkmode:      dynamic
```

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.10", GitCommit:"a7a32748b5c60445c4c7ee904caf01b91f2dbb71", GitTreeState:"clean", BuildDate:"2022-02-16T11:24:04Z", GoVersion:"go1.16.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.10", GitCommit:"a7a32748b5c60445c4c7ee904caf01b91f2dbb71", GitTreeState:"clean", BuildDate:"2022-02-16T11:18:16Z", GoVersion:"go1.16.14", Compiler:"gc", Platform:"linux/amd64"}
```

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"

$ uname -a
Linux ip-10-117-235-108.eu-west-1.compute.internal 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```

Additional environment details (AWS, VirtualBox, physical, etc.)

Cluster is running on AWS and was deployed using custom Terraform and puppet code.
haircommander commented 2 years ago

@nalind @giuseppe @rhatdan any ideas here?

matthias50 commented 2 years ago

Per @haircommander's comment on Slack, here is what happened when I tried to pull the image with podman. It fails in a similar way:

podman pull redacted.com/redacted/redacted-image:redacted-tag
Trying to pull redacted.com/redacted/redacted-image:redacted-tag...
Getting image source signatures
Copying blob 6a591b1a044c skipped: already exists
Copying blob 199641e53e23 skipped: already exists
Copying blob 6ce44203b5a1 done
Copying blob 54edd6698670 done
Copying blob 253312ff3567 done
Copying blob 60514338db83 done
Copying blob 64f61668fcda done
Error: writing blob: adding layer with blob "sha256:6ce44203b5a1a923806e6772cd43910d1b2ca31ee48b55fdd137bbfdcb280b5e": error creating layer with ID "2d0605e1888dc60a1c795112f0ea15489a80285671e4e66910b5111c32adecb2": Stat /var/lib/containers/storage/overlay/be54beb5feaaf1292910402ad742dd1e963f34a61db2b19a898321292bae239d: no such file or directory

crictl pull on the same node with the same image fails with:

FATA[0010] pulling image: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:6ce44203b5a1a923806e6772cd43910d1b2ca31ee48b55fdd137bbfdcb280b5e": error creating layer with ID "2d0605e1888dc60a1c795112f0ea15489a80285671e4e66910b5111c32adecb2": Stat /var/lib/containers/storage/overlay/be54beb5feaaf1292910402ad742dd1e963f34a61db2b19a898321292bae239d: no such file or directory
haircommander commented 2 years ago

also, what version of podman?

matthias50 commented 2 years ago

also, what version of podman?

3.4.2

flouthoc commented 2 years ago

I have been trying to reproduce this issue for a while. The cause looks similar (though not identical) to the one explained in https://github.com/containers/storage/pull/1033, but here it looks like the parent layer was committed in an erroneous state, or was removed incompletely by another process, possibly in a racy manner.

But I'm unable to reproduce this issue exactly. One reason could be that c/storage v1.32.6 is used by crio v1.20.6 but no podman release used this version of c/storage, or it could be some edge-case scenario, so I am unable to reproduce this exact scenario.

@matthias50 Could you try reproducing this with podman 3.4.2, but clean all the storage first using podman system prune --all --force on the affected node? I am somewhat convinced that this might not exist on newer versions of c/storage, so it is worth trying.
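
In other words, the experiment would look roughly like this on the affected node (the image name is a placeholder):

```console
# clean local storage, then retry the pull with podman only
podman system prune --all --force
podman pull redacted.com/redacted/redacted-image:redacted-tag
```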

PS: Just a disclaimer: I have not tried to reproduce this issue with the exact environment (Kubernetes 1.20/crio v1.20.6 updated to Kubernetes 1.21/crio 1.21.6) on a multi-node setup.

giuseppe commented 2 years ago

it could also depend on that particular image.

Is there any way you could share it with us?

matthias50 commented 2 years ago

But I'm unable to reproduce this issue exactly. One reason could be that c/storage v1.32.6 is used by crio v1.20.6 but no podman release used this version of c/storage, or it could be some edge-case scenario, so I am unable to reproduce this exact scenario.

@flouthoc did you mean to say 1.21.6 here? 1.20.6 is the version we were using previously.

matthias50 commented 2 years ago

it could also depend on that particular image. Is there any way you could share it with us?

@giuseppe I'm pretty sure we are seeing this with multiple images, but there may be some common thread/base image or something. I'll have to do some digging on this. Not sure if we can share these images or not, but I'll look into it.

flouthoc commented 2 years ago

@matthias50 Yes, both 1.21.6 and 1.20.6 are using a significantly older version of c/storage. If you could try the small experiment suggested here https://github.com/cri-o/cri-o/issues/5793#issuecomment-1098775717, it would help confirm my hypothesis.

@matthias50 Could you try reproducing this with podman 3.4.2, but clean all the storage first using podman system prune --all --force on the affected node? I am somewhat convinced that this might not exist on newer versions of c/storage, so it is worth trying.

matthias50 commented 2 years ago

@flouthoc, I'll update our incident response docs to tell people to try this the next time it happens on a non-production node. Hopefully they read the docs. :) Doing this on a production node (especially without having tried it on a non-prod node in this state) is too risky.

matthias50 commented 2 years ago

@flouthoc, I tried cleaning as you suggested and the image pull still fails with crictl and with podman:

# podman --version
podman version 3.4.2

# podman system prune --all --force
Deleted Images
1ff279da1954372d24eb98a72632c29da79185119b6ad8931cc947b9a533ba7c
52a41d719d6cfe0dfc84853f45c433e44f841d7e2bb66145eca1db291d3e0c4a
e4d8f473280680ea8c606a6f81a2284f1dfdb342d0b8cfb5f11d46e9c3bbf6c0
93b269a798d6f7fe12ea356800b0cc9ef94c5660a37942eb84e270fd8ca66f4e
251179b9af6ae4bbd5c28871bf9eac9d741dc5d612a6b45ba9b34a6f75ff752d
2f40debc6807498117faaef4591d235c43f055fb2b84e9545da2227badd2d332
2534fee3f570f77d931057a923c1d8301f037976388d7cb52b74f2f2a8c57947
779d106b0351a7d65d7c673dc2b2f3c7528fa57ce425fff2f4e2a43497e250bb
6156d510402fb6b81f25b5e1b25cea830bc3a1b85a9e13544216a182af29cb38
d16c1c38aed4c19548110fff2190b0414fc2b3c6dabc5b4685b7679e9e2e7c9d
d87694d40d4df1dfee693feee6eb227846c6b582a49c1559e0bb9fc6e137731f
dc4199856a1b982c8fe81db383227c0cfd12d57f7435f09b2c599643948f8649
20e6f640c05d4d9708ba46ab5fcd376a08b6625a08306765cd075ab313e98267
fdb26a3958405924da43eb75cbec07ee42304cc3ccdc76608e25a31045307c5b
f95df455cb744f07de8edad05c5f373f5248186ff0a47137d17cf784793caea5
836414ff49a389b922b04abaaabb4034225e6217506e3404ca6cfe45b0551152
2a812f03fb29a419937f3d4a003a78fd3a405534fd7db2db0f3bc709378a1ea5
22b1f2c4f7df3a7ed6cb338f967925c4deca9c1540c82947796039e11d1976ca
Total reclaimed space: 28.09GB

# crictl pull redacted.com/redacted/redacted-image:redacted-tag
FATA[0001] pulling image: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory
# podman pull redacted.com/redacted/redacted-image:redacted-tag
Trying to pull redacted.com/redacted/redacted-image:redacted-tag...
Getting image source signatures
Copying blob e37d2a49c27a skipped: already exists
Copying blob 224209d03549 skipped: already exists
Copying blob f48c65d739e3 skipped: already exists
Copying blob 6a61ef70c48c skipped: already exists
Copying blob c01be7b8d39b skipped: already exists
Copying blob 4f41d686a412 skipped: already exists
Copying blob f3747d810064 skipped: already exists
Copying blob 1ad11f6c7f8c skipped: already exists
Copying blob ed18e28c67fd skipped: already exists
Copying blob 962589ce2065 skipped: already exists
Copying blob bb529a6f9580 skipped: already exists
Copying blob 8f0cbd047b92 skipped: already exists
Copying blob 56f5a4bef208 skipped: already exists
Copying blob 2e3f9cf6b6bb skipped: already exists
Copying blob 0e92f1c3a218 done
Copying blob f0c74ddadfd6 done
Copying blob b63beb2e954b done
Copying blob 7ddc9953a13c done
Copying blob ecda1bc969d7 done
Copying blob 7741a55d4279 done
Copying blob 0738ef7ab41d done
Copying blob a575eb816d6c done
Copying blob ceb596d4af77 done
Copying blob 5082350cbb6e done
Copying blob 9ba7fa5cd964 done
Copying blob 820bbbaba607 done
Error: writing blob: adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory
flouthoc commented 2 years ago

@matthias50 Sorry, I was not clear enough :\ My intention was to not invoke crictl pull at all, but only podman pull. Still, thanks for trying this out; I'll try to reproduce this with the suggested crictl version.

matthias50 commented 2 years ago

@flouthoc Ugh. And of course I did the crictl pull first. When this happens again, I'll try to catch it and perform just the podman pull (like I wrote in the instructions for others, but didn't follow myself :laughing: ).

That said, when I did the podman system prune and the pulls, there was nothing running on the node except some daemonsets and static pods, and the node was tainted to prevent new pods from being scheduled there. So, to me, it seems like if podman system prune had cleaned up the local corruption created by some earlier race condition, then I would have expected pulls of the broken image with both crictl and podman to succeed on a quiet node after performing the cleanup.

matthias50 commented 2 years ago

@flouthoc, this happened again, and this time we tried the correct sequence of podman system prune followed by podman pull.

This time podman system prune didn't do anything (probably because the pods were still trying to run):

podman system prune --all --force
Deleted Images
Total reclaimed space: 0B

and podman pull failed in the same way as before.

saschagrunert commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

flouthoc commented 1 year ago

I am actually able to reproduce this, but only by manually injecting failures, i.e. skipping certain parent layers, by instrumenting the code here: https://github.com/containers/storage/blob/main/drivers/overlay/overlay.go#L911

func (d *Driver) create(id, parent string, opts *graphdriver.CreateOpts, disableQuota bool) (retErr error) {

    dir := d.dir(id)
    if id == "any parent layer" {
        fmt.Printf("not commiting this parent\n")
        os.RemoveAll(dir)
        return nil
    }

So I get

podman pull nginx
Resolved "nginx" as an alias (/home/fl/.cache/containers/short-name-aliases.conf)
Trying to pull docker.io/library/nginx:latest...
Getting image source signatures
Copying blob fe0ef4c895f5 done  
Copying blob f4407ba1f103 done  
Copying blob 935cecace2a0 done  
Copying blob fe0ef4c895f5 done  
Copying blob f4407ba1f103 done  
Copying blob 935cecace2a0 done  
Copying blob b85a868b505f done  
Copying blob 4a7307612456 done  
Copying blob fe0ef4c895f5 done  
Copying blob f4407ba1f103 done  
Copying blob 935cecace2a0 done  
Copying blob b85a868b505f done  
Copying blob 4a7307612456 done  
Copying blob 8f46223e4234 done  
Error: copying system image from manifest list: writing blob: adding layer with blob "sha256:4a7307612456a7f65365e1da5c3811df49cefa5a2fd68d8e04e093d26a395d60": stat /home/fl/.local/share/containers/storage/overlay/da1f8336da190a692b3542627504aec6d743f9017bbca328c30144e81d0476d7: no such file or directory

However, I had to manually introduce this failure. I still think this should not persist in newer versions of c/storage, since on newer versions the parent layer gets recreated on the pull if it is missing. I have not yet looked into https://github.com/containers/storage/tree/v1.32.3, which is the version affected for the requested crio version, but I suspect the issue could be in the older c/storage.

haircommander commented 1 year ago

@flouthoc are you able to help pinpoint the commits that fix this in 1.32.3? Maybe we can backport them to affected branches in cri-o.

matthias50 commented 1 year ago

@haircommander, @flouthoc, we have recently updated to crio 1.22.5. Does this crio have the updated c/storage which should not have this issue? If not, which version does?

haircommander commented 1 year ago

@flouthoc if you can point me to the storage commit that recreates the layers, I can find which version it is in cri-o

flouthoc commented 1 year ago

@haircommander @matthias50 I was unable to reproduce this when I tried last time without manual instrumentation of the code, but sure, let me try again and see if I can reproduce it and point to the exact commit. Thanks

twojenski commented 1 year ago

I'm still getting this same problem:


Trying to pull docker.io/pihole/pihole:latest...
Getting image source signatures
Copying blob c48dab7e362c skipped: already exists
Copying blob 280ee136bb30 skipped: already exists
Copying blob dc1f00a5d701 skipped: already exists
Copying blob 5cf51dc31d77 skipped: already exists
Copying blob 4f4fb700ef54 skipped: already exists
Copying blob 984b787b496f skipped: already exists
Copying blob 5dc7355bea16 done
Copying blob 8c6657203dd0 done
Copying blob 3734d50778a6 done
Error: writing blob: adding layer with blob "sha256:8c6657203dd0e0345377939ca0b052448cac7aff8ecd585cccf5966d8625026f": error creating layer with ID "1ef6857b401ce939a6ed3ee16ed2c19b5e110215f4e9b327f21b1f401820f49c": Stat /var/lib/containers/storage/vfs/dir/d341f455a6ab37853a590a92d3b74178809399fefc85d3485735d7f5406a5286: no such file or directory

My podman version:


root@UDM-SE-PRO:/# podman version
Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.16.15
Git Commit:   f6526ada1025c2e3f88745ba83b8b461ca659933-dirty
Built:        Thu Jun 23 14:19:44 2022
OS/Arch:      linux/arm64

I manually created /var/lib/containers/storage/vfs/dir/d341f455a6ab37853a590a92d3b74178809399fefc85d3485735d7f5406a5286 and now it worked.

Any suggestions on a permanent fix?

flouthoc commented 1 year ago

@twojenski Would it be possible to share a small reproducer or environment details? I pulled the exact same podman and tried reproducing on vfs storage, but I'm not able to reproduce it. I wonder if it is something in my environment that keeps me from reproducing the original issue at all.

[fl@fedora bin]$ ./podman --storage-driver vfs rmi --all -f
Untagged: docker.io/library/alpine:latest
Deleted: d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs pull alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob 530afca65e2e done  
Copying config d7d3d98c85 done  
Writing manifest to image destination
Storing signatures
d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs rmi --all -f
Untagged: docker.io/library/alpine:latest
Deleted: d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs pull alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob 530afca65e2e done  
Copying config d7d3d98c85 done  
Writing manifest to image destination
Storing signatures
d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs rmi --all -f
Untagged: docker.io/library/alpine:latest
Deleted: d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs pull alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob 530afca65e2e done  
Copying config d7d3d98c85 done  
Writing manifest to image destination
Storing signatures
d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs rmi --all -f
Untagged: docker.io/library/alpine:latest
Deleted: d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman --storage-driver vfs pull alpine
Resolved "alpine" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob 530afca65e2e done  
Copying config d7d3d98c85 done  
Writing manifest to image destination
Storing signatures
d7d3d98c851ff3a95dbcb70ce09d186c9aaf7e25d48d55c0f99aae360aecfd53
[fl@fedora bin]$ ./podman version
Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.16.11
Git Commit:   f6526ada1025c2e3f88745ba83b8b461ca659933
Built:        Mon Aug  8 12:02:28 2022
OS/Arch:      linux/amd64
flouthoc commented 1 year ago

@matthias50 Is it similar to your case? Does the stuck image get resolved if you create the missing directory manually?

terinjokes commented 1 year ago

I'm seeing this intermittently on cri-o 1.24.1. Like OP, I'm using the overlay driver. Restarting cri-o (and the kubelet) seems to have no effect.

crio --version

```
$ crio --version
crio version 1.24.1
Version:        1.24.1
GitCommit:      93477cd21f579303bf2e5b29bdb4ae76d29252bf
GitTreeState:   clean
BuildDate:      2022-06-22T20:11:19Z
GoVersion:      go1.18.3
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:      seccomp, selinux
SeccompEnabled: true
AppArmorEnabled: false
```
matthias50 commented 1 year ago

@flouthoc, it does seem similar. IIRC, when I created the directory, it fixed one error, but then there were more, and I couldn't find a way to reliably fix nodes broken in this way. Instead, we put in place automation which watches for pods being added or modified. When these events occur, it performs further checks, and if it detects a pod stuck initializing with a corresponding event containing the text "error creating layer with ID", it taints the node and deletes the pod. Then the stuck pod can be rescheduled somewhere else.

Here is the exact method we use to identify pods which are "stuck" with this issue:

func podHasCorruptedImageLayer(clientset kubernetes.Interface, pod *corev1.Pod) bool {
    for _, containerStatus := range pod.Status.ContainerStatuses {
        if containerStatus.State.Waiting != nil && containerStatus.State.Waiting.Reason == "PodInitializing" {
            events, _ := clientset.CoreV1().Events(pod.Namespace).List(context.TODO(), metav1.ListOptions{FieldSelector: fmt.Sprintf("involvedObject.name=%s", pod.Name), TypeMeta: metav1.TypeMeta{Kind: "Pod"}})
            for _, item := range events.Items {
                if strings.Contains(item.Message, "error creating layer with ID") {
                    return true
                }
            }
        }
    }
    return false
}
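
For context, the remediation the automation then performs is roughly equivalent to these manual steps (a hedged sketch, not our actual controller code; the taint key and names are placeholders):

```console
# taint the node so new pods stop being scheduled there, then delete the stuck pod
kubectl taint nodes <node-name> corrupted-image-layer=true:NoSchedule
kubectl delete pod <stuck-pod-name> -n <namespace>
```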
matthias50 commented 1 year ago

@flouthoc, I see from this comment that you are using vfs for the replication attempts:

I pulled exact same podman and tried reproducing on vfs storage but I'm not able to reproduce

Perhaps this issue only occurs when overlayfs is in use, as suggested by @terinjokes.

flouthoc commented 1 year ago

@flouthoc, I see from this comment that you are using vfs for the replication attempts:

@matthias50 I only used vfs to reproduce the scenario shared here https://github.com/cri-o/cri-o/issues/5793#issuecomment-1207440047 by @terinjokes. I guess the issue happens with both vfs and overlay, but for some reason I am unable to reproduce this locally with the shared version.

fschrempf commented 1 year ago

I'm seeing this one with podman on my local machine after a filesystem error that turned my Btrfs partition read-only. After a reboot the filesystem works fine again, but I guess there might be some corruption affecting podman. At the moment I have no idea how to recover from this.

> podman -v
podman version 4.3.0

> podman pull docker.io/kontron/docker-ktn-docs 
Trying to pull docker.io/kontron/docker-ktn-docs:latest...
Getting image source signatures
Copying blob 50024b0106d5 skipped: already exists  
Copying blob 8659dae93050 done  
Copying blob ff95660c6937 skipped: already exists  
Copying blob 9c7d0e5c0bc2 done  
Copying blob 29c4fb388fdf done  
Copying blob 16ea0e8c8879 skipped: already exists  
Copying blob 1da0ab556051 done  
Copying blob e92ae9350d4a done  
Copying blob c648cb7fc575 done  
Copying blob 28d01d9a098a done  
Copying blob ae3e67e17159 done  
Copying blob 4abcc637eff1 done  
Copying blob 4f4fb700ef54 done  
Copying blob 10744c13eff9 done  
Copying blob 4f4fb700ef54 skipped: already exists  
Copying blob 8b84068717ab done  
Copying blob 9e02b3ce0c17 done  
Copying blob 61d29492cd1e done  
Copying blob 7102a8bcb7c5 done  
Error: writing blob: adding layer with blob "sha256:9c7d0e5c0bc204b3a36e3f8ff320741da0bd0225e0a67e224c6265c1e208f80a": creating layer with ID "c203a5e0f15b2b1bc9f87453a1473ad7481a724c5e5414c2551f7c3cbf3de3e5": Stat /home/frieder/.local/share/containers/storage/overlay/51f4237d89596ad592cdc809ea897fc002ce3ae38e903e623cd2ae0835cc87db: no such file or directory
fschrempf commented 1 year ago

At the moment I have no idea how to recover from this.

Doing a hard podman system reset did the trick. I had no important data on this host so this was not a problem in this case.

terinjokes commented 1 year ago

I took a look at this today, and based on the error message and strace of the crio service, it looks to me like this is failing here, while looking up the parent layer's directory: https://github.com/cri-o/cri-o/blob/642f60c471b6746652b1671637cbd17a07da5fcf/vendor/github.com/containers/storage/drivers/overlay/overlay.go#L941-L944

I'm not sure why crio thinks the parent layer is already fetched, or how to go about getting it to fetch again.

Edit: After re-reading the comment from @flouthoc above, I looked at the vfs driver, which has the exact same stat block.

PaulFurtado commented 1 year ago

We just moved our clusters over from docker to cri-o and we started hitting this on one node today.

Both crictl pull and podman pull on the node think that the layer is already downloaded:

# crictl pull docker.hubteam.com/vitess-internal/upgraded-tablet:4388
Image is up to date for docker.hubteam.com/vitess-internal/upgraded-tablet@sha256:9443c2cfee872f2d74af66235bc1e59843dc77d34cbce285519a01537b317047
# podman pull docker.hubteam.com/vitess-internal/upgraded-tablet:4388
Trying to pull docker.hubteam.com/vitess-internal/upgraded-tablet:4388...
Getting image source signatures
Copying blob c69a0f32cee8 skipped: already exists  
Copying blob ff3f6419bc6e skipped: already exists  
Copying blob 1924e9fe790e skipped: already exists  
Copying blob cf5e487f6a44 skipped: already exists  
Copying blob e92ac2adfcb0 skipped: already exists  
Copying blob 5199db4ecb15 skipped: already exists  
Copying blob e1da3660d129 skipped: already exists  
Copying blob 2b646fb7f17c skipped: already exists  
Copying blob 26f10ba24b59 skipped: already exists  
Copying blob 4ba8cb06dbda skipped: already exists  
Copying blob d3dc35a669f4 skipped: already exists  
Copying blob a796f82fe7a4 skipped: already exists  
Copying blob 7f0c92cd04b1 skipped: already exists  
Copying blob b2643b7c3954 skipped: already exists  
Copying blob c9a4177032d9 skipped: already exists  
Copying blob 7a93529aae7e skipped: already exists  
Copying blob ee46f08a091a skipped: already exists  
Copying blob 03da66024bd6 skipped: already exists  
Copying config b9063df52f done  
Writing manifest to image destination
Storing signatures
b9063df52f2278ce375289b4c0183d478b05ff2e33faa787da5740422db30c27

But then attempting to run it results in:

# podman run --net=host -it docker.hubteam.com/vitess-internal/upgraded-tablet:4388 bash
Error: stat /usr/share/hubspot/containers/storage/overlay/1ea2a31ba8d8f1df5645dd2ca11f65a4aa5540116e3a061ff98812b5299db815: no such file or directory

We are using cri-o 1.23.4 from the static builds of cri-o published on the cri-o website and the podman version is podman-4.0.2-8.module_el8.7.0+1197+29cf2b8e from the centos 8 stream repos.

What other debugging information would be helpful here?

PaulFurtado commented 1 year ago

In my case, doing:

crictl rmi docker.hubteam.com/vitess-internal/upgraded-tablet:4388

actually did fix the problem, but from reading the other comments on this issue, I am assuming that only worked because:

  1. There were no existing containers on the system running with that image
  2. The problematic layer was not shared by any other images on the system
david-pech-sestraemmy-cz commented 1 year ago

We have experienced the same behaviour on Oracle Cloud managed Kubernetes. crio version: INFO[2023-03-23 09:21:52.243495403Z] Starting CRI-O, version: 1.25.1-111.el8, git: unknown(clean)

PaulFurtado commented 1 year ago

I currently have a node in this state and in this case, the image that is failing to pull is not already pulled, so I'm guessing the layer in question is in use by another image.

This is cri-o 1.23.4. podman 4.0.2 is also installed on the system and it also fails:

# podman pull docker.hubteam.com/vitess-internal/upgraded-tablet:4434
Trying to pull docker.hubteam.com/vitess-internal/upgraded-tablet:4434...
Getting image source signatures
Copying blob e9d7efbb3d37 skipped: already exists  
Copying blob 2d1e21591246 skipped: already exists  
Copying blob 6a1cbf9afc22 skipped: already exists  
Copying blob ff3f6419bc6e skipped: already exists  
Copying blob e92ac2adfcb0 skipped: already exists  
Copying blob 231b4248e2da skipped: already exists  
Copying blob 88c52981a657 skipped: already exists  
Copying blob 8c10ea2dcb59 skipped: already exists  
Copying blob ff917acd54b0 skipped: already exists  
Copying blob ed151504079f skipped: already exists  
Copying blob cc40a525c997 skipped: already exists  
Copying blob d842307651c3 skipped: already exists  
Copying blob 5199db4ecb15 skipped: already exists  
Copying blob 1e73d59146b0 done  
Copying blob 8408644a9cc0 done  
Copying blob 34ec39b6e3f5 done  
Copying blob c18c2d06c99f done  
Copying blob 464ea4e8d9fa done  
Error: writing blob: adding layer with blob "sha256:34ec39b6e3f542082149b23b1d0481083a5ad896dc929138818214d6c4a57097": error creating layer with ID "a39bcc2ef8e5aec7af3980256ac406e71b87888d67d21e108cbe1d1b3dc6d419": Stat /usr/share/hubspot/containers/storage/overlay/662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8: no such file or directory

Because someone suggested that a newer version of containers/image may fix this, I built and installed podman 4.4.4 on this system and attempted to pull there too. The error message changes slightly, but the same problem still exists:

Error: copying system image from manifest list: writing blob: adding layer with blob "sha256:34ec39b6e3f542082149b23b1d0481083a5ad896dc929138818214d6c4a57097": creating layer with ID "a39bcc2ef8e5aec7af3980256ac406e71b87888d67d21e108cbe1d1b3dc6d419": Stat /usr/share/hubspot/containers/storage/overlay/662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8: no such file or directory

Would a copy of the blob-info-cache-v1.boltdb file help here?

haircommander commented 1 year ago

there were a number of c/storage improvements around image pulling that landed in cri-o 1.26.2. Are folks able to try it out to see if that helps?

PaulFurtado commented 1 year ago

@haircommander is it fine to run cri-o 1.26 with kubernetes 1.23? The only reason we've been sticking to 1.23 is because I thought the cri-o version needed to match the kubernetes version.

haircommander commented 1 year ago

I'm afraid that won't work; you'd need to also update to kube 1.26.

Would a copy of the blob-info-cache-v1.boltdb file help here?

@flouthoc would that help you investigate here?

PaulFurtado commented 1 year ago

Are the fixes you're talking about in cri-o 1.26.2 actually just a version update of containers/image or were there also cri-o code changes?

I'm asking because I could try building podman against the latest containers/image and testing it on the node that I have in this state to see if it solves the problem.

haircommander commented 1 year ago

Are the fixes you're talking about in cri-o 1.26.2 actually just a version update of containers/image or were there also cri-o code changes?

just c/image and c/storage updates

I'm asking because I could try building podman against the latest containers/image and testing it on the node that I have in this state to see if it solves the problem.

True! Though, have you been able to reproduce from a podman pull? I think the entity pulling may need to have the updated code.

PaulFurtado commented 1 year ago

@haircommander I have not reproduced podman getting the system into the broken state, but when the system is in this broken state, podman pull is also broken.

Pulling using podman 4.4 also does not correct the problem. I also tried updating it from containers/image 1.24.1 to 1.24.2 to match cri-o and that also doesn't resolve the problem, and it looks like podman 4.4 is already using a newer version of containers/storage than cri-o is.

laxmanvallandas commented 1 year ago

This has become more frequent than intermittent, especially after updating to crio 1.24.2 and Kubernetes 1.24.11.

crictl pull redacted-image:redacted-tag
E0403 09:41:59.186684 2261006 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Unknown desc = writing blob: adding layer with blob \"sha256:redacted\": error creating layer with ID \"redacted\": Stat /var/lib/containers/storage/overlay/redacted: no such file or directory" image="[redacted-image:redacted-tag]"
FATA[0001] pulling image: rpc error: code = Unknown desc = writing blob: adding layer with blob "sha256:redacted": error creating layer with ID "redacted": Stat /var/lib/containers/storage/overlay/redacted: no such file or directory
terinjokes commented 1 year ago

Are the fetched layer IDs stored in that boltdb database? If so, is it possible for this to be a data race?

laxmanvallandas commented 1 year ago

@terinjokes, we are using Amazon S3 as the blob store.

flouthoc commented 1 year ago

I'm afraid that won't work you'd need to also update to kube 1.26.

Would a copy of the blob-info-cache-v1.boltdb file help here?

@flouthoc would that help you investigate here?

@haircommander Upon thinking about it, no, not really. blob-info-cache is just the database; my host does not have those blobs.

Edit: Let me check this closely again and discuss with the storage maintainers. I think this problem overlaps with some similar issues that are getting fixed upstream. I'll update here.

matthias50 commented 1 year ago

I really hope this can be fixed soon and in 1.24, not some future version. This bug is over a year old and things seem to be getting worse, not better, given our experience updating from 1.23 to 1.24. Right now, we're having to run kubelet/crio at 1.23 with the rest of the cluster at 1.24 so we can mostly unblock our upgrade, but that can only work as a stopgap, as eventually we will have to update kubelet/crio.

PaulFurtado commented 1 year ago

Are the fetched layer IDs stored in that boltdb database? If so, is it possible for this to be a data race?

Yes, it appears so. I was trying to debug "where could this bad state possibly be stored?", so I grepped through the storage directories to figure out where these IDs were coming from, and the only place on disk was blob-info-cache-v1.boltdb.

I used a random tool to dump the db to json. For the error:

Error: writing blob: adding layer with blob "sha256:34ec39b6e3f542082149b23b1d0481083a5ad896dc929138818214d6c4a57097": error creating layer with ID "a39bcc2ef8e5aec7af3980256ac406e71b87888d67d21e108cbe1d1b3dc6d419": Stat /usr/share/hubspot/containers/storage/overlay/662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8: no such file or directory

We can see that there is a mapping for sha256:34ec39b6e3f542082149b23b1d0481083a5ad896dc929138818214d6c4a57097 to its uncompressed digest and a reverse mapping too. There is also an entry for it in a knownLocations map.
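
If anyone wants to inspect such a boltdb file themselves, one option is the bbolt CLI (just one possibility, not necessarily the random tool mentioned above; path and bucket name below are placeholders):

```console
# install the bbolt CLI, then list the buckets and the keys in one bucket
go install go.etcd.io/bbolt/cmd/bbolt@latest
bbolt buckets /path/to/blob-info-cache-v1.boltdb
bbolt keys /path/to/blob-info-cache-v1.boltdb <bucket-name>
```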

That said, I blew away that cache file and attempted another pull with podman and it did not fix the problem so the bad state must not be in that file.

Digging a little further, the 662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8 ID appears to be stored in /var/lib/containers/storage/overlay-layers/layers.json:

    {
        "id": "662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8",
        "parent": "34bdf373ca69e38d30cf57c1ae2e66a71627aead1eb206aabef7efba3ec9bd62",
        "created": "2023-03-30T12:31:12.235362536Z",
        "compressed-diff-digest": "sha256:d842307651c36561618eddb5020df85d858c89036e79b45b4c74082835605790",
        "compressed-size": 5697001,
        "diff-digest": "sha256:48252b6983a41f51d71cb60d1517450ea59873113981ed660f6337af0c7ac9e8",
        "diff-size": 24644608,
        "compression": 2,
        "uidset": [
            0
        ],
        "gidset": [
            0
        ]
    }

If I manually remove that entry from layers.json, podman happily pulls the layer, so that's where our problematic state actually is.
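
For anyone else hitting this, the manual removal looks roughly like the following (a hedged sketch, assuming jq is installed and the systemd unit is named crio; keep a backup):

```console
# stop cri-o, back up the file, then drop the broken layer entry by ID
systemctl stop crio
cp /var/lib/containers/storage/overlay-layers/layers.json /root/layers.json.bak
jq '[ .[] | select(.id != "662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8") ]' \
  /root/layers.json.bak > /var/lib/containers/storage/overlay-layers/layers.json
systemctl start crio
```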

Looking at the code that manages that file, it does seem like there are plenty of opportunities for race conditions/locking bugs, for example: https://github.com/containers/storage/blob/0f3b7423dca78dec2bfd47fb793586dcf5350406/layers.go#L559-L588

It appears that code puts a lot of effort into avoiding re-reading and re-writing layers.json unnecessarily. If this bug doesn't occur on podman systems, that may be the answer here. As far as I know, the podman processes are stateless so every time a pull occurs, layers.json would be re-read, whereas in the cri-o case, cri-o could have some stale state in memory.

Perhaps a reproducer is to just have a bunch of processes use cri-o to pull and remove images that have shared layers in parallel.
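
Something along these lines, as an untested sketch (the image names are placeholders for images that share base layers):

```console
# repeatedly pull and remove two images with a common base, in parallel
for i in $(seq 1 20); do
  ( crictl pull registry.example.com/app-a:latest && crictl rmi registry.example.com/app-a:latest ) &
  ( crictl pull registry.example.com/app-b:latest && crictl rmi registry.example.com/app-b:latest ) &
done
wait
```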

joshcoughlan commented 1 year ago

I had this happen today. I removed the offending entry in layers.json and it appeared to start working again. After removing the entry and re-pulling the image I verified that the entry was back in layers.json.

I did drain the node and clean off all images before putting the node back in service just for good measure, so I don't really know if things would have gone sideways had I left it in production with active workloads.

To add to the discussion, we have had this happen several times on nodes running cri-o and it seems to be triggered when multiple pods that use the same image start at the (exact) same time on the same node. In this occurrence the image in question was 243MB. Pulling the image fresh on an identical node in the same datacenter took 6.3 seconds.

matthias50 commented 1 year ago

We also saw this happen in a daemonset which didn't share any layers with other pods on the node. However, this daemonset did have imagePullPolicy set to Always, so maybe re-pulling an image which already exists can somehow get you into this state as well.

evanfoster commented 1 year ago

I've been encountering this very frequently with CRI-O 1.24.4 and k8s v1.24.13. Is there information I could gather to assist with debugging?

TomSweeneyRedHat commented 1 year ago

Ping @mtrmac in case you've not seen this.