cri-o / cri-o

Open Container Initiative-based implementation of Kubernetes Container Runtime Interface
https://cri-o.io
Apache License 2.0

Upgrading from crio 1.18.4 to crio 1.22.3 #5772

Closed · phiremath closed this 2 years ago

phiremath commented 2 years ago

What happened?

Upgrading from crio 1.18.4 to crio 1.22.3. This is an intermittent issue. The crio config, logs, and stack trace files are attached.

Pods are stuck in ContainerCreating

skopeo copy to containers-storage hangs

crictl images hangs

crictl ps works
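
For reference, the hanging operations are of this shape (the image name and registry below are placeholders, not the exact ones used):

```console
# Copying an image straight into the node's containers-storage never returns.
$ skopeo copy docker://registry.example.com/myapp:1.0 containers-storage:registry.example.com/myapp:1.0

# Listing images through the CRI also hangs, while listing containers still works.
$ crictl images
$ crictl ps
```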

The crio log repeatedly shows errors similar to:

```
Mar 29 08:03:08 ins15-prod4-sn1.ins15-prod4.case.local crio[5680]: time="2022-03-29 08:03:08.920067417Z" level=info msg="runSandbox: deleting pod ID f5341ba9bba15c24f471fe3c786517bdbab573c461fdf4851c330afa95e3da29 from idIndex" id=a676d818-29e9-4ab7-affd-371ecc4b8c74 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Mar 29 08:03:08 ins15-prod4-sn1.ins15-prod4.case.local crio[5680]: time="2022-03-29 08:03:08.920131553Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_ns-7c65f97444-q6hvl_ns_9e2784c7-c9e1-4b76-a99c-bd9dd4eca410_0(f62c29cc52e51a7d0996f04dce0b0184953e400d98fab364afd61cae7e5c77e8): error removing pod ns_ns-7c65f97444-q6hvl from CNI network \"multus-cni\": Multus: [ns/ns-7c65f97444-q6hvl]: failed to get clusterNetwork/defaultNetworks: getNetDelegate: cannot find network: default" id=6556c4d4-c8f6-432a-aea6-6d6d1cbeb675 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Mar 29 09:13:52 ins15-prod4-sn1.ins15-prod4.case.local crio[5680]: time="2022-03-29 09:13:52.826866464Z" level=warning msg="error reserving pod name k8s_sfs-master-only-deployment-0_cisco-testapp_a51017af-fbe3-47e5-ac4f-fb606fc8e42e_0 for id 9f433faab1642bb65ceb0e36c5e9f03ea85f1cbc1517fa653384bde263b83f42: name is reserved"
Mar 29 09:13:52 ins15-prod4-sn1.ins15-prod4.case.local crio[5680]: time="2022-03-29 09:13:52.826946835Z" level=info msg="Creation of sandbox k8s_sfs-master-only-deployment-0_cisco-testapp_a51017af-fbe3-47e5-ac4f-fb606fc8e42e_0 not yet finished. Waiting up to 6m0s for it to finish" id=60e3f865-1802-4b47-8a19-7c747581a0ac name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
```

The crio stack trace shows several goroutines waiting, similar to the ones below:

```
goroutine 791171 [semacquire, 2111 minutes]:
sync.runtime_SemacquireMutex(0xc0005b2034, 0x459700, 0x1)
    runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0005b2030)
    sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
    sync/mutex.go:81
sync.(*RWMutex).Lock(0xc0005b2030)
    sync/rwmutex.go:111 +0x90
github.com/containers/storage/pkg/lockfile.(*lockfile).lock(0xc0001b4050, 0xc001000001)
    github.com/containers/storage@v1.34.1/pkg/lockfile/lockfile_unix.go:116 +0x285
github.com/containers/storage/pkg/lockfile.(*lockfile).Lock(0xc0001b4050)
    github.com/containers/storage@v1.34.1/pkg/lockfile/lockfile_unix.go:149 +0x3d
github.com/containers/storage.(*store).LayerStore(0xc0005b4000, 0x0, 0x0, 0x0, 0x0)
    github.com/containers/storage@v1.34.1/store.go:848 +0x6f
github.com/containers/storage.(*store).Layer(0xc0005b4000, 0xc00048ec80, 0x40, 0x0, 0x0, 0x0)
    github.com/containers/storage@v1.34.1/store.go:3119 +0x57
github.com/containers/image/v5/storage.(*storageImageSource).getSize(0xc001bb0300, 0xc000488980, 0xc0002ac700, 0x25e3768)
    github.com/containers/image/v5@v5.15.2/storage/storage_image.go:1307 +0x10c
github.com/containers/image/v5/storage.newImage(0x25cf798, 0xc000488980, 0xc0002ac700, 0x2606808, 0xc0005b4000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    github.com/containers/image/v5@v5.15.2/storage/storage_image.go:1344 +0x127
github.com/containers/image/v5/storage.storageReference.NewImage(...)
    github.com/containers/image/v5@v5.15.2/storage/storage_reference.go:263
github.com/cri-o/cri-o/internal/storage.(*imageService).buildImageCacheItem(0xc0005281c0, 0xc0002ac700, 0x25e94b8, 0xc00112c900, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    github.com/cri-o/cri-o/internal/storage/image.go:217 +0xaa
github.com/cri-o/cri-o/internal/storage.(*imageService).appendCachedResult(0xc0005281c0, 0xc0002ac700, 0x25e94b8, 0xc00112c900, 0xc000c439f0, 0xc001a88000, 0x11, 0x20, 0xc001519a10, 0xc001a88000, ...)
    github.com/cri-o/cri-o/internal/storage/image.go:277 +0x3d6
github.com/cri-o/cri-o/internal/storage.(*imageService).ListImages(0xc0005281c0, 0xc0002ac700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    github.com/cri-o/cri-o/internal/storage/image.go:321 +0x43e
github.com/cri-o/cri-o/server.(*Server).ListImages(0xc0004bc600, 0x25cf840, 0xc0015188a0, 0xc0007d3880, 0x2046240, 0x221b7c0, 0x25d8398)
    github.com/cri-o/cri-o/server/image_list.go:19 +0x87
github.com/cri-o/cri-o/server/cri/v1alpha2.(*service).ListImages(0xc0005b8058, 0x25cf840, 0xc0015188a0, 0xc000f0c750, 0xc0005b8058, 0x1, 0x1)
    github.com/cri-o/cri-o/server/cri/v1alpha2/rpc_list_images.go:24 +0xa5
k8s.io/cri-api/pkg/apis/runtime/v1alpha2._ImageService_ListImages_Handler.func1(0x25cf840, 0xc0015188a0, 0x2185a20, 0xc000f0c750, 0xc0007d3950, 0x1, 0x1, 0x0)
    k8s.io/cri-api@v0.22.1/pkg/apis/runtime/v1alpha2/api.pb.go:8649 +0x89
github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1(0x25cf840, 0xc001518840, 0x2185a20, 0xc000f0c750, 0xc000eaca80, 0xc000e2eff0, 0x4dc806, 0x6242d9f2, 0x22b83796, 0x6eeb0fe641c)
    github.com/cri-o/cri-o/internal/log/interceptors.go:56 +0x116
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x25cf840, 0xc001518840, 0x2185a20, 0xc000f0c750, 0xc000eacac0, 0x20, 0x20, 0x20c7840)
    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1(0x25cf840, 0xc001518840, 0x2185a20, 0xc000f0c750, 0xc000eaca80, 0xc000eacaa0, 0x1de99da, 0x20c7840, 0xc000eacac0, 0xc000eaca80)
    github.com/cri-o/cri-o/server/metrics/interceptors.go:24 +0xc6
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x25cf840, 0xc001518840, 0x2185a20, 0xc000f0c750, 0xc0009b1800, 0x0, 0xc000626b30, 0x416f58)
    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x25cf840, 0xc001518840, 0x2185a20, 0xc000f0c750, 0xc000eaca80, 0xc000e2eff0, 0xc000626ba0, 0x523666, 0x2110480, 0xc001518840)
    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
k8s.io/cri-api/pkg/apis/runtime/v1alpha2._ImageService_ListImages_Handler(0x221b7c0, 0xc0005b8058, 0x25cf840, 0xc001518840, 0xc000f17bc0, 0xc0009be3c0, 0x25cf840, 0xc001518840, 0x0, 0x0)
    k8s.io/cri-api@v0.22.1/pkg/apis/runtime/v1alpha2/api.pb.go:8651 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0002348c0, 0x25e9758, 0xc0000ff200, 0xc001a950e0, 0xc00047a240, 0x35a9a00, 0x0, 0x0, 0x0)
    google.golang.org/grpc@v1.40.0/server.go:1297 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc0002348c0, 0x25e9758, 0xc0000ff200, 0xc001a950e0, 0x0)
    google.golang.org/grpc@v1.40.0/server.go:1626 +0xd05
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00089e280, 0xc0002348c0, 0x25e9758, 0xc0000ff200, 0xc001a950e0)
    google.golang.org/grpc@v1.40.0/server.go:941 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
    google.golang.org/grpc@v1.40.0/server.go:939 +0x1fd

goroutine 3626366 [semacquire, 1667 minutes]:
sync.runtime_SemacquireMutex(0xc00000cce4, 0x0, 0x1)
    runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00000cce0)
    sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
    sync/mutex.go:81
sync.(*RWMutex).Lock(0xc00000cce0)
    sync/rwmutex.go:111 +0x90
github.com/cri-o/cri-o/server.(*Server).stopPodSandbox(0xc0004bc600, 0x25cf840, 0xc0022d7d70, 0xc00000cb40, 0x0, 0x0)
    github.com/cri-o/cri-o/server/sandbox_stop_linux.go:20 +0x70
github.com/cri-o/cri-o/server.(*Server).StopPodSandbox(0xc0004bc600, 0x25cf840, 0xc0022d7d70, 0xc002def878, 0x2153c00, 0x221b7c0)
    github.com/cri-o/cri-o/server/sandbox_stop.go:34 +0x3ee
github.com/cri-o/cri-o/server/cri/v1alpha2.(*service).StopPodSandbox(0xc0005b8058, 0x25cf840, 0xc0022d7d70, 0xc002b238a8, 0xc0005b8058, 0x1, 0x1)
    github.com/cri-o/cri-o/server/cri/v1alpha2/rpc_stop_pod_sandbox.go:14 +0x6e
k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_StopPodSandbox_Handler.func1(0x25cf840, 0xc0022d7d70, 0x21871e0, 0xc002b238a8, 0xc002def950, 0x1, 0x1, 0x42)
    k8s.io/cri-api@v0.22.1/pkg/apis/runtime/v1alpha2/api.pb.go:8077 +0x8b
github.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1(0x25cf840, 0xc0022d7d10, 0x21871e0, 0xc002b238a8, 0xc002dc4e00, 0xc002b238c0, 0x4dc806, 0x624341f4, 0x3dcfc58, 0x1f25eb68bccf)
    github.com/cri-o/cri-o/internal/log/interceptors.go:56 +0x116
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x25cf840, 0xc0022d7d10, 0x21871e0, 0xc002b238a8, 0xc002dc4e40, 0x20, 0x20, 0x20c7840)
    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1(0x25cf840, 0xc0022d7d10, 0x21871e0, 0xc002b238a8, 0xc002dc4e00, 0xc002dc4e20, 0x1de99da, 0x20c7840, 0xc002dc4e40, 0xc002dc4e00)
    github.com/cri-o/cri-o/server/metrics/interceptors.go:24 +0xc6
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x25cf840, 0xc0022d7d10, 0x21871e0, 0xc002b238a8, 0xc000100000, 0x0, 0xc00062bb30, 0x416f58)
    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x25cf840, 0xc0022d7d10, 0x21871e0, 0xc002b238a8, 0xc002dc4e00, 0xc002b238c0, 0xc00062bba0, 0x523666, 0x2110480, 0xc0022d7d10)
    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_StopPodSandbox_Handler(0x221b7c0, 0xc0005b8058, 0x25cf840, 0xc0022d7d10, 0xc000deb4a0, 0xc0009be3c0, 0x25cf840, 0xc0022d7d10, 0xc000ea2000, 0x42)
    k8s.io/cri-api@v0.22.1/pkg/apis/runtime/v1alpha2/api.pb.go:8079 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0002348c0, 0x25e9758, 0xc00058db00, 0xc002334900, 0xc00047a1b0, 0x35b2190, 0x0, 0x0, 0x0)
    google.golang.org/grpc@v1.40.0/server.go:1297 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc0002348c0, 0x25e9758, 0xc00058db00, 0xc002334900, 0x0)
    google.golang.org/grpc@v1.40.0/server.go:1626 +0xd05
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00037a370, 0xc0002348c0, 0x25e9758, 0xc00058db00, 0xc002334900)
    google.golang.org/grpc@v1.40.0/server.go:941 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
    google.golang.org/grpc@v1.40.0/server.go:939 +0x1fd

goroutine 10469 [semacquire, 2233 minutes]:
sync.runtime_SemacquireMutex(0xc0010daea4, 0x0, 0x1)
    runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0010daea0)
    sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
    sync/mutex.go:81
github.com/containers/image/v5/storage.(*storageImageDestination).tryReusingBlobWithSrcRef(0xc0010dadc0, 0x25cf798, 0xc000488980, 0xc0010d47d0, 0x47, 0xf6e6be, 0x0, 0x0, 0x0, 0x0, ...)
    github.com/containers/image/v5@v5.15.2/storage/storage_image.go:562 +0x734
github.com/containers/image/v5/storage.(*storageImageDestination).TryReusingBlobWithOptions(0xc0010dadc0, 0x25cf798, 0xc000488980, 0xc0010d47d0, 0x47, 0xf6e6be, 0x0, 0x0, 0x0, 0x0, ...)
    github.com/containers/image/v5@v5.15.2/storage/storage_image.go:550 +0x11c
github.com/containers/image/v5/copy.(*imageCopier).copyLayer(0xc0018915c0, 0x25cf798, 0xc000488980, 0xc0010d47d0, 0x47, 0xf6e6be, 0x0, 0x0, 0x0, 0x0, ...)
    github.com/containers/image/v5@v5.15.2/copy/copy.go:1238 +0x12e2
github.com/containers/image/v5/copy.(*imageCopier).copyLayers.func1(0x3, 0xc0010d47d0, 0x47, 0xf6e6be, 0x0, 0x0, 0x0, 0x0, 0xc000c5d590, 0x2b, ...)
    github.com/containers/image/v5@v5.15.2/copy/copy.go:936 +0x485
created by github.com/containers/image/v5/copy.(*imageCopier).copyLayers.func2
    github.com/containers/image/v5@v5.15.2/copy/copy.go:973 +0x2ac
```

Pods show errors similar to the one below:

```
Events:
  Type     Reason                  Age  From     Message
  ----     ------                  ---- ----     -------
  Warning  FailedCreatePodSandBox  29s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: waited too long for request to timeout or sandbox k8s_porttest-ins15-prod4-sn1_default_ae9a915b8b9c79dfffb23921f0bd5075_0 to be created: error reserving pod name k8s_porttest-ins15-prod4-sn1_default_ae9a915b8b9c79dfffb23921f0bd5075_0 for id b1518cdc1c6e2637a522c0ed63080281524f2c10959761968cf1482dd4f3fec1: name is reserved
```
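For context, these events were presumably captured with `kubectl describe` on the affected pod; the pod and namespace below are read off the sandbox name in the event and may not match the exact invocation used:

```console
$ kubectl describe pod porttest-ins15-prod4-sn1 -n default
```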

Crio Log: crio-prod4-sn1.log

Crio stacks: crio-goroutine-stacks-2022-03-30T211654Z-prod4-sn1.log

Crio config: crio-1.22.3-nd.conf.txt

What did you expect to happen?

skopeo copy to work, crictl images to show images, and pod creation to succeed.

How can we reproduce it (as minimally and precisely as possible)?

The bug is intermittent. It occurs sometimes during the upgrade from crio 1.18.4 to 1.22.3 and sometimes during a node reboot with crio 1.22.3.

Anything else we need to know?

crio 1.22.3 has been modified to not use the hostport manager (ref commit).

CRI-O and Kubernetes version

```console
$ crio --version
crio version 1.22.3
Version:          1.22.3
GitCommit:        4b1f6acb6c185ad3d2ff59fc9e9534bac4428c5f
GitTreeState:     clean
BuildDate:        2022-03-22T04:27:04Z
GoVersion:        go1.16.13
Compiler:         gc
Platform:         linux/amd64
Linkmode:         dynamic
BuildTags:        containers_image_openpgp, containers_image_ostree_stub, seccomp, selinux
SeccompEnabled:   true
AppArmorEnabled:  false
```

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:32:32Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
```

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

$ uname -a
Linux ins15-prod4-sn1.ins15-prod4.case.local 5.10.103stock-1 #1 SMP Fri Mar 4 22:23:09 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
```

Additional environment details (AWS, VirtualBox, physical, etc.)

haircommander commented 2 years ago

Eek, that's a weird one. Seems like a deadlock in image storage. I've filtered the goroutine stacks down to the relevant ones in this file: crio-goroutine-stacks-2022-03-30T211654Z-prod4-sn1.log

@nalind @giuseppe @vrothberg @mtrmac can you PTAL?

Note: I wouldn't expect most users to go from 1.18 right to 1.22. Did you clear the image store between upgrades? That may help stop it from happening intermittently
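
In case it helps anyone hitting this: a rough sketch of clearing the image store between such upgrades might look like the following. It assumes the node can tolerate re-pulling every image and that storage lives at the default `/var/lib/containers/storage`; adjust if your crio.conf points elsewhere.

```console
# Drain the node so no workloads are scheduled on it while storage is wiped.
$ kubectl drain <node> --ignore-daemonsets

# Stop kubelet and CRI-O before touching the storage directory.
$ systemctl stop kubelet crio

# Remove the container/image storage; it is recreated and images are re-pulled on demand.
$ rm -rf /var/lib/containers/storage/*

# Bring the runtime and kubelet back, then return the node to service.
$ systemctl start crio kubelet
$ kubectl uncordon <node>
```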

nalind commented 2 years ago

There was a lock that wasn't being released when we hit an error while reloading layer information that had been modified by another process (@giuseppe fixed that for storage 1.38), but the stacks also show the process waiting for locks on files that are presumably held by another process.
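
To illustrate the bug class being described, here is a minimal sketch of the pattern (not the actual containers/storage code): a reload that returns early on error without releasing the lock leaves every later caller blocked forever, and the fix is to release the lock on every path.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type layerStore struct {
	mu sync.Mutex
}

// reloadBuggy models the problematic shape: if load fails, the early return
// skips the Unlock, and every subsequent caller blocks on mu indefinitely.
func (s *layerStore) reloadBuggy(load func() error) error {
	s.mu.Lock()
	if err := load(); err != nil {
		return err // bug: mu is still held here
	}
	s.mu.Unlock()
	return nil
}

// reloadFixed releases the lock on all paths by deferring the Unlock.
func (s *layerStore) reloadFixed(load func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return load()
}

func main() {
	s := &layerStore{}
	err := s.reloadFixed(func() error {
		return errors.New("layers.json changed by another process")
	})
	fmt.Println(err) // the error propagates, but the lock is released
}
```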

mtrmac commented 2 years ago

Looking at that log file, several goroutines also seem to be blocked on c/image/storage.storageImageDestination.lock, and I can’t find any goroutine that is expected to hold the lock. (I also can’t find any code path that fails to release the lock, assuming there isn’t any record of a Go panic; is there one?)

Either way, if other processes hang, that’s a c/storage lock, not the storageImageDestination.lock.

telmich commented 2 years ago

We are seeing more and more of these errors with crio and I was wondering whether this is related?

time="2022-05-24 12:25:46.723941154+02:00" level=info msg="Running pod sandbox: rook-ceph/rook-ceph-crashcollector-pruner-27555840-xrwtz/POD" id=29b983cc-c344-4c11-9305-c6ca7fa6c07a name=/runtime.v1.RuntimeService/RunPodSandbox
time="2022-05-24 12:25:46.724051699+02:00" level=warning msg="error reserving pod name k8s_rook-ceph-crashcollector-pruner-27555840-xrwtz_rook-ceph_c18c65dd-0709-4870-9a83-7848b245c86e_0 for id 0f06434be6d9e5decb1cb45a5ff89e1c58bb4c9466503af656a1e532e560b557: name is reserved"
time="2022-05-24 12:25:46.724114198+02:00" level=info msg="Creation of sandbox k8s_rook-ceph-crashcollector-pruner-27555840-xrwtz_rook-ceph_c18c65dd-0709-4870-9a83-7848b245c86e_0 not yet finished. Waiting up to 6m0s for it to finish" id=29b983cc-c344-4c11-9305-c6ca7fa6c07a name=/runtime.v1.RuntimeService/RunPodSandbox

Symptom: pods don't start anymore; they stay in the ContainerCreating state (from the Kubernetes point of view).

haircommander commented 2 years ago

Could you get me the crio goroutine stacks as described here, @telmich?
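
(The linked instructions aren't reproduced here; if memory serves, CRI-O writes its goroutine stacks to a file under /tmp when the daemon receives SIGUSR1, which would match the crio-goroutine-stacks-*.log filename attached above. Treat the exact path and mechanism as an assumption and follow the linked doc.)

```console
# Ask the running crio daemon to dump its goroutine stacks, then look for the output file.
$ kill -USR1 $(pidof crio)
$ ls /tmp/crio-goroutine-stacks-*.log
```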

saschagrunert commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 2 years ago

Closing this issue since it had no activity in the past 90 days.