Eek, that's a weird one. Seems like a deadlock in image storage. I've filtered many of the goroutine stacks out into this file: crio-goroutine-stacks-2022-03-30T211654Z-prod4-sn1.log
@nalind @giuseppe @vrothberg @mtrmac can you PTAL?
Note: I wouldn't expect most users to go from 1.18 right to 1.22. Did you clear the image store between upgrades? That may help stop it from happening intermittently.
There was a lock file that wasn't being released when we encountered an error while reloading layer information that had been modified by another process; @giuseppe fixed that for storage 1.38. However, the stacks also show the process waiting for locks on files that are presumably held by another process.
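(As an editorial illustration of the bug class described above: a file lock acquired before reloading layer data and not released on the error path. This is a hypothetical sketch using a plain flock(2) file lock, not the actual c/storage lockfile code; loadLayerJSON and the lock paths are invented for illustration.)

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// loadLayerJSON stands in for re-reading layer metadata that another
// process may have modified out from under us; it is hypothetical.
func loadLayerJSON() error {
	return fmt.Errorf("layers.json changed underneath us")
}

// reloadLayersBuggy models the pre-fix behavior: the early error return
// leaves the lock held (the fd is never closed), so every other process
// trying to take the lock blocks indefinitely.
func reloadLayersBuggy(lockPath string) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return err
	}
	if err := loadLayerJSON(); err != nil {
		return err // BUG: lock never released, fd leaked
	}
	syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return f.Close()
}

// reloadLayersFixed releases the lock on every path via defer.
func reloadLayersFixed(lockPath string) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return loadLayerJSON()
}

func main() {
	fmt.Println(reloadLayersBuggy("/tmp/example.lock"))
	// Any further attempt to take /tmp/example.lock now blocks until this
	// process exits, which matches the cross-process hang described above.
	fmt.Println(reloadLayersFixed("/tmp/example-fixed.lock"))
}
```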
Looking at that log file, several goroutines also seem to be blocked on c/image/storage.storageImageDestination.lock, and I can't find any goroutine that is expected to hold the lock. (I also can't find any code path that fails to release the lock, short of a Go panic; there isn't any record of a panic, is there?)
Either way, if other processes hang, that's a c/storage lock, not the storageImageDestination.lock.
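(A minimal, hypothetical Go sketch of the failure mode suggested above: a panic between Lock and Unlock that is recovered further up the stack leaves the mutex held forever, so later goroutines block on it with no visible holder in the stack dump. This is illustrative only, not the actual c/image code; the dest type and its methods are invented.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// dest stands in for a type with an internal mutex, like
// storageImageDestination; the name and methods are hypothetical.
type dest struct {
	lock sync.Mutex
}

func (d *dest) putBlobBuggy() {
	d.lock.Lock()
	mayPanic() // if this panics, the Unlock below never runs
	d.lock.Unlock()
}

func (d *dest) putBlobSafe() {
	d.lock.Lock()
	defer d.lock.Unlock() // released even if mayPanic panics
	mayPanic()
}

func mayPanic() { panic("simulated failure while holding the lock") }

func main() {
	d := &dest{}
	func() {
		defer func() { _ = recover() }() // a recover higher up hides the panic
		d.putBlobBuggy()
	}()

	// The mutex is now held by no visible goroutine; this one blocks forever.
	acquired := make(chan struct{})
	go func() {
		d.lock.Lock()
		close(acquired)
	}()
	select {
	case <-acquired:
		fmt.Println("lock acquired")
	case <-time.After(time.Second):
		fmt.Println("still blocked on the leaked lock, as in the stack dump")
	}
}
```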
We are seeing more and more of these errors with crio, and I was wondering whether they are related to this issue:
time="2022-05-24 12:25:46.723941154+02:00" level=info msg="Running pod sandbox: rook-ceph/rook-ceph-crashcollector-pruner-27555840-xrwtz/POD" id=29b983cc-c344-4c11-9305-c6ca7fa6c07a name=/runtime.v1.RuntimeService/RunPodSandbox
time="2022-05-24 12:25:46.724051699+02:00" level=warning msg="error reserving pod name k8s_rook-ceph-crashcollector-pruner-27555840-xrwtz_rook-ceph_c18c65dd-0709-4870-9a83-7848b245c86e_0 for id 0f06434be6d9e5decb1cb45a5ff89e1c58bb4c9466503af656a1e532e560b557: name is reserved"
time="2022-05-24 12:25:46.724114198+02:00" level=info msg="Creation of sandbox k8s_rook-ceph-crashcollector-pruner-27555840-xrwtz_rook-ceph_c18c65dd-0709-4870-9a83-7848b245c86e_0 not yet finished. Waiting up to 6m0s for it to finish" id=29b983cc-c344-4c11-9305-c6ca7fa6c07a name=/runtime.v1.RuntimeService/RunPodSandbox
Symptom: pods don't start anymore; they stay in the ContainerCreating state (Kubernetes-side).
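(For context on the "name is reserved" / "Waiting up to 6m0s" lines: CRI-O reserves pod and container names during creation, and a kubelet retry for the same name has to wait for the first attempt to finish. Below is a rough, hypothetical sketch of that reserve-and-wait pattern; the registrar type, names, and timeout handling are invented for illustration and are not CRI-O's actual code. If the first creation is deadlocked, the wait never succeeds and the pod stays in ContainerCreating.)

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errNameReserved = errors.New("name is reserved")

type registrar struct {
	mu    sync.Mutex
	names map[string]string // name -> id that reserved it
}

func (r *registrar) reserve(name, id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if owner, ok := r.names[name]; ok && owner != id {
		return errNameReserved
	}
	r.names[name] = id
	return nil
}

func (r *registrar) release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.names, name)
}

// reserveWithWait retries until the name frees up or the deadline passes,
// mirroring the "Waiting up to 6m0s for it to finish" log line.
func (r *registrar) reserveWithWait(name, id string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := r.reserve(name, id)
		if err == nil || !errors.Is(err, errNameReserved) {
			return err
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("reserving %s for %s: %w", name, id, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	r := &registrar{names: map[string]string{}}
	_ = r.reserve("k8s_pod_example", "id-1") // first attempt holds the name
	err := r.reserveWithWait("k8s_pod_example", "id-2", time.Second)
	fmt.Println(err) // retry times out while the first attempt is stuck
	r.release("k8s_pod_example")
}
```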
Could you get me the crio goroutine stacks as described here, @telmich?
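(CRI-O writes its goroutine stacks to a log file when signaled with SIGUSR1, which appears to be how the crio-goroutine-stacks-*.log attachments above were produced. Below is a minimal sketch of how a Go daemon can implement such a dump handler using only the standard library; it is illustrative, not CRI-O's actual implementation, and the output path is made up.)

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

// dumpStacksOnSignal writes the stacks of all goroutines to a log file
// whenever the process receives SIGUSR1.
func dumpStacksOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			// Grow the buffer until runtime.Stack fits every goroutine.
			buf := make([]byte, 1<<20)
			for {
				n := runtime.Stack(buf, true) // true = all goroutines
				if n < len(buf) {
					buf = buf[:n]
					break
				}
				buf = make([]byte, 2*len(buf))
			}
			name := fmt.Sprintf("/tmp/goroutine-stacks-%s.log",
				time.Now().UTC().Format("2006-01-02T150405Z"))
			_ = os.WriteFile(name, buf, 0o644)
		}
	}()
}

func main() {
	dumpStacksOnSignal()
	select {} // stand-in for the daemon's main loop; `kill -USR1 <pid>` triggers a dump
}
```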
A friendly reminder that this issue had no activity for 30 days.
Closing this issue since it had no activity in the past 90 days.
What happened?
Upgraded from crio 1.18.4 to crio 1.22.3. The issue is intermittent. Crio config, logs, and stack files are attached.
Pods are stuck in ContainerCreating.
skopeo copy to containers-storage hangs.
crictl images hangs.
crictl ps works.
The crio log shows several repeated errors similar to:
The crio stack trace shows several goroutines waiting, similar to the one below:
Pods show an error similar to the one below:
Crio Log: crio-prod4-sn1.log
Crio stacks: crio-goroutine-stacks-2022-03-30T211654Z-prod4-sn1.log
Crio config: crio-1.22.3-nd.conf.txt
What did you expect to happen?
skopeo copy to work.
crictl images to show images.
Pod creation to succeed.
How can we reproduce it (as minimally and precisely as possible)?
The bug is intermittent. It occurs sometimes during the upgrade from crio 1.18.4 to 1.22.3 and sometimes during a node reboot with crio 1.22.3.
Anything else we need to know?
crio 1.22.3 has been modified to not use the hostport manager; ref commit.
CRI-O and Kubernetes version
OS version
Additional environment details (AWS, VirtualBox, physical, etc.)