Closed — Rakshith-R closed this issue 1 year ago
Cephcsi uses getImageInfo() in several places (see https://github.com/ceph/ceph-csi/search?q=getImageInfo%28%29). If it returns a false-positive ErrImageNotFound while the image is actually still present (but undergoing flattening), stale images are left behind in the Ceph cluster.

Operations that may leave stale images: create/delete snapshot, create PVC from a data source (PVC or snapshot), and delete a PVC that was created from another data source.
Note: cephcsi adds a flatten task for an image only when the snapshot / image clone depth limit is hit, so this combination occurring together may not be common.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This is waiting for a fix at ceph layer.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
from #2754
image list:
csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f
csi-vol-8ae3a235-721d-11ec-a974-52723c9f3c2f
csi-vol-8ae3a235-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-8b38ff57-721d-11ec-a974-52723c9f3c2f
csi-vol-8b38ff57-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-8b579fd2-721d-11ec-a974-52723c9f3c2f
csi-vol-8b579fd2-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-8bb95bc5-721d-11ec-a974-52723c9f3c2f
csi-vol-8bb95bc5-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-8c0c4d7d-721d-11ec-a974-52723c9f3c2f
csi-vol-8c0c4d7d-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-8c188f09-721d-11ec-a974-52723c9f3c2f
csi-vol-8c188f09-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-9619a3a9-721d-11ec-a974-52723c9f3c2f
csi-vol-9619a3a9-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-9622c1e0-721d-11ec-a974-52723c9f3c2f
csi-vol-9622c1e0-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-96272a5d-721d-11ec-a974-52723c9f3c2f
csi-vol-96272a5d-721d-11ec-a974-52723c9f3c2f-temp
csi-vol-9c391ae8-721d-11ec-a974-52723c9f3c2f
csi-vol-9c391ae8-721d-11ec-a974-52723c9f3c2f-temp
stray image: csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp
[rakshith@fedora kubernetes-csi-addons]$ cat log.txt | grep "ID: 152 Req-ID"
I0110 13:59:35.234075 1 utils.go:191] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 GRPC call: /csi.v1.Controller/CreateVolume
I0110 13:59:35.234412 1 utils.go:195] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.rbd.csi.ceph.com/region":"testregion","topology.rbd.csi.ceph.com/zone":"testzone"}}],"requisite":[{"segments":{"topology.rbd.csi.ceph.com/region":"testregion","topology.rbd.csi.ceph.com/zone":"testzone"}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36","parameters":{"clusterID":"c99878bd-af0b-4010-bebd-c71cd53e425d","csi.storage.k8s.io/pv/name":"pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36","csi.storage.k8s.io/pvc/name":"rbd-13182","csi.storage.k8s.io/pvc/namespace":"rbd-1318","dataPool":"ec-pool","imageFeatures":"layering","pool":"replicapool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["discard"]}},"access_mode":{"mode":1}}],"volume_content_source":{"Type":{"Volume":{"volume_id":"0001-0024-c99878bd-af0b-4010-bebd-c71cd53e425d-0000000000000002-7f6b1b8d-721d-11ec-a974-52723c9f3c2f"}}}}
I0110 13:59:35.234747 1 rbd_util.go:1296] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0110 13:59:35.236069 1 omap.go:87] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 got omap values: (pool="replicapool", namespace="", name="csi.volume.7f6b1b8d-721d-11ec-a974-52723c9f3c2f"): map[csi.imageid:11dde88bffdd csi.imagename:csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f csi.volname:pvc-f99fec9c-8c83-458c-80f8-4d33e2b5ad8d csi.volume.owner:rbd-1318]
I0110 13:59:35.253669 1 omap.go:87] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 got omap values: (pool="replicapool", namespace="", name="csi.volumes.default"): map[]
I0110 13:59:35.287632 1 rbd_util.go:816] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 clone depth is (0), configured softlimit (2) and hardlimit (6) for replicapool/csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f
I0110 13:59:35.308288 1 omap.go:155] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 set omap keys (pool="replicapool", namespace="", name="csi.volumes.default"): map[csi.volume.pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36:8afbd20e-721d-11ec-a974-52723c9f3c2f])
I0110 13:59:35.311266 1 omap.go:155] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 set omap keys (pool="replicapool", namespace="", name="csi.volume.8afbd20e-721d-11ec-a974-52723c9f3c2f"): map[csi.imagename:csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f csi.volname:pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 csi.volume.owner:rbd-1318])
I0110 13:59:35.311294 1 rbd_journal.go:482] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 generated Volume ID (0001-0024-c99878bd-af0b-4010-bebd-c71cd53e425d-0000000000000002-8afbd20e-721d-11ec-a974-52723c9f3c2f) and image name (csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f) for request name (pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36)
I0110 13:59:35.311503 1 rbd_util.go:1375] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 rbd: snap create replicapool/@csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp using mon
I0110 13:59:35.930554 1 rbd_util.go:1434] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 rbd: clone replicapool/csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f@csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp replicapool/csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp (features: [layering deep-flatten]) using mon rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789
I0110 13:59:36.049488 1 rbd_util.go:1388] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 rbd: snap rm replicapool/csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f@csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp using mon
I0110 13:59:36.157205 1 rbd_util.go:816] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 clone depth is (1), configured softlimit (4) and hardlimit (8) for replicapool/csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp
I0110 13:59:36.157242 1 rbd_util.go:1375] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 rbd: snap create replicapool/@csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f using mon
I0110 13:59:37.043532 1 rbd_util.go:1434] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 rbd: clone replicapool/csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp@csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f replicapool/csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f (features: [layering]) using mon rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789, data pool ec-pool
E0110 13:59:37.109442 1 snapshot.go:43] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 failed to clone rbd image csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f from snapshot csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f: failed to create rbd clone: RBD image not found
I0110 13:59:37.109489 1 rbd_util.go:1388] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 rbd: snap rm replicapool/csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f-temp@csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f using mon
E0110 13:59:40.654609 1 clone.go:158] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 failed to get volume id replicapool/csi-vol-8afbd20e-721d-11ec-a974-52723c9f3c2f: image not found: RBD image not found
I0110 13:59:40.678940 1 omap.go:123] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 removed omap keys (pool="replicapool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36]
E0110 13:59:40.679011 1 utils.go:200] ID: 152 Req-ID: pvc-d9153236-16a8-447f-bd2f-b2fa6c8f5d36 GRPC error: image not found: RBD image not found
Since the received error is "RBD image not found", no cleanup of the temp clone is done.
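The issue argues cleanup should run even when the clone step fails with ErrImageNotFound, since that error can be a false positive mid-flatten. A minimal sketch of idempotent cleanup (the `fakePool` type and names are hypothetical, standing in for the real pool operations):

```go
package main

import (
	"errors"
	"fmt"
)

var ErrImageNotFound = errors.New("RBD image not found")

// fakePool stands in for the ceph pool; illustration only.
type fakePool map[string]bool

func (p fakePool) deleteImage(name string) error {
	if !p[name] {
		return ErrImageNotFound
	}
	delete(p, name)
	return nil
}

// cleanupTempClone removes the -temp clone, treating ErrImageNotFound on
// delete as success so the cleanup is safe to attempt unconditionally,
// even after a clone step that itself failed with ErrImageNotFound.
func cleanupTempClone(pool fakePool, tempImage string) error {
	if err := pool.deleteImage(tempImage); err != nil && !errors.Is(err, ErrImageNotFound) {
		return err
	}
	return nil
}

func main() {
	pool := fakePool{"csi-vol-xxxx-temp": true}
	fmt.Println(cleanupTempClone(pool, "csi-vol-xxxx-temp")) // removes the temp clone
	fmt.Println(cleanupTempClone(pool, "csi-vol-xxxx-temp")) // already gone: still nil
}
```

Making the cleanup idempotent is what lets it run on every error path without needing to know whether the earlier not-found was genuine.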
The image snapshot was being flattened because of the other PersistentVolumeClaims being provisioned from the same source (the snapshot limit had been hit):
Jan 10 14:00:36.031: INFO: At 2022-01-10 13:59:46 +0000 GMT - event for rbd-13188: {rbd.csi.ceph.com_csi-rbdplugin-provisioner-77698b6f6d-s4hc5_96ef1b57-52ed-402d-b7c1-08c9ff88927c } ProvisioningFailed: failed to provision volume with StorageClass "csi-rbd-sc": rpc error: code = ResourceExhausted desc = rbd image replicapool/csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f has 5 snapshots
Jan 10 14:00:36.031: INFO: At 2022-01-10 13:59:49 +0000 GMT - event for rbd-13182: {rbd.csi.ceph.com_csi-rbdplugin-provisioner-77698b6f6d-s4hc5_96ef1b57-52ed-402d-b7c1-08c9ff88927c } ProvisioningFailed: failed to provision volume with StorageClass "csi-rbd-sc": rpc error: code = ResourceExhausted desc = rbd image replicapool/csi-vol-7f6b1b8d-721d-11ec-a974-52723c9f3c2f has 4 snapshots
cc @ceph/ceph-csi-contributors
Ceph tracker : https://tracker.ceph.com/issues/52810
moving out of 3.6, depends on Ceph fix
https://tracker.ceph.com/issues/52810
This has got a fix in ceph master; quincy: https://tracker.ceph.com/issues/52810, pacific PR: https://github.com/ceph/ceph/pull/47995. The next quincy release will have this fix.
The latest quincy release has the fix, and cephcsi v3.7.2 is built on that image. Closing this issue as done.
https://github.com/ceph/ceph-csi/blob/cd2e25c290a642154c25c4bf42e739f39c1d51bd/e2e/rbd.go#L88 <- not sure but this line is marked as "set to 10 when #2327 is fixed". Should we set it to 10?!
Thanks for noticing. That was done to avoid hitting this issue in ci.
I think it is a good idea to increase the count to a higher number at one of the places where that variable is used. Increasing it at all places may unnecessarily add too much delay to the CI.
@Spacefish, do you want to send a PR that introduces a new variable set to 10 and uses it in one of the places where the current default is used?
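The suggestion above could look something like the sketch below; the constant names are hypothetical, not the actual identifiers in e2e/rbd.go.

```go
package main

import "fmt"

// The current e2e default is kept low to avoid hitting the flatten race in
// CI. The idea is to add a second, higher constant and use it in just one
// targeted test case; names here are illustrative only.
const (
	defaultCloneCount = 3  // used in most e2e cases to keep CI fast
	deepCloneCount    = 10 // used in one case now that the Ceph fix landed
)

// cloneNames builds the per-clone image names a test would iterate over.
func cloneNames(prefix string, n int) []string {
	names := make([]string, 0, n)
	for i := 0; i < n; i++ {
		names = append(names, fmt.Sprintf("%s-%d", prefix, i))
	}
	return names
}

func main() {
	fmt.Println(len(cloneNames("csi-vol", deepCloneCount)))
}
```

Using the higher count in only one place exercises the deep-clone path without multiplying the runtime of every test.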
Running getImageInfo() may return ErrImageNotFound while an rbd image is undergoing a flatten operation: https://github.com/ceph/ceph-csi/blob/43f753760b0800beb023de066ff33e850fb4189e/internal/rbd/rbd_util.go#L1292

We need to attempt cleanup when we encounter ErrImageNotFound, since the error may be unreliable in such cases.

Ceph exec logs:
```
[root@rook-ceph-tools-7b96766574-wh7sf /]# ceph rbd task add flatten replicapool/tmp
{"sequence": 9, "id": "e2a2df32-3edb-48c4-b538-a41985232e99", "message": "Flattening image replicapool/tmp", "refs": {"action": "flatten", "pool_name": "replicapool", "pool_namespace": "", "image_name": "tmp", "image_id": "41c9e3608f0b"}}
[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 41c9e3608f0b
        block_name_prefix: rbd_data.41c9e3608f0b
        format: 2
        features: layering, deep-flatten, operations
        op_features: clone-child
        flags:
        create_timestamp: Tue Jul 27 11:29:40 2021
        access_timestamp: Tue Jul 27 11:29:40 2021
        modify_timestamp: Tue Jul 27 11:29:40 2021
        parent: replicapool/csi-vol-673a80bb-eea1-11eb-80f6-0242ac110006@18f674b3-62f5-4f3c-b248-add99476c0c0
        overlap: 1 GiB

(identical `rbd info replicapool/tmp` output repeated several times while the flatten ran)

[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to set image snapshot: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory
2021-07-27T11:31:04.642+0000 7f0802fef700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory
rbd: error opening image tmp: (2) No such file or directory

[root@rook-ceph-tools-7b96766574-wh7sf /]# rbd info replicapool/tmp
rbd image 'tmp':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 41c9e3608f0b
        block_name_prefix: rbd_data.41c9e3608f0b
        format: 2
        features: layering, deep-flatten
        op_features:
        flags:
        create_timestamp: Tue Jul 27 11:29:40 2021
        access_timestamp: Tue Jul 27 11:29:40 2021
        modify_timestamp: Tue Jul 27 11:29:40 2021
```

This was the reason for the failure in https://github.com/ceph/ceph-csi/issues/1883 and for the PVC-PVC clone stale images which I observed in a CI run.
Some of the scenarios in which this can occur: