Closed Yuggupta27 closed 3 years ago
@Yuggupta27 looks like you have attached csi-cephfsplugin pod log, please attach the csi-cephfsplugin container logs of the provisioner pod.
On a second attempt to reproduce the issue...
[ygupta@localhost cephfs]$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
cephfs-pvc-clone1 Bound pvc-61f267a1-a636-404c-8d94-84ba5363142d 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone102892787266255242342232211 Bound pvc-f447148d-801a-4c2f-b4b5-102c57a3af09 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone1129102892787266255242342232211 Bound pvc-b96c84a6-8308-4f7e-ad9c-15854566c8d2 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone12301129102892787266255242342232211 Bound pvc-87a8c594-2c20-4efd-9c39-6e21617dad87 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone13323112301129102892787266255242342232211 Bound pvc-c0838a42-9740-4b44-af01-998202e2f80f 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone1413323112301129102892787266255242342232211 Bound pvc-f385f8e4-9b13-46b5-976d-d0c0a5a485ee 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone15331413323112301129102892787266255242342232211 Bound pvc-40a97e8c-8c3d-4d9b-a557-3106b60c9243 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone163415331413323112301129102892787266255242342232211 Bound pvc-97df7303-7e9c-45f9-ac22-a43e24222f6c 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone1735163415331413323112301129102892787266255242342232211 Bound pvc-757e7511-1f97-48a6-8aa7-43327373979a 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone18361735163415331413323112301129102892787266255242342232211 Bound pvc-a0f287ec-6452-4656-8978-e98f1ad789db 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone193718361735163415331413323112301129102892787266255242342232211 Bound pvc-c167dc59-87ff-4d4c-b3cc-c681ec893c48 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone2211 Bound pvc-ef83be74-1dd6-4940-a762-509b3d9b77b0 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone2232211 Bound pvc-db86ba32-15d9-419a-a353-206291d1b78f 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone2342232211 Bound pvc-4976d4fc-690f-41ef-b319-c295e95b58b6 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone242342232211 Bound pvc-c836b27d-be65-4639-9d18-62e1bc11073c 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone2787266255242342232211 Bound pvc-d5e6824a-54f0-4c23-a6af-7a81fe21441d 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone2892787266255242342232211 Bound pvc-2382f187-c4f8-4626-84a5-3f4d068b59c9 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone29102892787266255242342232211 Bound pvc-71219ad9-1d8d-4582-81f6-a9a9460f7bd6 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone301129102892787266255242342232211 Bound pvc-c04b2f88-f421-4381-8338-984bb9568a94 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone3112301129102892787266255242342232211 Bound pvc-8721881d-1204-4254-a602-1e8fe760004b 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone331413323112301129102892787266255242342232211 Bound pvc-76f586bb-d7fc-4177-bdb7-74bf122ba689 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone3415331413323112301129102892787266255242342232211 Bound pvc-515994b0-a23b-4ad4-bf7b-42e0562b3efe 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone35163415331413323112301129102892787266255242342232211 Bound pvc-92f4dbd0-eb36-4741-ae65-82ab1b448a18 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone361735163415331413323112301129102892787266255242342232211 Bound pvc-28190f41-33d5-4082-9932-8c7295dfe87b 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone3718361735163415331413323112301129102892787266255242342232211 Bound pvc-57b33bfc-cd2d-4998-84ec-2ea72a59621c 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone38193718361735163415331413323112301129102892787266255242342232211 Bound pvc-e7266d6f-d533-4453-a2f4-42e995118def 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone3938193718361735163415331413323112301129102892787266255242342232211 Bound pvc-77f11e05-fddc-480b-b93b-84ed0a457f1c 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone403938193718361735163415331413323112301129102892787266255242342232211 Bound pvc-443229dd-7ec7-44bf-ba20-196b6bc44eaa 1Gi RWX csi-cephfs-sc 15m
cephfs-pvc-clone42232211 Bound pvc-8ab86d5f-c019-427c-a21b-09f013bc2abf 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone5242342232211 Bound pvc-f1f61bc3-5ce7-4073-8531-3b0c9780684a 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone6255242342232211 Bound pvc-26da4827-c853-4330-9959-73585e941602 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone7266255242342232211 Bound pvc-0fa87714-280a-46df-b8b0-24aa20b8ea6d 1Gi RWX csi-cephfs-sc 16m
cephfs-pvc-clone92787266255242342232211 Bound pvc-b02780f2-8201-4276-af00-3493f3ad8a42 1Gi RWX csi-cephfs-sc 16m
csi-cephfs-pvc
[ygupta@localhost cephfs]$ kubectl get pvc | grep csi-cephfs-sc | wc -l
34
2. The subvolume count this time:
sh-4.2# ceph fs subvolume ls myfs mysubgrp | grep name
"name": "csi-vol-1998c5fb-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-195bf987-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-166537c7-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-134d7114-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-12294101-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-143156ef-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-14503b77-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-168e569b-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-14004609-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1ae931fb-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-166e6a9a-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-057df06d-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-17c3350d-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-01478843-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1876c60b-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-0f16a9b4-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1a089691-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-167eb46c-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1408d7fc-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1a360ecf-12b0-11eb-a42a-0242ac11000e"
"name": "csi-vol-169e4d1a-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-fd76e71a-12ba-11eb-a42a-0242ac11000e"
"name": "csi-vol-16e42c18-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-18a78fe3-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-12e0f93a-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-0d311bf9-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1199e328-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-18185a02-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-0803a2cd-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-1949adf6-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-14f71aed-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-0db559c0-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-092c63da-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-15e43f06-12bd-11eb-a42a-0242ac11000e"
"name": "csi-vol-0ab05bae-12bd-11eb-a42a-0242ac11000e"
sh-4.2# ceph fs subvolume ls myfs mysubgrp | grep name | wc -l
35
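One way to pin down exactly which subvolumes are leaked is to diff the names backing the bound PVs against the `ceph fs subvolume ls` output. A minimal sketch in Go (not part of ceph-csi; the input slices are assumed to be collected beforehand, e.g. via kubectl and the toolbox):

```go
package main

import "fmt"

// leaked returns subvolume names present on the cluster but not backed
// by any PV. Both inputs are assumed to be gathered externally, e.g.
// from `kubectl get pv -o yaml` and `ceph fs subvolume ls`.
func leaked(pvBacked, onCluster []string) []string {
	known := make(map[string]bool, len(pvBacked))
	for _, n := range pvBacked {
		known[n] = true
	}
	var extra []string
	for _, n := range onCluster {
		if !known[n] {
			extra = append(extra, n)
		}
	}
	return extra
}

func main() {
	pvs := []string{"csi-vol-a", "csi-vol-b"}
	cluster := []string{"csi-vol-a", "csi-vol-b", "csi-vol-stale"}
	fmt.Println(leaked(pvs, cluster)) // prints [csi-vol-stale]
}
```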
Logs:
1. [csi-provisioner](https://termbin.com/6ru0)
2. [csi-cephfsplugin](https://termbin.com/nmpy)
The clone from volume is retried (as the first call takes time to complete), and the call to ensure the intermediate snapshot is deleted is failing with EINVAL (22) rather than the expected string failure, causing CreateVolume (which has actually succeeded) to never return a success.
So in this case the clone is found correctly using the OMap journals, but the attempt to ensure that the intermediate snapshot used to create the clone is cleaned up fails in cleanupCloneFromSubvolumeSnapshot -> getSnapshotInfo.
This could be an artifact of the Ceph version in play, in terms of the error returned (which can be checked by running the command from the toolbox).
The exact log line where the failure starts occurring is:
E1020 08:44:02.998228 1 snapshot.go:139] ID: 166 Req-ID: pvc-c0b06217-9d54-4829-b223-6b04dde13d5f failed to get subvolume snapshot info csi-vol-1a360ecf-12b0-11eb-a42a-0242ac11000e csi-vol-9b76f79e-12ae-11eb-a42a-0242ac11000e(an error (exit status 22) occurred while running ceph args: [fs subvolume snapshot info myfs csi-vol-9b76f79e-12ae-11eb-a42a-0242ac11000e csi-vol-1a360ecf-12b0-11eb-a42a-0242ac11000e --group_name mysubgrp -m 10.96.131.184:6789 -c /etc/ceph/ceph.conf -n client.admin --keyfile=***stripped*** --format=json]) in fs myfs
One interesting fact is that we do not see a PVC stuck in Pending state for this create, even though the create never completed successfully. Unsure why.
Thanks for looking into it @ShyamsundarR. The Ceph version I used was v14.2.8, which does not have the snapshot info command. It might be failing due to that.
@humblec, do we have a version check when we run the check for snapshot info?
Also, I will try to reproduce the issue against ceph-master again and verify whether the issue persists.
Tested another scenario where this issue arises:
[ygupta@localhost cephfs]$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
cephfs-pvc-clone Pending csi-cephfs-sc 37m
csi-cephfs-pvc Bound pvc-2ecc5d49-2f89-412e-ad3a-a0e11480b85f 5Gi RWX csi-cephfs-sc 42m
[ygupta@localhost cephfs]$ kubectl describe pvc cephfs-pvc-clone
Name: cephfs-pvc-clone
Namespace: default
StorageClass: csi-cephfs-sc
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: cephfs.csi.ceph.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
DataSource:
Kind: PersistentVolumeClaim
Name: csi-cephfs-pvc
Mounted By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 37m (x5 over 37m) cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-7bfd557f7f-qhrbj_e80aaf94-1270-4b06-a22d-2c34b80fb4a7 failed to provision volume with StorageClass "csi-cephfs-sc": rpc error: code = Aborted desc = in progress
Warning ProvisioningFailed 8m49s (x10 over 37m) cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-7bfd557f7f-qhrbj_e80aaf94-1270-4b06-a22d-2c34b80fb4a7 failed to provision volume with StorageClass "csi-cephfs-sc": rpc error: code = Internal desc = an error (exit status 22) occurred while running ceph args: [fs subvolume snapshot info myfs csi-vol-8aaec6cf-1369-11eb-a42a-0242ac11000e csi-vol-2fc0fdb4-136a-11eb-a42a-0242ac11000e --group_name mysubgrp -m 10.96.131.184:6789 -c /etc/ceph/ceph.conf -n client.admin --keyfile=***stripped*** --format=json]
Normal Provisioning 3m49s (x16 over 38m) cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-7bfd557f7f-qhrbj_e80aaf94-1270-4b06-a22d-2c34b80fb4a7 External provisioner is provisioning volume for claim "default/cephfs-pvc-clone"
Normal ExternalProvisioning 3m (x142 over 38m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "cephfs.csi.ceph.com" or manually created by system administrator
[ygupta@localhost cephfs]$
It might again be because the snapshot info command is called even though it is not supported in Ceph v14.2.8.
cc @ShyamsundarR @Madhu-1
@Yuggupta27 Thanks for looking into it. This looks to be the issue. Please send a patch to fix it: if the snapshot info command is not present, try to unprotect the snapshot and make decisions based on the error message.
Sure @Madhu-1 :+1:
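The fallback described above (try snapshot info first; if the command is unsupported, fall back to unprotect and decide from its error) could be sketched roughly as below. All identifiers here are illustrative, not the actual ceph-csi API:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative sentinel errors; names are assumptions, not ceph-csi
// identifiers.
var (
	errNotSupported = errors.New("invalid command") // exit status 22 on old Ceph
	errNotFound     = errors.New("snapshot does not exist")
)

// cleanupSnapshot sketches the suggested decision flow for removing the
// intermediate snapshot. infoErr is the result of "snapshot info";
// unprotectErr is the result of a fallback "snapshot unprotect".
func cleanupSnapshot(infoErr, unprotectErr error) (string, error) {
	switch {
	case infoErr == nil:
		return "snapshot exists, delete it", nil
	case errors.Is(infoErr, errNotFound):
		return "already deleted, nothing to do", nil
	case errors.Is(infoErr, errNotSupported):
		// Ceph too old for "snapshot info": decide based on the
		// unprotect call's outcome instead.
		if unprotectErr == nil || errors.Is(unprotectErr, errNotFound) {
			return "handled via unprotect fallback", nil
		}
		return "", unprotectErr
	default:
		return "", infoErr
	}
}

func main() {
	msg, _ := cleanupSnapshot(errNotSupported, nil)
	fmt.Println(msg) // prints: handled via unprotect fallback
}
```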
@Yuggupta27 Looking at the code, the above issue exists, but it looks like it won't leak the OMap data. Please check and try to reproduce it again to see where else the OMap data is getting leaked.
This issue is now fixed by the following PRs: https://github.com/ceph/ceph-csi/pull/1674 , https://github.com/ceph/ceph-csi/pull/1671 and https://github.com/ceph/ceph-csi/pull/1660 . As the testing indicates that this issue doesn't exist anymore, closing this for now.
Describe the bug
On creation of multiple clones of a PVC in parallel (37 clones, to be exact), the subvolume count came to be 39.
Environment details
Mounter used for mounting the volume (for cephfs it is fuse or kernel; for rbd it is krbd or rbd-nbd):
Steps to reproduce
Steps to reproduce the behavior:
Create multiple parallel clones of a PVC.
Actual results
Expected behavior
The number of subvolumes should not exceed the number of PVCs created.
Logs
csi-provisioner: https://termbin.com/8d5x
csi-cephfsplugin: https://termbin.com/8m8oz
Additional context
The PVCs created were of 1GB size with 10MB data.
Also, description of the only Pending PVC: