Closed: kulkarnicr closed this issue 8 months ago.
I conducted an extra test: restored a snapshot (containing 20 files, not 1 million files) to 30 PVCs at a time. Observed that initially only 9 PVCs went to the Bound state; the remaining 21 PVCs were in the Pending state. Over time, the 21 pending PVCs started going to the Bound state. So it indicates that once the existing mmapplypolicy commands complete, the pending PVCs are picked up for the copy operation (i.e., mmapplypolicy).
So, as of now, it looks like we just need to improve the event/message for pending PVCs here, to indicate that the PVCs are pending because of the mmxcp 10-process limitation.
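As a quick check of what is surfaced today, the events for one of the pending PVCs can be listed directly (the PVC name here is taken from the test below); at present nothing mmxcp-related shows up in this output:
kubectl -n ibm-spectrum-scale-csi-driver get events --sort-by=.lastTimestamp --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=pvc11from-vs1-pvc51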
Restored the snapshot to 30 PVCs at a time:
[root@t-x-master 2021_03_08-04:51:57 test_snapshot]$ for i in $(seq 1 30); do kubectl -n ibm-spectrum-scale-csi-driver apply -f pvc${i}from-vs1-pvc51.yaml; done
persistentvolumeclaim/pvc1from-vs1-pvc51 created
persistentvolumeclaim/pvc2from-vs1-pvc51 created
persistentvolumeclaim/pvc3from-vs1-pvc51 created
persistentvolumeclaim/pvc4from-vs1-pvc51 created
persistentvolumeclaim/pvc5from-vs1-pvc51 created
persistentvolumeclaim/pvc6from-vs1-pvc51 created
persistentvolumeclaim/pvc7from-vs1-pvc51 created
persistentvolumeclaim/pvc8from-vs1-pvc51 created
persistentvolumeclaim/pvc9from-vs1-pvc51 created
persistentvolumeclaim/pvc10from-vs1-pvc51 created
persistentvolumeclaim/pvc11from-vs1-pvc51 created
persistentvolumeclaim/pvc12from-vs1-pvc51 created
persistentvolumeclaim/pvc13from-vs1-pvc51 created
persistentvolumeclaim/pvc14from-vs1-pvc51 created
persistentvolumeclaim/pvc15from-vs1-pvc51 created
persistentvolumeclaim/pvc16from-vs1-pvc51 created
persistentvolumeclaim/pvc17from-vs1-pvc51 created
persistentvolumeclaim/pvc18from-vs1-pvc51 created
persistentvolumeclaim/pvc19from-vs1-pvc51 created
persistentvolumeclaim/pvc20from-vs1-pvc51 created
persistentvolumeclaim/pvc21from-vs1-pvc51 created
persistentvolumeclaim/pvc22from-vs1-pvc51 created
persistentvolumeclaim/pvc23from-vs1-pvc51 created
persistentvolumeclaim/pvc24from-vs1-pvc51 created
persistentvolumeclaim/pvc25from-vs1-pvc51 created
persistentvolumeclaim/pvc26from-vs1-pvc51 created
persistentvolumeclaim/pvc27from-vs1-pvc51 created
persistentvolumeclaim/pvc28from-vs1-pvc51 created
persistentvolumeclaim/pvc29from-vs1-pvc51 created
persistentvolumeclaim/pvc30from-vs1-pvc51 created
[root@t-x-master 2021_03_08-04:52:30 test_snapshot]$
Tracking how many PVCs are Bound vs. Pending:
[root@t-x-master 2021_03_08-05:21:47 test_snapshot]$ while [[ True ]] ; do echo $(date) == $(kubectl -n ibm-spectrum-scale-csi-driver get pvc | grep from-vs1-pvc51 | grep Bound | wc -l) Bound == $(kubectl -n ibm-spectrum-scale-csi-driver get pvc | grep from-vs1-pvc51 | grep Pending | wc -l) Pending; sleep 60; done
Mon Mar 8 05:21:56 PST 2021 == 13 Bound == 17 Pending
Mon Mar 8 05:22:57 PST 2021 == 13 Bound == 17 Pending
Mon Mar 8 05:40:03 PST 2021 == 13 Bound == 17 Pending
Mon Mar 8 05:41:03 PST 2021 == 14 Bound == 16 Pending
Mon Mar 8 05:42:03 PST 2021 == 14 Bound == 16 Pending
Mon Mar 8 05:52:07 PST 2021 == 14 Bound == 16 Pending
Mon Mar 8 05:53:07 PST 2021 == 15 Bound == 15 Pending
Mon Mar 8 05:54:08 PST 2021 == 15 Bound == 15 Pending
Mon Mar 8 05:55:08 PST 2021 == 15 Bound == 15 Pending
Mon Mar 8 05:56:09 PST 2021 == 15 Bound == 15 Pending
Mon Mar 8 06:28:53 PST 2021 == 21 Bound == 9 Pending
Mon Mar 8 06:39:58 PST 2021 == 21 Bound == 9 Pending
Mon Mar 8 06:40:58 PST 2021 == 22 Bound == 8 Pending
Mon Mar 8 06:41:58 PST 2021 == 22 Bound == 8 Pending
Mon Mar 8 06:42:58 PST 2021 == 23 Bound == 7 Pending
Mon Mar 8 06:43:59 PST 2021 == 23 Bound == 7 Pending
Mon Mar 8 06:44:59 PST 2021 == 23 Bound == 7 Pending
Mon Mar 8 06:52:02 PST 2021 == 23 Bound == 7 Pending
Mon Mar 8 06:53:02 PST 2021 == 24 Bound == 6 Pending
Mon Mar 8 06:54:02 PST 2021 == 24 Bound == 6 Pending
Mon Mar 8 06:55:03 PST 2021 == 25 Bound == 5 Pending
Mon Mar 8 06:56:03 PST 2021 == 25 Bound == 5 Pending
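As a side note, instead of the polling loop above, kubectl 1.23+ can block until every restored PVC is Bound (same name matching as the loop uses):
kubectl -n ibm-spectrum-scale-csi-driver wait --for=jsonpath='{.status.phase}'=Bound --timeout=2h $(kubectl -n ibm-spectrum-scale-csi-driver get pvc -o name | grep from-vs1-pvc51)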
This limit is set in two places:
If the provisioner itself runs out of worker threads, it will keep the PVCs in the Pending state while control has not even reached the CSI plugin. So there isn't much the CSI driver can do to report this, but the provisioner logs should give some indication.
I am wondering when mmxcp can actually hit its limit, because the provisioner itself will not send more than 10 requests at a time. One possibility is that some lost, still-pending mmxcp job existed. But again, this is an interim state, and eventually the PVCs should go to the Bound state. A quick way to check both limits is sketched below.
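A rough sketch of those checks (this assumes the provisioner runs as a Deployment named after the pod seen in the events further below, and that the copy jobs are visible as mmapplypolicy/mmxcp processes on the Scale node):
# 1. Arguments the external provisioner sidecar was started with (e.g. any worker-thread setting):
kubectl -n ibm-spectrum-scale-csi-driver get deploy ibm-spectrum-scale-csi-provisioner -o jsonpath='{.spec.template.spec.containers[*].args}'
# 2. On a Spectrum Scale node, look for leftover or still-running copy processes:
ps -ef | grep -E 'mmapplypolicy|mmxcp' | grep -v grep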
Duplicate of #361
I tried restoring 30 PVCs from a single snapshot where the data written was 1000 files; all files get restored in each PVC, and no error message is shown like the one reported above.
root@saurabhmultiguiubu-master:~/saurabh/Upgradetesting# oc describe pvc ibm-spectrum-scale-pvc-from-snapshot-30
Name: ibm-spectrum-scale-pvc-from-snapshot-30
Namespace: ibm-spectrum-scale-csi-driver
StorageClass: ibm-spectrum-scale-csi-advance
Status: Bound
Volume: pvc-a712004c-5ec4-42fd-b52a-fe2427b67095
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: spectrumscale.csi.ibm.com
volume.kubernetes.io/storage-provisioner: spectrumscale.csi.ibm.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 1Gi
Access Modes: RWX
VolumeMode: Filesystem
DataSource:
APIGroup: snapshot.storage.k8s.io
Kind: VolumeSnapshot
Name: ibm-spectrum-scale-snapshot
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 36m spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-c48d8df47-hxcdf_2c9fc09f-1f31-4f4c-a02b-28158b41d903 External provisioner is provisioning volume for claim "ibm-spectrum-scale-csi-driver/ibm-spectrum-scale-pvc-from-snapshot-30"
Normal ExternalProvisioning 36m (x26 over 42m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "spectrumscale.csi.ibm.com" or manually created by system administrator
Normal ProvisioningSucceeded 35m spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-c48d8df47-hxcdf_2c9fc09f-1f31-4f4c-a02b-28158b41d903 Successfully provisioned volume pvc-a712004c-5ec4-42fd-b52a-fe2427b67095
As per the above comment, the issue is fixed and is no longer reproducible.
Describe the bug
Try to restore a snapshot to multiple PVCs at a time; we run into the mmxcp limitation here: "Number of parallel copy processes '10' is currently at the maximum number for this cluster.". However, "kubectl describe" doesn't show any event for the above limitation and just shows the PVC in the Pending state. We need to improve the error message here for "kubectl describe".
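For illustration only, an improved "kubectl describe" could surface an event along these lines (the reason string and wording below are invented for this example, not current driver output):
Warning  ProvisioningPending  30s  spectrumscale.csi.ibm.com  snapshot copy not started: Number of parallel copy processes '10' is currently at the maximum number for this cluster; the request will be retried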
To Reproduce
Steps to reproduce the behavior:
I observed that after a few PVCs, mmxcp used to fail; when I checked the job ID, it showed the output below.
I tried running that mmxcp command manually.
"kubectl describe" doesn't give any such message about the maximum of '10' allowed processes.
Expected behavior
Improve the error messages/events for the failure cases mentioned above.