IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.

After restoring a snapshot, PVCs remain in Pending state if the number of parallel copy processes is at the maximum level #383

Closed: kulkarnicr closed this issue 8 months ago

kulkarnicr commented 3 years ago

Describe the bug

Try to restore a snapshot to multiple PVCs at a time; we run into an mmxcp limitation here: "Number of parallel copy processes '10' is currently at the maximum number for this cluster.". However, "kubectl describe" does not show any event for this limitation and just shows the PVC in Pending state. We need to improve the error message surfaced through "kubectl describe".

To Reproduce
Steps to reproduce the behavior:

  1. Restore a snapshot to multiple PVCs in one go; a few PVCs remained in Pending state because mmxcp can run at most 10 parallel copy processes (see the sketch below).
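
A minimal way to drive this reproduction is a loop that creates many PVCs from the same VolumeSnapshot. This is only a sketch: the namespace, storage class, and snapshot names are taken from the outputs later in this thread and may differ in your setup.

for i in $(seq 1 30); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ibm-spectrum-scale-pvc-from-snapshot-$i
  namespace: ibm-spectrum-scale-csi-driver
spec:
  storageClassName: ibm-spectrum-scale-csi-advance
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: ibm-spectrum-scale-snapshot
EOF
done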

Expected behavior
Improve the error messages / events for the failure cases mentioned above.

Environment
Please run the following and paste your output here:

# Development
operator-sdk version 
go version

# Deployment
kubectl version
rpm -qa | grep gpfs

[root@t-x-master 2021_03_05-03:38:23 test_snapshot]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:28:09Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
[root@t-x-master 2021_03_05-03:38:24 test_snapshot]$ rpm -qa | grep gpfs

gpfs.base-5.1.1-0.210201.112049.x86_64
gpfs.license.dm-5.1.1-0.210201.112049.x86_64
gpfs.gss.pmcollector-5.1.1-0.el7.x86_64
gpfs.gskit-8.0.55-19.x86_64
gpfs.msg.en_US-5.1.1-0.210201.112049.noarch
gpfs.gpl-5.1.1-0.210201.112049.noarch
gpfs.adv-5.1.1-0.210201.112049.x86_64
gpfs.crypto-5.1.1-0.210201.112049.x86_64
gpfs.gss.pmsensors-5.1.1-0.el7.x86_64
gpfs.java-5.1.1-0.210201.112049.x86_64
gpfs.gui-5.1.1-0.210201.114540.noarch
gpfs.docs-5.1.1-0.210201.112049.noarch
gpfs.compression-5.1.1-0.210201.112049.x86_64

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

kulkarnicr commented 3 years ago

I conducted an additional test: restored a snapshot (containing 20 files, not 1 million) to 30 PVCs at a time. Initially only 9 PVCs went to Bound state, while the remaining 21 PVCs stayed in Pending state. Over time, the 21 pending PVCs started going to Bound state. This indicates that once the existing mmapplypolicy commands complete, the pending PVCs are picked up for the copy operation (i.e. mmapplypolicy).

So, as of now, it looks like we just need to improve the event/message for pending PVCs here (indicating that the PVCs are pending because of the mmxcp 10-process limitation).
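
A quick way to watch that behavior is to poll the PVC phases until the Pending claims drain into Bound; a small sketch, assuming the claims live in the ibm-spectrum-scale-csi-driver namespace used elsewhere in this thread:

# Print a count of PVCs per phase (Bound/Pending) every 30 seconds
while true; do
  kubectl get pvc -n ibm-spectrum-scale-csi-driver --no-headers | awk '{print $2}' | sort | uniq -c
  sleep 30
done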

smitaraut commented 3 years ago

This limit is set in two places:

  1. "--worker-threads=10" in provisioner
  2. Max no. of parallel mmxcp processes which is also 10

If the provisioner itself runs out of worker threads, it will keep the PVCs in Pending state even though control has not yet reached the CSI plugin. So there isn't much the CSI driver can do to report this, but the provisioner logs should give some indication.

I am wondering when mmxcp can hit its limit, because the provisioner itself will not send more than 10 requests at a time. One possibility is that some lost, still-pending mmxcp job existed. But again, this is an interim state and eventually the PVCs should go to Bound state.
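
If this is suspected, a few checks against the provisioner sidecar can help; a sketch only, where the namespace and pod name pattern are taken from the outputs in this thread and the grep patterns are just examples:

# Find the external provisioner pod
kubectl -n ibm-spectrum-scale-csi-driver get pods | grep provisioner

# Confirm how --worker-threads is set on the sidecar (10 per the comment above)
kubectl -n ibm-spectrum-scale-csi-driver get pod <provisioner-pod> -o yaml | grep worker-threads

# Look for retry/throttling messages while PVCs sit in Pending
kubectl -n ibm-spectrum-scale-csi-driver logs <provisioner-pod> --all-containers | grep -i -e error -e retry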

deeghuge commented 2 years ago

Duplicate of #361

saurabhwani5 commented 9 months ago

I tried restoring 30 PVCs from a single snapshot with 1000 files of written data. All files are getting restored in the PVCs, and no error message of the kind reported above is shown:

root@saurabhmultiguiubu-master:~/saurabh/Upgradetesting# oc describe pvc ibm-spectrum-scale-pvc-from-snapshot-30
Name:          ibm-spectrum-scale-pvc-from-snapshot-30
Namespace:     ibm-spectrum-scale-csi-driver
StorageClass:  ibm-spectrum-scale-csi-advance
Status:        Bound
Volume:        pvc-a712004c-5ec4-42fd-b52a-fe2427b67095
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: spectrumscale.csi.ibm.com
               volume.kubernetes.io/storage-provisioner: spectrumscale.csi.ibm.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWX
VolumeMode:    Filesystem
DataSource:
  APIGroup:  snapshot.storage.k8s.io
  Kind:      VolumeSnapshot
  Name:      ibm-spectrum-scale-snapshot
Used By:     <none>
Events:
  Type    Reason                 Age                 From                                                                                                               Message
  ----    ------                 ----                ----                                                                                                               -------
  Normal  Provisioning           36m                 spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-c48d8df47-hxcdf_2c9fc09f-1f31-4f4c-a02b-28158b41d903  External provisioner is provisioning volume for claim "ibm-spectrum-scale-csi-driver/ibm-spectrum-scale-pvc-from-snapshot-30"
  Normal  ExternalProvisioning   36m (x26 over 42m)  persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "spectrumscale.csi.ibm.com" or manually created by system administrator
  Normal  ProvisioningSucceeded  35m                 spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-c48d8df47-hxcdf_2c9fc09f-1f31-4f4c-a02b-28158b41d903  Successfully provisioned volume pvc-a712004c-5ec4-42fd-b52a-fe2427b67095
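
For completeness, one way to spot-check that the restored data actually landed in a claim is to mount it into a throwaway pod and count the files; a sketch only, with the pod name and mount path chosen here for illustration:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pvc-restore-check
  namespace: ibm-spectrum-scale-csi-driver
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: restored
      mountPath: /data
  volumes:
  - name: restored
    persistentVolumeClaim:
      claimName: ibm-spectrum-scale-pvc-from-snapshot-30
EOF

# Once the pod is Running, count the restored files
kubectl -n ibm-spectrum-scale-csi-driver exec pvc-restore-check -- sh -c 'ls /data | wc -l'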

deeghuge commented 8 months ago

As per the above comment, the issue is fixed and is no longer reproducible.