IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.

After restoring a snapshot, PVCs remain in Pending state if the number of parallel copy processes is at the maximum level #383

Closed: kulkarnicr closed this issue 8 months ago

kulkarnicr commented 3 years ago

Describe the bug

Try to restore a snapshot to multiple PVCs at a time; we run into an mmxcp limitation here: "Number of parallel copy processes '10' is currently at the maximum number for this cluster.". However, "kubectl describe" does not show any event for this limitation and just shows the PVC in Pending state. We need to improve the error message surfaced through "kubectl describe".

To Reproduce
Steps to reproduce the behavior:

  1. Restore a snapshot to multiple PVCs in one go; a few PVCs remained in Pending state because mmxcp can run at most 10 parallel copy processes (see the sketch below).
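
A minimal way to drive this reproduction is a loop that creates many PVCs from the same VolumeSnapshot. This is only a sketch: the namespace, storage class, and snapshot names are taken from the outputs later in this thread and may differ in your setup.

for i in $(seq 1 30); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ibm-spectrum-scale-pvc-from-snapshot-$i
  namespace: ibm-spectrum-scale-csi-driver
spec:
  storageClassName: ibm-spectrum-scale-csi-advance
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: ibm-spectrum-scale-snapshot
EOF
done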

Expected behavior
Improve the error messages / events for the failure cases mentioned above.

Environment
Please run the following and paste your output here:

# Development
operator-sdk version 
go version

# Deployment
kubectl version
rpm -qa | grep gpfs

[root@t-x-master 2021_03_05-03:38:23 test_snapshot]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:28:09Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
[root@t-x-master 2021_03_05-03:38:24 test_snapshot]$ rpm -qa | grep gpfs

gpfs.base-5.1.1-0.210201.112049.x86_64
gpfs.license.dm-5.1.1-0.210201.112049.x86_64
gpfs.gss.pmcollector-5.1.1-0.el7.x86_64
gpfs.gskit-8.0.55-19.x86_64
gpfs.msg.en_US-5.1.1-0.210201.112049.noarch
gpfs.gpl-5.1.1-0.210201.112049.noarch
gpfs.adv-5.1.1-0.210201.112049.x86_64
gpfs.crypto-5.1.1-0.210201.112049.x86_64
gpfs.gss.pmsensors-5.1.1-0.el7.x86_64
gpfs.java-5.1.1-0.210201.112049.x86_64
gpfs.gui-5.1.1-0.210201.114540.noarch
gpfs.docs-5.1.1-0.210201.112049.noarch
gpfs.compression-5.1.1-0.210201.112049.x86_64

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.

kulkarnicr commented 3 years ago

I conducted an additional test: restored a snapshot (containing 20 files, not 1 million) to 30 PVCs at a time. Initially only 9 PVCs went to Bound state, while the remaining 21 PVCs stayed in Pending state. Over time, the 21 pending PVCs started going to Bound state. This indicates that once the existing mmapplypolicy commands complete, the pending PVCs are picked up for the copy operation (i.e. mmapplypolicy).

So, as of now, it looks like we just need to improve the event/message for pending PVCs here (indicating that the PVCs are pending because of the mmxcp 10-process limitation).
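
A quick way to watch that behavior is to poll the PVC phases until the Pending claims drain into Bound; a small sketch, assuming the claims live in the ibm-spectrum-scale-csi-driver namespace used elsewhere in this thread:

# Print a count of PVCs per phase (Bound/Pending) every 30 seconds
while true; do
  kubectl get pvc -n ibm-spectrum-scale-csi-driver --no-headers | awk '{print $2}' | sort | uniq -c
  sleep 30
done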

smitaraut commented 3 years ago

This limit is set in two places:

  1. "--worker-threads=10" in provisioner
  2. Max no. of parallel mmxcp processes which is also 10

If the provisioner itself runs out of worker threads, it will keep the PVCs in Pending state even though control has not yet reached the CSI plugin. So there isn't much the CSI driver can do to report this, but the provisioner logs should give some indication.

I am wondering when mmxcp can hit its limit, because the provisioner itself will not send more than 10 requests at a time. One possibility is that some lost, still-pending mmxcp job existed. But again, this is an interim state and eventually the PVCs should go to Bound state.
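
If this is suspected, a few checks against the provisioner sidecar can help; a sketch only, where the namespace and pod name pattern are taken from the outputs in this thread and the grep patterns are just examples:

# Find the external provisioner pod
kubectl -n ibm-spectrum-scale-csi-driver get pods | grep provisioner

# Confirm how --worker-threads is set on the sidecar (10 per the comment above)
kubectl -n ibm-spectrum-scale-csi-driver get pod <provisioner-pod> -o yaml | grep worker-threads

# Look for retry/throttling messages while PVCs sit in Pending
kubectl -n ibm-spectrum-scale-csi-driver logs <provisioner-pod> --all-containers | grep -i -e error -e retry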

deeghuge commented 2 years ago

Duplicate of #361

saurabhwani5 commented 9 months ago

I tried restoring 30 PVCs from a single snapshot with 1000 files of written data. All files are getting restored in the PVCs, and no error message of the kind reported above is shown:

root@saurabhmultiguiubu-master:~/saurabh/Upgradetesting# oc describe pvc ibm-spectrum-scale-pvc-from-snapshot-30
Name:          ibm-spectrum-scale-pvc-from-snapshot-30
Namespace:     ibm-spectrum-scale-csi-driver
StorageClass:  ibm-spectrum-scale-csi-advance
Status:        Bound
Volume:        pvc-a712004c-5ec4-42fd-b52a-fe2427b67095
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: spectrumscale.csi.ibm.com
               volume.kubernetes.io/storage-provisioner: spectrumscale.csi.ibm.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWX
VolumeMode:    Filesystem
DataSource:
  APIGroup:  snapshot.storage.k8s.io
  Kind:      VolumeSnapshot
  Name:      ibm-spectrum-scale-snapshot
Used By:     <none>
Events:
  Type    Reason                 Age                 From                                                                                                               Message
  ----    ------                 ----                ----                                                                                                               -------
  Normal  Provisioning           36m                 spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-c48d8df47-hxcdf_2c9fc09f-1f31-4f4c-a02b-28158b41d903  External provisioner is provisioning volume for claim "ibm-spectrum-scale-csi-driver/ibm-spectrum-scale-pvc-from-snapshot-30"
  Normal  ExternalProvisioning   36m (x26 over 42m)  persistentvolume-controller                                                                                        waiting for a volume to be created, either by external provisioner "spectrumscale.csi.ibm.com" or manually created by system administrator
  Normal  ProvisioningSucceeded  35m                 spectrumscale.csi.ibm.com_ibm-spectrum-scale-csi-provisioner-c48d8df47-hxcdf_2c9fc09f-1f31-4f4c-a02b-28158b41d903  Successfully provisioned volume pvc-a712004c-5ec4-42fd-b52a-fe2427b67095
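
For completeness, one way to spot-check that the restored data actually landed in a claim is to mount it into a throwaway pod and count the files; a sketch only, with the pod name and mount path chosen here for illustration:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pvc-restore-check
  namespace: ibm-spectrum-scale-csi-driver
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: restored
      mountPath: /data
  volumes:
  - name: restored
    persistentVolumeClaim:
      claimName: ibm-spectrum-scale-pvc-from-snapshot-30
EOF

# Once the pod is Running, count the restored files
kubectl -n ibm-spectrum-scale-csi-driver exec pvc-restore-check -- sh -c 'ls /data | wc -l'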

deeghuge commented 8 months ago

As per the above comment, the issue is fixed and is no longer reproducible.