IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.
Apache License 2.0

PVC Clone Stuck Pending #843

Closed Tristan-Le1 closed 7 months ago

Tristan-Le1 commented 1 year ago

Bug Description

Ten fileset PVCs were created successfully and are in the Bound state.

PVC cloning was working. Note that three PVCs cloned from 280-f-fileset-dplmnt1-pvc1 named 280-f-fileset-dplmnt1-pvc1-clone-(number) are Bound.

The fourth cloned fileset PVC (280-f-fileset-dplmnt1-pvc1-clone-4) has been stuck in Pending state for over 2 hours.

NAMESPACE         NAME                                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                       AGE
280-f-namespace   280-f-fileset-dplmnt1-pvc1           Bound     pvc-b190dbe6-9a96-4b26-9ed7-f37d2dea46cb   1Gi        RWX            280-f-fileset-csi-spectrum-scale   155m
280-f-namespace   280-f-fileset-dplmnt1-pvc1-clone-1   Bound     pvc-544530a6-4704-45e5-b457-9a0a9c2328f8   1Gi        RWX            280-f-fileset-csi-spectrum-scale   151m
280-f-namespace   280-f-fileset-dplmnt1-pvc1-clone-2   Bound     pvc-c8f779dd-e63c-4411-abed-899a63dc24f8   1Gi        RWX            280-f-fileset-csi-spectrum-scale   150m
280-f-namespace   280-f-fileset-dplmnt1-pvc1-clone-3   Bound     pvc-a51e4109-e8be-4922-a570-c62844ea0c0c   1Gi        RWX            280-f-fileset-csi-spectrum-scale   150m
280-f-namespace   280-f-fileset-dplmnt1-pvc1-clone-4   Pending                                                                        280-f-fileset-csi-spectrum-scale   150m
280-f-namespace   280-f-fileset-dplmnt1-pvc2           Bound     pvc-5428166e-296e-44bc-95c8-af58758c0c4a   1Gi        RWX            280-f-fileset-csi-spectrum-scale   155m
280-f-namespace   280-f-fileset-dplmnt2-pvc1           Bound     pvc-92001e9b-3d28-4ba3-8f4b-75023a7138a0   1Gi        RWX            280-f-fileset-csi-spectrum-scale   154m
280-f-namespace   280-f-fileset-dplmnt2-pvc2           Bound     pvc-0026f214-b93f-4603-933c-19bdba5ab279   1Gi        RWX            280-f-fileset-csi-spectrum-scale   154m
280-f-namespace   280-f-fileset-dplmnt3-pvc1           Bound     pvc-2abc485d-d483-4d92-8f5b-48364c5283ab   1Gi        RWX            280-f-fileset-csi-spectrum-scale   154m
280-f-namespace   280-f-fileset-dplmnt3-pvc2           Bound     pvc-8bae9e63-22b3-4c07-883b-70a738e962ec   1Gi        RWX            280-f-fileset-csi-spectrum-scale   154m
280-f-namespace   280-f-fileset-dplmnt4-pvc1           Bound     pvc-c722c0e0-496d-4b02-a536-1d5d4a6ff5ce   1Gi        RWX            280-f-fileset-csi-spectrum-scale   153m
280-f-namespace   280-f-fileset-dplmnt4-pvc2           Bound     pvc-54cb8e3e-82c5-412b-b5c1-9a788f776ed0   1Gi        RWX            280-f-fileset-csi-spectrum-scale   153m
280-f-namespace   280-f-fileset-dplmnt5-pvc1           Bound     pvc-f528dafb-82e2-4759-8395-bb51b994edad   1Gi        RWX            280-f-fileset-csi-spectrum-scale   152m
280-f-namespace   280-f-fileset-dplmnt5-pvc2           Bound     pvc-de34b058-ca82-4c42-b39b-1a9c54f287e4   1Gi        RWX            280-f-fileset-csi-spectrum-scale   152m
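
With a longer listing than the one above, a small filter can pick out the claims that are not Bound. The following is only a sketch: it assumes the text comes from `kubectl get pvc -A` with the default columns, and the function name `non_bound_claims` is made up for illustration. It reads just the first three columns, because a Pending row leaves VOLUME, CAPACITY and ACCESS MODES empty.

```python
def non_bound_claims(table_text):
    """Return (namespace, name, status) for each PVC row whose STATUS is not Bound."""
    rows = []
    for line in table_text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a PVC row.
        if len(fields) < 3 or fields[0] == "NAMESPACE":
            continue
        namespace, name, status = fields[:3]
        if status != "Bound":
            rows.append((namespace, name, status))
    return rows


# Three rows copied from the listing in this issue.
sample = """\
NAMESPACE         NAME                                 STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                       AGE
280-f-namespace   280-f-fileset-dplmnt1-pvc1-clone-3   Bound     pvc-a51e4109-e8be-4922-a570-c62844ea0c0c   1Gi        RWX            280-f-fileset-csi-spectrum-scale   150m
280-f-namespace   280-f-fileset-dplmnt1-pvc1-clone-4   Pending                                                                        280-f-fileset-csi-spectrum-scale   150m
280-f-namespace   280-f-fileset-dplmnt1-pvc2           Bound     pvc-5428166e-296e-44bc-95c8-af58758c0c4a   1Gi        RWX            280-f-fileset-csi-spectrum-scale   155m
"""

for namespace, name, status in non_bound_claims(sample):
    print(f"{namespace}/{name}: {status}")
```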

A describe of the stuck PVC clone shows these events:

  Type    Reason                Age                  From                                                                                                                Message
  ----    ------                ----                 ----                                                                                                                -------
  Normal  Provisioning          26s (x52 over 165m)  External provisioner is provisioning volume for claim "280-f-namespace/280-f-fileset-dplmnt1-pvc1-clone-4"
  Normal  ExternalProvisioning  3s (x662 over 165m)  persistentvolume-controller                                                                                         waiting for a volume to be created, either by external provisioner "" or manually created by system administrator

To Reproduce

The scripts used to create the PVCs reside on the system (Archie), PWD: /ibm/fs0/real-world-tests/launchers/creation_of_pvc_and_deployments

The script can be launched to create fileset PVCs with a command such as: ./ -f fs0 -g (GUI IP) -u (GUI Username):(GUI Password) -P fileset-10-Mi -t blast-write-loop -n 280-f -d 5 -p 2 -s t

The scripts used to clone the PVCs reside on the system (Archie), PWD: /ibm/fs0/real-world-tests/launchers/cloning

The script can be launched to clone PVCs with a command such as: ./ -n 280-f -c 280-f-fileset-csi-spectrum-scale -N 30 -P 10

Expected Behavior

The cloned PVC should become Bound.


Scale State:

 Node number  Node name       GPFS state  
           1  archiensd01-40  active
           5  archieprt03-40  active
           6  archieprt04-40  active
           9  archieprt08-40  active
          10  archieprt09-40  active
          11  archieprt05-40  active
          12  archieprt06-40  active
          14  archiensd02-40  active

Scale Health:

Node name:      archiensd01-40
Node status:    TIPS
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           TIPS          23 hours ago      callhome_not_enabled
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     HEALTHY       1 day ago         -
DISK           HEALTHY       1 day ago         -
FILESYSMGR     HEALTHY       1 day ago         -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      archiensd02-40
Node status:    HEALTHY
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       23 hours ago      -
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     HEALTHY       1 day ago         -
DISK           HEALTHY       1 day ago         -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      archieprt03-40
Node status:    HEALTHY
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       23 hours ago      -
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     HEALTHY       1 day ago         -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      archieprt04-40
Node status:    HEALTHY
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       23 hours ago      -
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     HEALTHY       1 day ago         -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      archieprt05-40
Node status:    HEALTHY
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       23 hours ago      -
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     HEALTHY       1 day ago         -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      archieprt06-40
Node status:    HEALTHY
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       23 hours ago      -
NETWORK        HEALTHY       1 day ago         -
FILESYSTEM     HEALTHY       1 day ago         -
PERFMON        HEALTHY       1 day ago         -
THRESHOLD      HEALTHY       1 day ago         -

Node name:      archieprt08-40
Node status:    HEALTHY
Status Change:  23 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       23 hours ago      -
NETWORK        HEALTHY       23 hours ago      -
FILESYSTEM     HEALTHY       23 hours ago      -
GUI            HEALTHY       21 hours ago      -
PERFMON        HEALTHY       23 hours ago      -
THRESHOLD      HEALTHY       23 hours ago      -

Node name:      archieprt09-40
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons & Notices
GPFS           HEALTHY       21 hours ago      -
NETWORK        HEALTHY       21 hours ago      -
FILESYSTEM     HEALTHY       21 hours ago      -
GUI            HEALTHY       21 hours ago      -
PERFMON        HEALTHY       21 hours ago      -
THRESHOLD      HEALTHY       21 hours ago      -

Kubernetes State:

NAME          STATUS   ROLES                  AGE    VERSION   LABELS
archieprt03   Ready    control-plane,master   419d   v1.23.1,,,,,,,
archieprt04   Ready    <none>                 419d   v1.23.1,,,,,scale=true
archieprt05   Ready    <none>                 419d   v1.23.1,,,,,scale=true
archieprt06   Ready    <none>                 419d   v1.23.1,,,,,scale=true

CSI State:

NAME                                                      READY   STATUS    RESTARTS   AGE
pod/ibm-spectrum-scale-csi-69ns9                          3/3     Running   0          26h
pod/ibm-spectrum-scale-csi-attacher-c47bd8698-gnznh       1/1     Running   0          26h
pod/ibm-spectrum-scale-csi-attacher-c47bd8698-zfk6n       1/1     Running   0          26h
pod/ibm-spectrum-scale-csi-operator-6dd549cdc5-hkvs7      1/1     Running   0          27h
pod/ibm-spectrum-scale-csi-provisioner-5bfc9878c9-jb7nx   1/1     Running   0          26h
pod/ibm-spectrum-scale-csi-resizer-55d86955b7-z9wvx       1/1     Running   0          26h
pod/ibm-spectrum-scale-csi-snapshotter-8555c6b66d-7mccz   1/1     Running   0          26h
pod/ibm-spectrum-scale-csi-wxtbz                          3/3     Running   0          26h
pod/ibm-spectrum-scale-csi-xmsqn                          3/3     Running   0          26h

NAME                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/ibm-spectrum-scale-csi   3         3         3       3            3           scale=true      26h

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ibm-spectrum-scale-csi-attacher      2/2     2            2           26h
deployment.apps/ibm-spectrum-scale-csi-operator      1/1     1            1           27h
deployment.apps/ibm-spectrum-scale-csi-provisioner   1/1     1            1           26h
deployment.apps/ibm-spectrum-scale-csi-resizer       1/1     1            1           26h
deployment.apps/ibm-spectrum-scale-csi-snapshotter   1/1     1            1           26h

NAME                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/ibm-spectrum-scale-csi-attacher-c47bd8698       2         2         2       26h
replicaset.apps/ibm-spectrum-scale-csi-operator-6dd549cdc5      1         1         1       27h
replicaset.apps/ibm-spectrum-scale-csi-provisioner-5bfc9878c9   1         1         1       26h
replicaset.apps/ibm-spectrum-scale-csi-resizer-55d86955b7       1         1         1       26h
replicaset.apps/ibm-spectrum-scale-csi-snapshotter-8555c6b66d   1         1         1       26h

Red Hat Version, Kernel, and Scale Version:

archiensd01-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archiensd01-40:  Linux archiensd01 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archiensd01-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archiensd02-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archiensd02-40:  Linux archiensd02 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Aug 31 20:15:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
archiensd02-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archieprt03-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archieprt03-40:  Linux archieprt03 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archieprt03-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archieprt04-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archieprt04-40:  Linux archieprt04 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archieprt04-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archieprt05-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archieprt05-40:  Linux archieprt05 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archieprt05-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archieprt06-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archieprt06-40:  Linux archieprt06 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archieprt06-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archieprt08-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archieprt08-40:  Linux archieprt08 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archieprt08-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm
archieprt09-40:  CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
archieprt09-40:  Linux archieprt09 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
archieprt09-40:  Source RPM  : gpfs.base-5.1.6-0.src.rpm

Additional Context

A snap of the logs is resident on the system (Archie), PWD: /ibm/fs0/CSI/CSI-2.8.0-301122/ibm-spectrum-scale-csi/tools/ibm-spectrum-scale-csi-logs_12-01-2022-13\:48\:09/

amdabhad commented 1 year ago

This issue is due to an mmxcp failure, and there is a known issue for it: cloning fails when mmxcp fails. In this case the user has to delete the PVC and retry the cloning.
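
For the retry step, the clone request itself can be sketched as follows. This is not taken from the cloning script used in this issue, whose contents are not shown here. It assumes the standard Kubernetes way of cloning, where the new claim names the source PVC in `spec.dataSource`, and the helper name `clone_claim_manifest` is hypothetical. The names, size and access mode are the ones from the listing in this issue.

```python
import json


def clone_claim_manifest(source_name, clone_name, namespace, storage_class, size):
    """Build a PVC manifest that asks the provisioner to clone an existing claim."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": clone_name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteMany"],  # shown as RWX in the listing
            "resources": {"requests": {"storage": size}},
            # The source claim must be in the same namespace as the clone.
            "dataSource": {"kind": "PersistentVolumeClaim", "name": source_name},
        },
    }


manifest = clone_claim_manifest(
    source_name="280-f-fileset-dplmnt1-pvc1",
    clone_name="280-f-fileset-dplmnt1-pvc1-clone-4",
    namespace="280-f-namespace",
    storage_class="280-f-fileset-csi-spectrum-scale",
    size="1Gi",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -` once the stuck claim of the same name has been deleted.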

Checking further into why mmxcp is failing with:

[EFSSA0069C Command execution error: [E] Summary of errors:: _bunches of PDRs with errors:2.

Tristan-Le1 commented 1 year ago

While attempting to delete the offending PVC, I was able to delete both the stuck Pending clone and the Bound cloned PVCs, but the original PVC that they were cloned from (280-f-fileset-dplmnt1-pvc1) has been stuck in Terminating state for over 20 minutes.

NAMESPACE         NAME                         STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                       AGE
280-f-namespace   280-f-fileset-dplmnt1-pvc1   Terminating   pvc-b190dbe6-9a96-4b26-9ed7-f37d2dea46cb   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt1-pvc2   Bound         pvc-5428166e-296e-44bc-95c8-af58758c0c4a   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt2-pvc1   Bound         pvc-92001e9b-3d28-4ba3-8f4b-75023a7138a0   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt2-pvc2   Bound         pvc-0026f214-b93f-4603-933c-19bdba5ab279   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt3-pvc1   Bound         pvc-2abc485d-d483-4d92-8f5b-48364c5283ab   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt3-pvc2   Bound         pvc-8bae9e63-22b3-4c07-883b-70a738e962ec   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt4-pvc1   Bound         pvc-c722c0e0-496d-4b02-a536-1d5d4a6ff5ce   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt4-pvc2   Bound         pvc-54cb8e3e-82c5-412b-b5c1-9a788f776ed0   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt5-pvc1   Bound         pvc-f528dafb-82e2-4759-8395-bb51b994edad   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h
280-f-namespace   280-f-fileset-dplmnt5-pvc2   Bound         pvc-de34b058-ca82-4c42-b39b-1a9c54f287e4   1Gi        RWX            280-f-fileset-csi-spectrum-scale   24h

Is there a suggested course of action here?

amdabhad commented 1 year ago

It should get deleted after some time, unless a pod is using that PVC. In that case you need to delete the pod first and then the PVC.
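
To see whether a pod is still holding the claim, the pod specs in the namespace can be checked for a volume that references it. A minimal sketch, assuming the input is the parsed output of `kubectl get pods -n 280-f-namespace -o json`; the function name `pods_using_claim` and the pod names in the sample are made up for illustration.

```python
def pods_using_claim(pod_list, claim_name):
    """Return the names of pods that mount the given PVC through spec.volumes."""
    names = []
    for pod in pod_list.get("items", []):
        for volume in pod.get("spec", {}).get("volumes", []):
            pvc = volume.get("persistentVolumeClaim")
            if pvc and pvc.get("claimName") == claim_name:
                names.append(pod["metadata"]["name"])
                break  # one match is enough for this pod
    return names


# Hypothetical pod list, trimmed to the fields the function reads.
sample_pods = {
    "items": [
        {
            "metadata": {"name": "example-pod-a"},
            "spec": {
                "volumes": [
                    {
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": "280-f-fileset-dplmnt1-pvc1"},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "example-pod-b"},
            "spec": {"volumes": [{"name": "scratch", "emptyDir": {}}]},
        },
    ]
}

print(pods_using_claim(sample_pods, "280-f-fileset-dplmnt1-pvc1"))
```

Any pod this returns would need to be deleted before the claim can leave the Terminating state.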

Tristan-Le1 commented 1 year ago

I was able to delete the offending PVC. The same issue then occurred again with a different PVC, and I was able to delete that one too. After that, cloning worked as expected. Each time, the offending PVC took a long time to delete, while the cloned PVCs deleted normally. I am finishing up the cloning test now, after getting through the errors.

amdabhad commented 1 year ago

Checked with Dan McNichol: the above mmxcp error occurred because GPFS was down for a bit on a node while the mmxcp job was running.
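
One way to catch this before starting a clone run is to look for nodes whose GPFS state is not active. The sketch below assumes the text has the same three-column layout as the Scale State listing earlier in this issue. The function name `inactive_nodes` is hypothetical, and the `down` row in the sample is invented for illustration, since every node was active when the listing was taken.

```python
def inactive_nodes(state_text):
    """Return (node_name, gpfs_state) for each node that is not 'active'."""
    not_active = []
    for line in state_text.splitlines():
        fields = line.split()
        # Data rows start with the numeric node number; header lines do not.
        if len(fields) >= 3 and fields[0].isdigit():
            node_name, gpfs_state = fields[1], fields[2]
            if gpfs_state != "active":
                not_active.append((node_name, gpfs_state))
    return not_active


# Layout copied from this issue; the 'down' state on the last row is hypothetical.
sample_state = """\
 Node number  Node name       GPFS state
           1  archiensd01-40  active
           5  archieprt03-40  active
           6  archieprt04-40  down
"""

print(inactive_nodes(sample_state))
```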

amdabhad commented 1 year ago

@Tristan-Le1, can you please copy the following to some path under /u/DUMPS/ and update the issue with that path? Thank you!

Tristan-Le1 commented 1 year ago

The requested files and outputs are resident on the system (Archie), PWD: /u/DUMPS/CSI_ISSUE_843

Just something to note, the cso yaml still came out as short as it did before.

Jainbrt commented 1 year ago

@Tristan-Le1 could you please revisit this and see whether the issue is still valid?

Tristan-Le1 commented 1 year ago

Just noting here so everyone is updated: Abhishek and I have conferred via email on this issue. It is to be fixed in a later iteration of CSI. When that becomes available, I will attempt to recreate the problem and close the issue.

amdabhad commented 1 year ago

Readjusting the labels, as there is a workaround documented for this, and also to match another issue. @Jainbrt please revert if you disagree.

deeghuge commented 7 months ago

Closing, since there have been no updates for a long time.