IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.
Apache License 2.0

NodePublishVolume is taking approx. 3 minutes to complete #1104

Closed: saurabhwani5 closed this issue 7 months ago

saurabhwani5 commented 7 months ago

Describe the bug

While running automation with seven test workers in parallel, NodePublishVolume took approximately 3 minutes to complete, as shown in the following log:

[root@saurabh29-master ~]# oc logs ibm-spectrum-scale-csi-fkdd9  | grep cde0e37b-9b7c-456b-8f7a-aee54753d1cb
I0227 10:55:42.099117       1 utils.go:61] [cde0e37b-9b7c-456b-8f7a-aee54753d1cb] GRPC call: /csi.v1.Node/NodePublishVolume
I0227 10:55:42.099292       1 nodeserver.go:142] [cde0e37b-9b7c-456b-8f7a-aee54753d1cb] NodePublishVolume - request: &csi.NodePublishVolumeRequest{VolumeId:"1;1;4181049054023231843;263B0B0A:6589AA52;cg-szcwq;pvc-2b382989-28e8-4dd7-9c59-2b92998b226d;/ibm/remotetest/cg-szcwq/pvc-2b382989-28e8-4dd7-9c59-2b92998b226d", PublishContext:map[string]string(nil), StagingTargetPath:"", TargetPath:"/var/lib/kubelet/pods/45a8a72c-93ce-4820-ab44-513be80de3f6/volumes/kubernetes.io~csi/pvc-2b382989-28e8-4dd7-9c59-2b92998b226d/mount", VolumeCapability:(*csi.VolumeCapability)(0xc0006e6280), Readonly:false, Secrets:map[string]string(nil), VolumeContext:map[string]string{"consistencyGroup":"cg-szcwq", "csi.storage.k8s.io/ephemeral":"false", "csi.storage.k8s.io/pod.name":"snap-start-pod-wvkaqc", "csi.storage.k8s.io/pod.namespace":"default", "csi.storage.k8s.io/pod.uid":"45a8a72c-93ce-4820-ab44-513be80de3f6", "csi.storage.k8s.io/pv/name":"pvc-2b382989-28e8-4dd7-9c59-2b92998b226d", "csi.storage.k8s.io/pvc/name":"pvc-ygepb", "csi.storage.k8s.io/pvc/namespace":"default", "csi.storage.k8s.io/serviceAccount.name":"default", "gid":"99", "permissions":"755", "storage.kubernetes.io/csiProvisionerIdentity":"1708623436132-54-spectrumscale.csi.ibm.com", "uid":"99", "version":"2", "volBackendFs":"remotetest"}, XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
I0227 10:57:01.002780       1 nodeserver.go:255] [cde0e37b-9b7c-456b-8f7a-aee54753d1cb] NodePublishVolume - the target directory [/var/lib/kubelet/pods/45a8a72c-93ce-4820-ab44-513be80de3f6/volumes/kubernetes.io~csi/pvc-2b382989-28e8-4dd7-9c59-2b92998b226d/mount] is created successfully
I0227 10:57:01.092590       1 nodeserver.go:286] [cde0e37b-9b7c-456b-8f7a-aee54753d1cb] NodePublishVolume - successfully mounted [/var/lib/kubelet/pods/45a8a72c-93ce-4820-ab44-513be80de3f6/volumes/kubernetes.io~csi/pvc-2b382989-28e8-4dd7-9c59-2b92998b226d/mount] using BINDMOUNT
I0227 10:58:41.743953       1 utils.go:79] [cde0e37b-9b7c-456b-8f7a-aee54753d1cb] Time taken to execute /csi.v1.Node/NodePublishVolume request(in milliseconds): 179644
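
The timestamps above show how the 179,644 ms total is spent: roughly 79 seconds pass between the GRPC call and target-directory creation, and another 100 seconds between the bind mount and the response. As a minimal sketch, assuming the same "Time taken to execute" log format shown above (the pod name is specific to this cluster), all slow NodePublishVolume calls can be listed with:

    # Print NodePublishVolume calls that took longer than 2 minutes;
    # the duration in milliseconds is the last field of the log line.
    oc logs ibm-spectrum-scale-csi-fkdd9 \
      | grep 'Time taken to execute /csi.v1.Node/NodePublishVolume' \
      | awk '$NF + 0 > 120000'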

Error logs:

 'message': 'MountVolume.SetUp failed for volume '
            '"pvc-2b382989-28e8-4dd7-9c59-2b92998b226d" : rpc error: code = '
            'DeadlineExceeded desc = context deadline exceeded',
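
The DeadlineExceeded error is raised on the kubelet side when the CSI RPC exceeds its timeout, so it also surfaces as a FailedMount event on the affected pod. A quick way to confirm, using the pod and namespace from the test run below:

    # Show the MountVolume.SetUp failure events recorded against the test pod.
    oc get events -n default --field-selector involvedObject.name=snap-start-pod-wvkaqc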

How to Reproduce?

Please list the steps to help development teams reproduce the behavior

  1. Install the CSI DCUT 2.11 images on Kubernetes as follows:

    [root@saurabh29-master ~]# oc get pods
    NAME                                                  READY   STATUS    RESTARTS      AGE
    ibm-spectrum-scale-csi-4dkqb                          3/3     Running   0             132m
    ibm-spectrum-scale-csi-attacher-78dc4fc459-b572p      1/1     Running   0             132m
    ibm-spectrum-scale-csi-attacher-78dc4fc459-xlp92      1/1     Running   0             132m
    ibm-spectrum-scale-csi-gdj68                          3/3     Running   0             132m
    ibm-spectrum-scale-csi-operator-879fcc947-wmqbs       1/1     Running   7 (4d ago)    5d12h
    ibm-spectrum-scale-csi-provisioner-5864cc55bb-bmblx   1/1     Running   0             132m
    ibm-spectrum-scale-csi-resizer-69446b6bc-8nq6b        1/1     Running   1 (87m ago)   132m
    ibm-spectrum-scale-csi-snapshotter-b844fd99d-vfzl7    1/1     Running   0             132m
    [root@saurabh29-master ~]# oc get cso
    NAME                     VERSION   SUCCESS
    ibm-spectrum-scale-csi   2.11.0    True
    [root@saurabh29-master ~]# oc describe pod | grep quay
    Image:         quay.io/ibm-spectrum-scale-dev/ibm-spectrum-scale-csi-driver@sha256:c110e2e0427f3799dc3a316db4748be5e8ca98bbaf50c7e3d8c7777c91c1375f
    Image:         quay.io/ibm-spectrum-scale-dev/ibm-spectrum-scale-csi-driver@sha256:c110e2e0427f3799dc3a316db4748be5e8ca98bbaf50c7e3d8c7777c91c1375f
    Image:         quay.io/ibm-spectrum-scale-dev/ibm-spectrum-scale-csi-operator@sha256:90c891b61e51be5ab689a595a4cd06919eacc659fc53967014c9e8d0eb4f7629
      CSI_DRIVER_IMAGE:      quay.io/ibm-spectrum-scale-dev/ibm-spectrum-scale-csi-driver@sha256:c110e2e0427f3799dc3a316db4748be5e8ca98bbaf50c7e3d8c7777c91c1375f
  2. Run the automation. The issue appeared when the following test case failed with a timeout error while mounting the pod; its logs are below, and a sketch of the parallel invocation follows the log.

    [ INFO]  2024-02-27 02:53:50 (manager.py:508)  creating test meta for tests/test_csi/remotecluster_volume_snapshot.py::test_driver_volume_snapshot_Version2_5_to_LW_5_shallow_copy
    [ INFO]  2024-02-27 02:53:50 (conftest.py:572)  Using kubeconfig ~/.kube/config to set cluster config.
    [ INFO]  2024-02-27 02:53:50 (namespace.py:152)  Namespace Check  : ibm-spectrum-scale-csi does not exist
    [ INFO]  2024-02-27 02:53:50 (namespace.py:149)  Namespace Check  : ibm-spectrum-scale-csi-driver exists
    [ INFO]  2024-02-27 02:53:50 (csiscaleoperator.py:146)  SpectrumScale CSI operator custom object exists
    [ INFO]  2024-02-27 02:53:52 (utils.py:1094)  Spectrum Scale GUI REST Api request is successful
    [ INFO]  2024-02-27 02:53:52 (utils.py:1095)  Guihost is 10.11.75.12 and User is csiadmin
    [ INFO]  2024-02-27 02:53:52 (namespace.py:149)  Namespace Check  : ibm-spectrum-scale-csi-driver exists
    [ INFO]  2024-02-27 02:54:00 (utils.py:1094)  Spectrum Scale GUI REST Api request is successful
    [ INFO]  2024-02-27 02:54:00 (utils.py:1095)  Guihost is 10.11.57.54 and User is remoteuser
    [ INFO]  2024-02-27 02:54:00 (dir.py:73)  Directory Check : Checking for directory LW
    [ INFO]  2024-02-27 02:54:04 (dir.py:29)  Directory LW is already created
    -------------------------------Captured log call-------------------------------- 
    [ INFO]  2024-02-27 02:54:06 (api_object.py:171)  scale version is 5.1.9-1
    [ INFO]  2024-02-27 02:54:06 (scale_volume_snapshot.py:62)  ----------------------------------------------------------------------------------------------------
    [ INFO]  2024-02-27 02:54:06 (storageclass.py:96)  creating storageclass "sc-qscmffzxk" with parameters "{'volBackendFs': 'remotetest', 'uid': '99', 'gid': '99', 'permissions': '755', 'version': '2', 'consistencyGroup': 'cg-szcwq'}"
    [ INFO]  2024-02-27 02:54:06 (utils.py:126)  waiting for condition: <Condition (name: sc-qscmffzxk to be ready, met: False)>
    [ INFO]  2024-02-27 02:54:06 (utils.py:134)  wait_for_condition max wait time is 180 seconds
    [ INFO]  2024-02-27 02:54:06 (utils.py:159)  wait_for_condition completed (total=0.005028724670410156s) <Condition (name: sc-qscmffzxk to be ready, met: True)>
    [ INFO]  2024-02-27 02:54:06 (persistentvolumeclaim.py:97)  Creating persistentvolumeclaim: pvc-ygepb in namespace default with parameters {'access_modes': 'ReadWriteMany', 'storage': '1Gi'}
    [ INFO]  2024-02-27 02:54:06 (utils.py:126)  waiting for condition: <Condition (name: pvc-ygepb to be ready, met: False)>
    [ INFO]  2024-02-27 02:54:06 (utils.py:134)  wait_for_condition max wait time is 180 seconds
    [ INFO]  2024-02-27 02:54:06 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 0/36
    [ INFO]  2024-02-27 02:54:11 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 1/36
    [ INFO]  2024-02-27 02:54:16 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 2/36
    [ INFO]  2024-02-27 02:54:21 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 3/36
    [ INFO]  2024-02-27 02:54:26 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 4/36
    [ INFO]  2024-02-27 02:54:31 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 5/36
    [ INFO]  2024-02-27 02:54:36 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 6/36
    [ INFO]  2024-02-27 02:54:41 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 7/36
    [ INFO]  2024-02-27 02:54:46 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 8/36
    [ INFO]  2024-02-27 02:54:51 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 9/36
    [ INFO]  2024-02-27 02:54:56 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 10/36
    [ INFO]  2024-02-27 02:55:01 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 11/36
    [ INFO]  2024-02-27 02:55:06 (utils.py:159)  wait_for_condition completed (total=60.14217567443848s) <Condition (name: pvc-ygepb to be ready, met: True)>
    [ INFO]  2024-02-27 02:55:06 (persistentvolumeclaim.py:332)  PVC Check : pvc-ygepb is BOUND succesfully ('Time for PVC BOUND : 1minutes', '0seconds')
    [ INFO]  2024-02-27 02:55:19 (persistentvolumeclaim.py:352)  PVC Check : Fileset cg-szcwq has been created successfully for version=2 SC
    [ INFO]  2024-02-27 02:55:32 (persistentvolumeclaim.py:210)  PVC Check: Checking size for pvc pvc-ygepb
    [ INFO]  2024-02-27 02:55:36 (fileset.py:541)  PVC Check : Minimum quota expected = 1048576   Actual quota set = 1048576
    [ INFO]  2024-02-27 02:55:36 (fileset.py:572)  PVC Check : Actual maximun number of inodes 0 is greater than expected maximum inodes 0
    [ INFO]  2024-02-27 02:55:36 (persistentvolumeclaim.py:392)  PVC Check : Fileset pvc-2b382989-28e8-4dd7-9c59-2b92998b226d has been created successfully
    [ INFO]  2024-02-27 02:55:36 (pod.py:172)  creating pod "snap-start-pod-wvkaqc" using pvc "pvc-ygepb" with mount path "/usr/share/nginx/html/scale" and read only: False in namespace "default"
    [ INFO]  2024-02-27 02:55:36 (pod.py:175)  pod "snap-start-pod-wvkaqc" is using image "docker-na-public.artifactory.swg-devops.com/sys-spectrum-scale-team-test-environment-docker-local/nginx:1.22.0"
    [ INFO]  2024-02-27 02:55:36 (utils.py:126)  waiting for condition: <Condition (name: snap-start-pod-wvkaqc to be ready, met: False)>
    [ INFO]  2024-02-27 02:55:36 (utils.py:134)  wait_for_condition max wait time is 180 seconds
    [ INFO]  2024-02-27 02:55:36 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 0/36
    [ INFO]  2024-02-27 02:55:41 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 1/36
    [ INFO]  2024-02-27 02:55:46 (utils.py:154)  wait_for_condition is sleeping for 5 seconds... 2/36
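
As referenced in step 2, a hypothetical sketch of the parallel invocation: the harness looks pytest-based (conftest.py and manager.py appear in the logs), so with pytest-xdist a seven-worker run might look like the following; the entry point and options of the internal automation are assumptions:

    # Hypothetical: run the remote-cluster snapshot tests with 7 parallel workers
    # (assumes the suite supports pytest-xdist).
    pytest -n 7 tests/test_csi/remotecluster_volume_snapshot.py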

Expected behavior

NodePublishVolume execution time should be reduced, and pod mounting and deletion should take less time.
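
To verify an improvement, the measured durations can be summarized from the driver log; a sketch assuming the same log format and pod name as above:

    # Count, average, and maximum NodePublishVolume duration in milliseconds.
    oc logs ibm-spectrum-scale-csi-fkdd9 | awk '
      /Time taken to execute \/csi.v1.Node\/NodePublishVolume/ {n++; s+=$NF; if ($NF+0 > m) m=$NF}
      END {if (n) printf "count=%d avg=%.0f max=%d\n", n, s/n, m}'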

Data Collection and Debugging

/scale-csi/D.1104
all.20240227221333.1209137.tar  csisnap.tar.gz
deeghuge commented 7 months ago

@saurabhwani5 is this reproducible with the latest build?

saurabhwani5 commented 7 months ago

Closing, as I don't see this issue happening with the latest dev 2.11.0 build.