CSI: Attacher, snapshotter, provisioner, resizer pods restart multiple times on interim 515 builds

rkomandu commented 2 years ago

Describe the bug

Observed that the CSI pods restarted several times on a fresh deployment of the FQDN build (29th Aug) posted in the #scale-cnsa-builds channel.


Every 10.0s: oc get pods -n ibm-spectrum-scale-operator; oc get pods -n ibm-spectrum-scale ; oc get pods -n ib...  api.sieve.cp.fyre.ibm.com: Mon Aug 29 23:42:19 2022

NAME                                                     READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-controller-manager-79bc4ccc9b-csf94   1/1     Running   0          23m
NAME                               READY   STATUS    RESTARTS        AGE
ibm-spectrum-scale-gui-0           4/4     Running   1 (4m18s ago)   9m56s
ibm-spectrum-scale-gui-1           4/4     Running   0               2m51s
ibm-spectrum-scale-pmcollector-0   2/2     Running   0               9m26s
ibm-spectrum-scale-pmcollector-1   2/2     Running   0               7m53s
worker0                            2/2     Running   0               9m56s
worker1                            2/2     Running   0               9m56s
worker2                            2/2     Running   0               9m56s
NAME                                                  READY   STATUS             RESTARTS      AGE
ibm-spectrum-scale-csi-2sfd6                          2/3     CrashLoopBackOff   4 (7s ago)    115s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-5rh55       0/1     CrashLoopBackOff   4 (15s ago)   115s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-rht59       0/1     CrashLoopBackOff   4 (15s ago)   115s
ibm-spectrum-scale-csi-nnq52                          2/3     CrashLoopBackOff   4 (28s ago)   115s
ibm-spectrum-scale-csi-operator-658bbc79bf-gxgct      1/1     Running            0             23m
ibm-spectrum-scale-csi-provisioner-869fb676b5-tp6kr   0/1     CrashLoopBackOff   4 (15s ago)   115s
ibm-spectrum-scale-csi-resizer-6c89db96bc-n7fxk       0/1     CrashLoopBackOff   4 (15s ago)   115s
ibm-spectrum-scale-csi-snapshotter-76d564cb76-hk55v   0/1     CrashLoopBackOff   4 (15s ago)   115s
ibm-spectrum-scale-csi-w5w9f                          2/3     CrashLoopBackOff   4 (15s ago)   115s

oc describe pod  ibm-spectrum-scale-csi-2sfd6 -n ibm-spectrum-scale-csi

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m20s                default-scheduler  Successfully assigned ibm-spectrum-scale-csi/ibm-spectrum-scale-csi-2sfd6 to worker1.sieve.cp.fyre.ibm.com by master1.sieve.cp.fyre.ibm.com
  Normal   Pulled     2m19s                kubelet            Container image "cp.icr.io/cp/spectrum/scale/csi/livenessprobe@sha256:933940f13b3ea0abc62e656c1aa5c5b47c04b15d71250413a6b821bd0c58b94e" already present on machine
  Normal   Created    2m19s                kubelet            Created container liveness-probe
  Normal   Started    2m19s                kubelet            Started container liveness-probe
  Normal   Pulled     2m19s                kubelet            Container image "cp.icr.io/cp/spectrum/scale/csi/csi-node-driver-registrar@sha256:4fd21f36075b44d1a423dfb262ad79202ce54e95f5cbc4622a6c1c38ab287ad6" already present on machine
  Normal   Created    2m19s                kubelet            Created container driver-registrar
  Normal   Started    2m19s                kubelet            Started container driver-registrar
  Normal   Pulled     83s (x4 over 2m20s)  kubelet            Container image "cp.icr.io/cp/spectrum/scale/csi/ibm-spectrum-scale-csi-driver@sha256:ada8d41341298ddc2d1c6faa781d7788926d31ad27f605b2fe7ebac1f16884de" already present on machine
  Normal   Created    83s (x4 over 2m19s)  kubelet            Created container ibm-spectrum-scale-csi
  Normal   Started    83s (x4 over 2m19s)  kubelet            Started container ibm-spectrum-scale-csi
  Warning  BackOff    82s (x7 over 2m17s)  kubelet            Back-off restarting failed container

Every 10.0s: oc get pods -n ibm-spectrum-scale-operator; oc get pods -n ibm-spectrum-scale ; oc get pods -n ib...  api.sieve.cp.fyre.ibm.com: Mon Aug 29 23:44:02 2022

NAME                                                     READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-controller-manager-79bc4ccc9b-csf94   1/1     Running   0          24m
NAME                               READY   STATUS    RESTARTS     AGE
ibm-spectrum-scale-gui-0           4/4     Running   1 (6m ago)   11m
ibm-spectrum-scale-gui-1           4/4     Running   0            4m33s
ibm-spectrum-scale-pmcollector-0   2/2     Running   0            11m
ibm-spectrum-scale-pmcollector-1   2/2     Running   0            9m35s
worker0                            2/2     Running   0            11m
worker1                            2/2     Running   0            11m
worker2                            2/2     Running   0            11m
NAME                                                  READY   STATUS             RESTARTS        AGE
ibm-spectrum-scale-csi-2sfd6                          3/3     Running            5 (109s ago)    3m37s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-5rh55       0/1     CrashLoopBackOff   5 (57s ago)     3m37s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-rht59       0/1     CrashLoopBackOff   5 (57s ago)     3m37s
ibm-spectrum-scale-csi-nnq52                          3/3     Running            5 (2m10s ago)   3m37s
ibm-spectrum-scale-csi-operator-658bbc79bf-gxgct      1/1     Running            0               24m
ibm-spectrum-scale-csi-provisioner-869fb676b5-tp6kr   1/1     Running            6 (37s ago)     3m37s
ibm-spectrum-scale-csi-resizer-6c89db96bc-n7fxk       1/1     Running            6 (37s ago)     3m37s
ibm-spectrum-scale-csi-snapshotter-76d564cb76-hk55v   0/1     CrashLoopBackOff   5 (57s ago)     3m37s
ibm-spectrum-scale-csi-w5w9f                          3/3     Running            5 (117s ago)    3m37s

oc describe pod  ibm-spectrum-scale-csi-attacher-5cfdf6df9-5rh55 -n ibm-spectrum-scale-csi

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       4m3s                   default-scheduler  Successfully assigned ibm-spectrum-scale-csi/ibm-spectrum-scale-csi-attacher-5cfdf6df9-5rh55 to worker0.sieve.cp.fyre.ibm.com by master1.sieve.cp.fyre.ibm.com
  Normal   AddedInterface  4m1s                   multus             Add eth0 [10.254.20.23/22] from openshift-sdn
  Normal   Pulled          2m43s (x5 over 4m1s)   kubelet            Container image "cp.icr.io/cp/spectrum/scale/csi/csi-attacher@sha256:8b9c313c05f54fb04f8d430896f5f5904b6cb157df261501b29adc04d2b2dc7b" already present on machine
  Normal   Created         2m42s (x5 over 4m1s)   kubelet            Created container ibm-spectrum-scale-csi-attacher
  Normal   Started         2m42s (x5 over 4m1s)   kubelet            Started container ibm-spectrum-scale-csi-attacher
  Warning  Unhealthy       2m23s (x5 over 3m43s)  kubelet            Liveness probe failed: Get "http://10.254.20.23:8080/healthz/leader-election": dial tcp 10.254.20.23:8080: connect: connection refused
  Normal   Killing         2m23s (x5 over 3m43s)  kubelet            Container ibm-spectrum-scale-csi-attacher failed liveness probe, will be restarted
[root@api.sieve.cp.fyre.ibm.com FQDN-29Aug]#

Every 10.0s: oc get pods -n ibm-spectrum-scale-operator; oc get pods -n ibm-spectrum-scale ; oc get pods -n ib...  api.sieve.cp.fyre.ibm.com: Mon Aug 29 23:44:43 2022

NAME                                                     READY   STATUS    RESTARTS   AGE
ibm-spectrum-scale-controller-manager-79bc4ccc9b-csf94   1/1     Running   0          25m
NAME                               READY   STATUS    RESTARTS        AGE
ibm-spectrum-scale-gui-0           4/4     Running   1 (6m41s ago)   12m
ibm-spectrum-scale-gui-1           4/4     Running   0               5m14s
ibm-spectrum-scale-pmcollector-0   2/2     Running   0               11m
ibm-spectrum-scale-pmcollector-1   2/2     Running   0               10m
worker0                            2/2     Running   0               12m
worker1                            2/2     Running   0               12m
worker2                            2/2     Running   0               12m
NAME                                                  READY   STATUS    RESTARTS        AGE
ibm-spectrum-scale-csi-2sfd6                          3/3     Running   5 (2m30s ago)   4m18s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-5rh55       1/1     Running   6 (98s ago)     4m18s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-rht59       1/1     Running   6 (98s ago)     4m18s
ibm-spectrum-scale-csi-nnq52                          3/3     Running   5 (2m51s ago)   4m18s
ibm-spectrum-scale-csi-operator-658bbc79bf-gxgct      1/1     Running   0               25m
ibm-spectrum-scale-csi-provisioner-869fb676b5-tp6kr   1/1     Running   6 (78s ago)     4m18s
ibm-spectrum-scale-csi-resizer-6c89db96bc-n7fxk       1/1     Running   6 (78s ago)     4m18s
ibm-spectrum-scale-csi-snapshotter-76d564cb76-hk55v   1/1     Running   6 (98s ago)     4m18s
ibm-spectrum-scale-csi-w5w9f                          3/3     Running   5 (2m38s ago)   4m18s
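
To compare restart counts across snapshots like the ones above, the RESTARTS column can be pulled out of the plain `oc get pods` text. This is a minimal sketch; `restart_counts` is a hypothetical helper, and it assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE) with a single-word STATUS.

```python
def restart_counts(listing):
    """Map pod name to restart count from plain `oc get pods` output."""
    counts = {}
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and header rows.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        # RESTARTS is the 4th column; "5 (2m30s ago)" splits so that
        # the count itself is still the 4th whitespace-separated field.
        if fields[3].isdigit():
            counts[fields[0]] = int(fields[3])
    return counts


sample = """\
NAME                                                  READY   STATUS    RESTARTS        AGE
ibm-spectrum-scale-csi-2sfd6                          3/3     Running   5 (2m30s ago)   4m18s
ibm-spectrum-scale-csi-attacher-5cfdf6df9-5rh55       1/1     Running   6 (98s ago)     4m18s
ibm-spectrum-scale-csi-operator-658bbc79bf-gxgct      1/1     Running   0               25m
"""

for pod, restarts in restart_counts(sample).items():
    print(pod, restarts)
```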

How to Reproduce?

Install CNSA and CSI using the FQDN build posted in the #scale-cnsa-builds channel on 29th Aug 2022, and deploy:

https://ibm-systems-storage.slack.com/archives/C02KCKGH755/p1661827704750079

The scale v1beta1 CR file was taken from https://github.ibm.com/IBMSpectrumScale/ibm-spectrum-scale-container-native/blob/v5.1.5.0/generated/scale_v1beta1_cluster_cr.yaml

Output after applying:

[root@api.sieve.cp.fyre.ibm.com FQDN-29Aug]# oc apply -f ./scale_v1beta1_29Augbuild.yaml
namespace/ibm-spectrum-scale created
serviceaccount/ibm-spectrum-scale-core created
serviceaccount/ibm-spectrum-scale-default created
serviceaccount/ibm-spectrum-scale-gui created
serviceaccount/ibm-spectrum-scale-pmcollector created
role.rbac.authorization.k8s.io/ibm-spectrum-scale-sysmon created
rolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-privileged created
rolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-sysmon created
callhome.scale.spectrum.ibm.com/callhome created
cluster.scale.spectrum.ibm.com/ibm-spectrum-scale created
filesystem.scale.spectrum.ibm.com/remote-sample created
remotecluster.scale.spectrum.ibm.com/remotecluster-sample created

Expected behavior

The pods should not restart multiple times on a freshly created system. Is this expected? As I understand it, anyone who freshly deploys the scale beta file should hit this.

Data Collection and Debugging

Environmental output

What openshift/kubernetes version are you running, and the architecture?
kubectl get pods -o wide -n < csi driver namespace>
kubectl get nodes -o wide
CNSA/Spectrum Scale version
Remote Spectrum Scale version
Output for ./tools/spectrum-scale-driver-snap.sh -n < csi driver namespace> -v

Tool to collect the CSI snap:

./tools/spectrum-scale-driver-snap.sh -n < csi driver namespace>

deeghuge commented 2 years ago

@rkomandu Do you have must-gather logs somewhere? Or you can check the logs of one of the CSI pods, for example: oc logs ibm-spectrum-scale-csi-2sfd6 --previous -n ibm-spectrum-scale-csi

rkomandu commented 2 years ago

It shows the following. As you can see, CNSA didn't have any issue with restarts.

I0830 06:42:13.599431       1 main.go:48] Version Info: commit (cd7c8950650d7f25ba8c7c3e9ff0fac251cd050e)
I0830 06:42:13.599534       1 gpfs.go:118] gpfs GetScaleDriver
I0830 06:42:13.599541       1 gpfs.go:194] gpfs SetupScaleDriver. name: spectrumscale.csi.ibm.com, version: 2.7.0, nodeID: worker1.sieve.cp.fyre.ibm.com
I0830 06:42:13.599548       1 gpfs.go:236] gpfs PluginInitialize
I0830 06:42:13.599554       1 scale_config.go:94] scale_config LoadScaleConfigSettings
I0830 06:42:13.599724       1 scale_config.go:117] scale_config HandleSecrets
I0830 06:42:13.599788       1 gpfs.go:436] gpfs ValidateScaleConfigParameters.
I0830 06:42:13.599807       1 connectors.go:118] connector GetSpectrumScaleConnector
I0830 06:42:13.599812       1 rest_v2.go:135] rest_v2 NewSpectrumRestV2.
I0830 06:42:13.599820       1 rest_v2.go:158] Created Spectrum Scale connector without SSL mode for ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.sieve.cp.fyre.ibm.com
I0830 06:42:13.599824       1 rest_v2.go:173] rest_v2 GetClusterId
I0830 06:42:13.599835       1 http_utils.go:60] http_utils FormatURL. url: https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.sieve.cp.fyre.ibm.com:443/
I0830 06:42:13.599843       1 rest_v2.go:991] rest_v2 doHTTP. endpoint: https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.sieve.cp.fyre.ibm.com:443/scalemgmt/v2/cluster, method: GET, param: <nil>
I0830 06:42:13.599847       1 http_utils.go:74] http_utils HttpExecuteUserAuth. type: GET, url: https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.sieve.cp.fyre.ibm.com:443/scalemgmt/v2/cluster, user: CsiAdmin
I0830 06:42:13.633722       1 http_utils.go:44] http_utils UnmarshalResponse. response: &{0x584380 0xc0000a0c00 0x634fc0}
I0830 06:42:13.633970       1 rest_v2.go:43] rest_v2 isStatusOK. statusCode: 200
I0830 06:42:13.633985       1 rest_v2.go:237] rest_v2 GetFilesystemMountDetails. filesystemName: remote-sample
I0830 06:42:13.633990       1 rest_v2.go:991] rest_v2 doHTTP. endpoint: https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.sieve.cp.fyre.ibm.com:443/scalemgmt/v2/filesystems/remote-sample, method: GET, param: <nil>
I0830 06:42:13.633999       1 http_utils.go:74] http_utils HttpExecuteUserAuth. type: GET, url: https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.sieve.cp.fyre.ibm.com:443/scalemgmt/v2/filesystems/remote-sample, user: CsiAdmin
I0830 06:42:13.644583       1 http_utils.go:44] http_utils UnmarshalResponse. response: &{0x584380 0xc00013e240 0x634fc0}
I0830 06:42:13.644786       1 rest_v2.go:43] rest_v2 isStatusOK. statusCode: 400
E0830 06:42:13.644805       1 rest_v2.go:244] Unable to get filesystem details for remote-sample: Remote call completed with error [400 Bad Request]. Error in response [&{[] {400 Invalid value in filesystemName} {}}]
E0830 06:42:13.644815       1 gpfs.go:277] Error in getting filesystem details for remote-sample
E0830 06:42:13.644820       1 gpfs.go:201] Error in plugin initialization: Remote call completed with error [400 Bad Request]. Error in response [&{[] {400 Invalid value in filesystemName} {}}]
F0830 06:42:13.644824       1 main.go:77] Failed to initialize Scale CSI Driver: Remote call completed with error [400 Bad Request]. Error in response [&{[] {400 Invalid value in filesystemName} {}}]
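
When scanning a log like the one above, it can help to filter for the error (E) and fatal (F) lines only. This is a minimal sketch that assumes the klog-style line header seen in this log (severity letter, MMDD, time, PID, file:line]); `KLOG_RE` and `failures` are hypothetical names, not part of the driver.

```python
import re

# Header layout: severity letter, MMDD, time, PID, "file:line]", then the message.
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def failures(log_text):
    """Return (severity, source, message) for every E or F line."""
    found = []
    for line in log_text.splitlines():
        m = KLOG_RE.match(line.strip())
        if m and m.group("sev") in ("E", "F"):
            found.append((m.group("sev"), m.group("src"), m.group("msg")))
    return found


sample = """\
I0830 06:42:13.644786       1 rest_v2.go:43] rest_v2 isStatusOK. statusCode: 400
E0830 06:42:13.644815       1 gpfs.go:277] Error in getting filesystem details for remote-sample
F0830 06:42:13.644824       1 main.go:77] Failed to initialize Scale CSI Driver: Remote call completed with error [400 Bad Request]. Error in response [&{[] {400 Invalid value in filesystemName} {}}]
"""

for sev, src, msg in failures(sample):
    print(sev, src, msg)
```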

CR file snippet

apiVersion: scale.spectrum.ibm.com/v1beta1
kind: Filesystem
metadata:
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale
    app.kubernetes.io/name: cluster
  name: remote-sample
  namespace: ibm-spectrum-scale
spec:
  remote:
    cluster: remotecluster-sample
    fs: fs1
---
apiVersion: scale.spectrum.ibm.com/v1beta1
kind: RemoteCluster
metadata:
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale
    app.kubernetes.io/name: cluster
  name: remotecluster-sample
  namespace: ibm-spectrum-scale

deeghuge commented 2 years ago

Hi Ravi, the CSI driver was restarting because filesystem creation took time, and since the CSI driver was restarting, the sidecars also got restarted.
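
The startup race described in this comment (the driver exits while the filesystem is still being created, and the container is then restarted) is the kind of situation where retrying the lookup with a growing delay is one possible alternative to failing immediately. This is a minimal sketch of that idea in Python, not the driver's actual code (the driver is written in Go). `wait_for_filesystem`, `FilesystemNotReady`, and `fake_lookup` are hypothetical names used only for illustration.

```python
import time


class FilesystemNotReady(Exception):
    """Raised by the lookup while the filesystem is not yet available (hypothetical)."""


def wait_for_filesystem(lookup, name, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call lookup(name) until it succeeds, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return lookup(name)
        except FilesystemNotReady:
            # Give up after the last attempt and surface the error to the caller.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


# Stand-in lookup: fails twice, then reports the filesystem as available.
calls = {"n": 0}


def fake_lookup(name):
    calls["n"] += 1
    if calls["n"] < 3:
        raise FilesystemNotReady(f"400 Invalid value in filesystemName: {name}")
    return {"name": name, "mounted": True}


waits = []  # record the delays instead of actually sleeping
result = wait_for_filesystem(fake_lookup, "remote-sample", sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a real caller would leave the default `time.sleep`.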

rkomandu commented 2 years ago

@deeghuge, was this also observed in other clusters? If so, it is worth documenting this behavior, IMO.

Jainbrt commented 2 years ago

@deeghuge @rkomandu could you please help add the FQI labels?

rkomandu commented 2 years ago

Deployed the RC4 build on Fyre and haven't come across restarts like before for pods in the CSI namespace. Only the attacher was restarted, as shown below. Is that due to the liveness probe? If this is expected, then close this task.

NAME                                                  READY   STATUS    RESTARTS        AGE    IP              NODE                            NOMINATED NODE   READINESS GATES
ibm-spectrum-scale-csi-6c5nv                          3/3     Running   0               3m7s   10.17.63.249    worker2.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-attacher-5cfdf6df9-8ldtc       1/1     Running   1 (2m47s ago)   3m7s   10.254.21.162   worker0.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-attacher-5cfdf6df9-nhwq5       1/1     Running   0               3m7s   10.254.13.148   worker1.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-bjvpk                          3/3     Running   0               3m7s   10.17.59.29     worker1.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-gl57f                          3/3     Running   0               3m7s   10.17.59.26     worker0.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-operator-5477865bf-7dq7c       1/1     Running   0               18m    10.254.21.157   worker0.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-provisioner-869fb676b5-vx4pk   1/1     Running   0               3m7s   10.254.13.150   worker1.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-resizer-6c89db96bc-p2hph       1/1     Running   0               3m7s   10.254.13.151   worker1.sieve.cp.fyre.ibm.com   <none>           <none>
ibm-spectrum-scale-csi-snapshotter-76d564cb76-vnhph   1/1     Running   0               3m7s   10.254.13.149   worker1.sieve.cp.fyre.ibm.com   <none>           <none>

deeghuge commented 7 months ago

With the latest changes, we are not seeing restarts like those described in this issue. Closing now; please reopen if you see the issue again.

IBM / ibm-spectrum-scale-csi