IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.
Apache License 2.0

Liveness probe failed: Get "http://172.30.254.5:8080/healthz/leader-election": context deadline exceeded (Client.Timeout exceeded while awaiting headers) #1209

Closed: corrtia closed this issue 2 months ago

corrtia commented 2 months ago

Describe the bug

The liveness probes of the attacher, snapshotter, provisioner, and resizer pods are failing.
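
For context, the state and restart counts of the pods in the driver namespace can be listed with a standard kubectl command (namespace as used throughout this report):

kubectl get pods -n ibm-spectrum-scale-csi-driver -o wide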

~# kubectl describe pod -n ibm-spectrum-scale-csi-driver ibm-spectrum-scale-csi-attacher-85c444fc7b-6xmrd 
Name:                 ibm-spectrum-scale-csi-attacher-85c444fc7b-6xmrd
Namespace:            ibm-spectrum-scale-csi-driver
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      ibm-spectrum-scale-csi-attacher
Node:                 nm129-gpu-258/10.200.100.94
Start Time:           Thu, 05 Sep 2024 13:43:30 +0800
Labels:               app=ibm-spectrum-scale-csi-attacher
                      app.kubernetes.io/instance=ibm-spectrum-scale-csi-operator
                      app.kubernetes.io/managed-by=ibm-spectrum-scale-csi-operator
                      app.kubernetes.io/name=ibm-spectrum-scale-csi-operator
                      pod-template-hash=85c444fc7b
                      product=ibm-spectrum-scale-csi
                      release=ibm-spectrum-scale-csi-operator
Annotations:          cni.projectcalico.org/containerID: 421758009bcbc4ffef300c9551a66fc223e4e986f622f0a9e3230d8929f9c015
                      cni.projectcalico.org/podIP: 172.30.112.6/32
                      cni.projectcalico.org/podIPs: 172.30.112.6/32
                      productID: ibm-spectrum-scale-csi-operator
                      productName: IBM Spectrum Scale CSI Operator
                      productVersion: 2.11.0
Status:               Running
IP:                   172.30.112.6
IPs:
  IP:           172.30.112.6
Controlled By:  ReplicaSet/ibm-spectrum-scale-csi-attacher-85c444fc7b
Containers:
  ibm-spectrum-scale-csi-attacher:
    Container ID:  containerd://3a23afdc9c9b808bf1f209c38d9bae08987d6432c9f06e133a114d197c9b43fd
    Image:         registry.k8s.io/sig-storage/csi-attacher@sha256:d69cc72025f7c40dae112ff989e920a3331583497c8dfb1600c5ae0e37184a29
    Image ID:      registry.k8s.io/sig-storage/csi-attacher@sha256:d69cc72025f7c40dae112ff989e920a3331583497c8dfb1600c5ae0e37184a29
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      --v=5
      --csi-address=$(ADDRESS)
      --resync=10m
      --timeout=2m
      --default-fstype=gpfs
      --leader-election=true
      --leader-election-lease-duration=$(LEADER_ELECTION_LEASE_DURATION)
      --leader-election-renew-deadline=$(LEADER_ELECTION_RENEW_DEADLINE)
      --leader-election-retry-period=$(LEADER_ELECTION_RETRY_PERIOD)
      --http-endpoint=:8080
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 05 Sep 2024 14:17:21 +0800
      Finished:     Thu, 05 Sep 2024 14:18:00 +0800
    Ready:          False
    Restart Count:  15
    Limits:
      cpu:                300m
      ephemeral-storage:  5Gi
      memory:             800Mi
    Requests:
      cpu:                20m
      ephemeral-storage:  1Gi
      memory:             20Mi
    Liveness:             http-get http://:http-endpoint/healthz/leader-election delay=10s timeout=10s period=20s #success=1 #failure=1
    Environment:
      ADDRESS:                         /var/lib/kubelet/plugins/spectrumscale.csi.ibm.com/csi.sock
      LEADER_ELECTION_LEASE_DURATION:  137s
      LEADER_ELECTION_RENEW_DEADLINE:  107s
      LEADER_ELECTION_RETRY_PERIOD:    26s
    Mounts:
      /var/lib/kubelet/plugins/spectrumscale.csi.ibm.com from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xwv8d (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  socket-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/spectrumscale.csi.ibm.com
    HostPathType:  DirectoryOrCreate
  kube-api-access-xwv8d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              scale=true
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                           Age                   From               Message
  ----     ------                           ----                  ----               -------
  Normal   Scheduled                        36m                   default-scheduler  Successfully assigned ibm-spectrum-scale-csi-driver/ibm-spectrum-scale-csi-attacher-85c444fc7b-6xmrd to nm129-gpu-258
  Normal   Started                          35m (x3 over 36m)     kubelet            Started container ibm-spectrum-scale-csi-attacher
  Normal   Pulled                           35m (x4 over 36m)     kubelet            Container image "registry.k8s.io/sig-storage/csi-attacher@sha256:d69cc72025f7c40dae112ff989e920a3331583497c8dfb1600c5ae0e37184a29" already present on machine
  Normal   Created                          35m (x4 over 36m)     kubelet            Created container ibm-spectrum-scale-csi-attacher
  Normal   Killing                          35m (x3 over 36m)     kubelet            Container ibm-spectrum-scale-csi-attacher failed liveness probe, will be restarted
  Warning  Unhealthy                        21m (x10 over 36m)    kubelet            Liveness probe failed: Get "http://172.30.112.6:8080/healthz/leader-election": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  FailedToRetrieveImagePullSecret  109s (x158 over 37m)  kubelet            Unable to retrieve some image pull secrets (ibm-spectrum-scale-csi-registrykey, ibm-entitlement-key); attempting to pull the image may not succeed.
kubectl logs -f  -n ibm-spectrum-scale-csi-driver ibm-spectrum-scale-csi-attacher-85c444fc7b-6xmrd 
I0905 06:17:21.349535       1 main.go:97] Version: v4.5.0
I0905 06:17:21.440016       1 connection.go:215] Connecting to unix:///var/lib/kubelet/plugins/spectrumscale.csi.ibm.com/csi.sock
I0905 06:17:21.441726       1 common.go:138] Probing CSI driver for readiness
I0905 06:17:21.441761       1 connection.go:244] GRPC call: /csi.v1.Identity/Probe
I0905 06:17:21.441772       1 connection.go:245] GRPC request: {}
I0905 06:17:21.469509       1 connection.go:251] GRPC response: {"ready":{"value":true}}
I0905 06:17:21.531898       1 connection.go:252] GRPC error: <nil>
I0905 06:17:21.531969       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
I0905 06:17:21.531982       1 connection.go:245] GRPC request: {}
I0905 06:17:21.532818       1 connection.go:251] GRPC response: {"name":"spectrumscale.csi.ibm.com","vendor_version":"2.11.0"}
I0905 06:17:21.532848       1 connection.go:252] GRPC error: <nil>
I0905 06:17:21.532872       1 main.go:154] CSI driver name: "spectrumscale.csi.ibm.com"
I0905 06:17:21.532917       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0905 06:17:21.532935       1 connection.go:245] GRPC request: {}
I0905 06:17:21.533032       1 main.go:180] ServeMux listening at ":8080"
I0905 06:17:21.534347       1 connection.go:251] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0905 06:17:21.534408       1 connection.go:252] GRPC error: <nil>
I0905 06:17:21.534429       1 connection.go:244] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0905 06:17:21.534445       1 connection.go:245] GRPC request: {}
I0905 06:17:21.535447       1 connection.go:251] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":9}}},{"Type":{"Rpc":{"type":7}}}]}
I0905 06:17:21.535475       1 connection.go:252] GRPC error: <nil>
I0905 06:17:21.535570       1 main.go:230] CSI driver supports ControllerPublishUnpublish, using real CSI handler
I0905 06:17:21.536337       1 leaderelection.go:250] attempting to acquire leader lease ibm-spectrum-scale-csi-driver/external-attacher-leader-spectrumscale-csi-ibm-com...
I0905 06:17:21.550446       1 leaderelection.go:354] lock is held by ibm-spectrum-scale-csi-attacher-744497cfff-zn5kx and has not yet expired
I0905 06:17:21.550490       1 leaderelection.go:255] failed to acquire lease ibm-spectrum-scale-csi-driver/external-attacher-leader-spectrumscale-csi-ibm-com
I0905 06:17:21.550534       1 leader_election.go:184] new leader detected, current leader: ibm-spectrum-scale-csi-attacher-744497cfff-zn5kx
I0905 06:17:52.986211       1 leaderelection.go:354] lock is held by ibm-spectrum-scale-csi-attacher-744497cfff-zn5kx and has not yet expired
I0905 06:17:52.986248       1 leaderelection.go:255] failed to acquire lease ibm-spectrum-scale-csi-driver/external-attacher-leader-spectrumscale-csi-ibm-com


Data Collection and Debugging

Environmental output

Tool to collect the CSI snap:

./tools/storage-scale-driver-snap.sh -n <csi driver namespace>


hemalathagajendran commented 2 months ago

@corrtia This is the default leader-election behaviour: acquiring the lease fails while another instance holds it and that lease has not yet expired. Note also that those are INFO-level log entries, not even warnings. Are you still facing any other issues because of this?
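
For anyone debugging similar messages, the current holder of the leader-election lock can be inspected directly. As a minimal sketch, assuming the attacher sidecar uses a coordination.k8s.io Lease object with the name shown in the log above:

kubectl get lease external-attacher-leader-spectrumscale-csi-ibm-com -n ibm-spectrum-scale-csi-driver -o yaml

The spec.holderIdentity field should match the pod named in the "lock is held by ..." log lines.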

corrtia commented 2 months ago

This issue was caused by a cluster CNI failure that kept the liveness probe failing: the CNI plugin I'm using can't access the IP address of the node itself. Sorry!
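
A quick way to confirm this kind of CNI failure is to run the probe request manually from the node hosting the pod, since the kubelet issues HTTP liveness probes from the node (pod IP, port, and path taken from the describe output above); if the CNI path is broken, the request times out just like the kubelet probe:

curl -m 10 http://172.30.112.6:8080/healthz/leader-election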