dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: Embedded Unisphere endpoint failover fails #1433

Open Bvreela opened 4 weeks ago

Bvreela commented 4 weeks ago

Bug Description

Failover from the primary Unisphere endpoint to the backup endpoint does not happen automatically; a manual configuration change is required.

Logs

PVC log

Warning FailedAttachVolume 37s (x7 over 103s) attachdetach-controller AttachVolume.Attach failed for volume "pmax-54b40cd70e" : rpc error: code = Internal desc = failure checking volume (Array: 000120201026, Volume: 0011E)status Bad Gateway

Deployment log

Events:

Normal WaitForFirstConsumer 3m38s persistentvolume-controller waiting for first consumer to be created before binding
Warning ProvisioningFailed 87s csi-powermax.dellemc.com_powermax-controller-5ddd88946c-6wtll_fd6b1165-e955-49ca-bfa3-1795eb8a34ec failed to provision volume with StorageClass "powermax-nvmetcp": rpc error: code = Internal desc = Could not retrieve StoragePool SRP_1. Error(Bad Gateway)
Normal Provisioning 86s (x2 over 3m37s) csi-powermax.dellemc.com_powermax-controller-5ddd88946c-6wtll_fd6b1165-e955-49ca-bfa3-1795eb8a34ec External provisioner is provisioning volume for claim "default/norbi-csi-deployment"
Normal ExternalProvisioning 2s (x16 over 3m37s) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'csi-powermax.dellemc.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

The CSI driver keeps trying to connect to the primary endpoint:

powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Request ID: 15855 - Unisphere RESTAPI response time: 2m11.052621695s"
powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Request ID: 15855 - Total time: 2m11.052658975s"
powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Lock: https://10.179.46.144:8443-Read, Active(1/5), Queued(0/50)"
powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Request ID: 15857 - GET /univmax/restapi/100/sloprovisioning/symmetrix/000120201026/volume/0011E"
powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Lock: https://10.179.46.144:8443-Read, Active(2/5), Queued(0/50)"
powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Request ID: 15857 - Obtained Read lock"
powermax-controller-5b4bdc4cd7-j9k8j reverseproxy time="2024-08-20T16:39:18Z" level=info msg="Request ID: 15857 - Read Lock time: 14.311µs"
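The reverse proxy lines above include a single Unisphere REST call that took over two minutes, which is consistent with requests waiting on an unreachable endpoint. For anyone triaging a longer saved log, the following is a minimal sketch of how such slow requests could be pulled out. The function names and the 30-second default threshold are hypothetical, and the regular expression assumes the `Request ID: N - Unisphere RESTAPI response time: D` message layout seen in this report.

```python
import re

# Matches the reverse proxy message format seen in this report (an assumption
# about the log layout, not a documented contract).
_LINE = re.compile(
    r'msg="Request ID: (\d+) - Unisphere RESTAPI response time: ([^"]+)"'
)
# One component of a Go-style duration string such as "2m11.052621695s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(h|ms|µs|us|ns|m|s)")
_UNIT = {"h": 3600.0, "m": 60.0, "s": 1.0,
         "ms": 1e-3, "µs": 1e-6, "us": 1e-6, "ns": 1e-9}


def go_duration_seconds(text):
    """Convert a Go-style duration string to seconds."""
    return sum(float(value) * _UNIT[unit] for value, unit in _PART.findall(text))


def slow_requests(log_text, threshold_seconds=30.0):
    """Return (request_id, seconds) pairs at or above the threshold."""
    hits = []
    for request_id, duration in _LINE.findall(log_text):
        seconds = go_duration_seconds(duration)
        if seconds >= threshold_seconds:
            hits.append((int(request_id), seconds))
    return hits
```

Calling `slow_requests(open("reverseproxy.log").read())` on a saved log (the file name is a placeholder) lists the request IDs worth correlating with the `Bad Gateway` errors in the PVC events.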

  1. The switch to the backup endpoint did not happen automatically; we had to update the CSI driver configuration manually.

Primary: https://10.179.46.145:8443/ Backup: https://10.179.46.144:8443/

  2. Now we can create new pods without issues:

    Normal WaitForFirstConsumer 7m1s persistentvolume-controller waiting for first consumer to be created before binding
    Warning ProvisioningFailed 2m39s (x2 over 4m50s) csi-powermax.dellemc.com_powermax-controller-5ddd88946c-6wtll_fd6b1165-e955-49ca-bfa3-1795eb8a34ec failed to provision volume with StorageClass "powermax-nvmetcp": rpc error: code = Internal desc = Could not retrieve StoragePool SRP_1. Error(Bad Gateway)
    Normal Provisioning 2m37s (x3 over 7m) csi-powermax.dellemc.com_powermax-controller-5ddd88946c-6wtll_fd6b1165-e955-49ca-bfa3-1795eb8a34ec External provisioner is provisioning volume for claim "default/norbi-csi-deployment"
    Normal ExternalProvisioning 70s (x25 over 7m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'csi-powermax.dellemc.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
    Normal Provisioning 23s csi-powermax.dellemc.com_powermax-controller-6c5cc7f5d-b9cvk_8cf832fe-4ad6-419a-ac74-cacbbb3742b8 External provisioner is provisioning volume for claim "default/norbi-csi-deployment"
    Normal ProvisioningSucceeded 20s csi-powermax.dellemc.com_powermax-controller-6c5cc7f5d-b9cvk_8cf832fe-4ad6-419a-ac74-cacbbb3742b8 Successfully provisioned volume pmax-6d80446f29

The pod is running successfully:

k get pods

NAME                                    READY   STATUS    RESTARTS   AGE
norbi-csi-deployment-8685b684bc-p6blk   1/1     Running   0          7m52s

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

  1. Disable the switch port interface of the control station where the primary Unisphere is running.
  2. Confirm that the primary Unisphere is unreachable.
  3. Verify that existing pods are not impacted (successful).
  4. Attempt to provision new pods: this FAILS (logs attached).

Expected Behavior

Automatic failover to the configured backup endpoint.
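To make the expected behavior concrete, here is a minimal sketch of primary-then-backup endpoint selection. It is not the driver's or the reverse proxy's actual implementation; `probe_https` and `select_endpoint` are hypothetical names, and treating any HTTP status below 500 as "reachable" is an assumption chosen so that a `Bad Gateway` (502) counts as a failure.

```python
import ssl
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def probe_https(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with an HTTP status below 500.

    Assumption: a 4xx reply (for example 401 without credentials) still
    proves the server is up, while 5xx or no reply means it is not usable.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # illustration only; verify certificates in real use
    try:
        with urllib.request.urlopen(endpoint, timeout=timeout, context=ctx) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError, which is its parent class.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, TLS failure, unreachable network, etc.
        return False


def select_endpoint(
    endpoints: Iterable[str],
    probe: Callable[[str], bool] = probe_https,
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes the probe."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

With the primary listed first and the backup second, `select_endpoint([primary, backup])` returns the backup once the primary stops responding, which is the automatic switch this report expected. Passing a custom `probe` keeps the selection logic testable without network access.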

CSM Driver(s)

Release v2.11.0

Installation Type

Helm

Container Storage Modules Enabled

'CSM Resiliency Module' is also installed

Container Orchestrator

Canonical Kubernetes

Operating System

Ubuntu 22.04 LTS 5.15.0-101-generic

csmbot commented 4 weeks ago

@Bvreela: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

donatwork commented 4 weeks ago

Assigning to Surya for triage.