dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: PowerStore CSI driver NVMe TCP connectivity: attach volume succeeds, mount fails with error: csi_sock: connect: connection refused #496

Closed. LeoHu1985 closed this issue 2 years ago.

LeoHu1985 commented 2 years ago

How can the Team help you today?

PowerStore CSI driver with NVMe TCP connectivity: volume attach succeeds, but the mount fails with this error:

Warning FailedMount 8s (x8 over 72s) kubelet MountVolume.MountDevice failed for volume "csivol-a3491527fc" : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi-powerstore.dellemc.com/csi_sock: connect: connection refused"

Details:

- SLES 15 SP3 VMs running on ESXi 7.0.3 (all NVMe TCP enabled, NVMe module installed)
- Kubernetes 1.24.6 (3-VM cluster)
- PowerStore CSI driver 2.4

The CSI driver installed successfully with NVMe TCP topology and registered the cluster nodes as NVMe hosts on the PowerStore with their NVMe TCP initiator NQNs. When PVCs and applications were created, the driver automatically created volumes on the PowerStore and mapped them to the NVMe hosts in the K8s cluster; the K8s hosts showed those NVMe volumes in nvme list and fdisk -l output, the PVCs were created and bound successfully, and the volumes attached to the pods successfully.

However, the application pods are stuck in the ContainerCreating phase because the volume mounts fail with this error:

Warning FailedMount 8s (x8 over 72s) kubelet MountVolume.MountDevice failed for volume "csivol-a3491527fc" : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/csi-powerstore.dellemc.com/csi_sock: connect: connection refused"
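
For what it's worth, my understanding is that "connection refused" on csi_sock means kubelet cannot reach the CSI node plugin's gRPC socket, i.e. the problem is with the node plugin container itself rather than the NVMe data path. A quick way to confirm (assuming the default csi-powerstore namespace):

# Socket path taken from the error message above
ls -l /var/lib/kubelet/plugins/csi-powerstore.dellemc.com/csi_sock
# The node plugin pods should be Running with all containers ready
kubectl get pods -n csi-powerstore -o wide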

I also tried manually mounting a test volume from the same PowerStore onto one of the K8s VMs over NVMe TCP; mkfs.ext4 and mount both completed successfully.
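
For reference, the manual test was roughly the following (the array IP, subsystem NQN, and device name below are placeholders):

nvme discover -t tcp -a <powerstore-ip> -s 4420     # list subsystems offered by the array
nvme connect -t tcp -a <powerstore-ip> -s 4420 -n <subsystem-nqn>
nvme list                                           # the new namespace showed up, e.g. /dev/nvme1n1
mkfs.ext4 /dev/nvme1n1
mount /dev/nvme1n1 /mnt/test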

Any advice to fix this would be appreciated.

prablr79 commented 2 years ago

@spriya-m can you check this?

LeoHu1985 commented 2 years ago

I can share environment access through Zoom if needed; the issue is very easy to recreate.

satyakonduri commented 2 years ago

Hi @LeoHu1985, we have verified this scenario in our environment and the application pods go to the Running state. Can you please share the driver node pod logs from the node where the application is scheduled? It will help us debug further. Thank you.

LeoHu1985 commented 2 years ago

Thanks @satyakonduri for your kind reply.

It's interesting: the CSI driver node pods went into a crash loop after deploying the application that uses the PowerStore storage class.

k8s-0-b-1:/home/lab/applications/yugabyte # kubectl get pods -n csi-powerstore
NAME                                     READY   STATUS             RESTARTS       AGE
powerstore-controller-868c59b4ff-fxhc8   7/7     Running            0              8d
powerstore-controller-868c59b4ff-pnldl   7/7     Running            0              8d
powerstore-node-dfnnf                    1/2     CrashLoopBackOff   8 (3m1s ago)   8d
powerstore-node-jk8l2                    1/2     CrashLoopBackOff   8 (3m6s ago)   8d
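
Since the pods are in CrashLoopBackOff, the logs of the crashed container instance can be pulled with --previous (the container name "driver" is my assumption; kubectl describe shows the actual names):

kubectl describe pod powerstore-node-dfnnf -n csi-powerstore
kubectl logs powerstore-node-dfnnf -n csi-powerstore -c driver --previous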

I have attached the pod log for CSI driver pod powerstore-node-dfnnf: csi-pod-log.txt

I have also attached the pod description for the application pod: app-pod-des.txt

satyakonduri commented 2 years ago

Hi @LeoHu1985,

Thank you for providing the logs. Can you please share the output of the nvme list-subsys -o json command from the nodes where you are seeing the crash? Thank you.

LeoHu1985 commented 2 years ago

@satyakonduri Here you go, output from all 3 nodes in the k8s cluster

k8s-0-b-1:/home/lab/applications/yugabyte # nvme list-subsys -o json
{
  "Subsystems" : [
    {
      "Name" : "nvme-subsys0",
      "NQN" : "nqn.2014-08.org.nvmexpress:uuid:526fee38-9395-d874-f1c9-1d1df9960db1",
      "Paths" : [
        { "Name" : "nvme0", "Transport" : "pcie", "Address" : "0000:13:00.0", "State" : "live" }
      ]
    },
    {
      "Name" : "nvme-subsys1",
      "NQN" : "nqn.1988-11.com.dell:powerstore:00:24eff00c8688DFE57BE0",
      "Paths" : [
        { "Name" : "nvme1", "Transport" : "tcp", "Address" : "traddr=x.x.x.x trsvcid=4420", "State" : "live" },
        { "Name" : "nvme2", "Transport" : "tcp", "Address" : "traddr=x.x.x.x trsvcid=4420", "State" : "live" }
      ]
    }
  ]
}

k8s-0-b-2:~ # nvme list-subsys -o json
{
  "Subsystems" : [
    {
      "Name" : "nvme-subsys0",
      "NQN" : "nqn.2014-08.org.nvmexpress:uuid:526d57eb-b599-669b-3de3-9a045b5aae98",
      "Paths" : [
        { "Name" : "nvme0", "Transport" : "pcie", "Address" : "0000:13:00.0", "State" : "live" }
      ]
    },
    {
      "Name" : "nvme-subsys1",
      "NQN" : "nqn.1988-11.com.dell:powerstore:00:24eff00c8688DFE57BE0",
      "Paths" : [
        { "Name" : "nvme1", "Transport" : "tcp", "Address" : "traddr=x.x.x.x trsvcid=4420", "State" : "live" },
        { "Name" : "nvme2", "Transport" : "tcp", "Address" : "traddr=x.x.x.x trsvcid=4420", "State" : "live" }
      ]
    }
  ]
}

k8s-0-b-3:~ # nvme list-subsys -o json
{
  "Subsystems" : [
    {
      "Name" : "nvme-subsys0",
      "NQN" : "nqn.2014-08.org.nvmexpress:uuid:5229a405-dfd5-775b-0c75-c2a46dda3624",
      "Paths" : [
        { "Name" : "nvme0", "Transport" : "pcie", "Address" : "0000:13:00.0", "State" : "live" }
      ]
    },
    {
      "Name" : "nvme-subsys1",
      "NQN" : "nqn.1988-11.com.dell:powerstore:00:24eff00c8688DFE57BE0",
      "Paths" : [
        { "Name" : "nvme1", "Transport" : "tcp", "Address" : "traddr=x.x.x.x trsvcid=4420", "State" : "live" },
        { "Name" : "nvme2", "Transport" : "tcp", "Address" : "traddr=x.x.x.x trsvcid=4420", "State" : "live" }
      ]
    }
  ]
}
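
A sanity check that may be worth running alongside this output, assuming the standard nvme-cli file locations: the initiator NQN the driver registered on the PowerStore should match what each node reports locally.

# nvme-cli convention: the host initiator identity lives in these files
cat /etc/nvme/hostnqn
cat /etc/nvme/hostid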

LeoHu1985 commented 2 years ago

Again, if you need to hop onto my live environment for a closer look, feel free to let me know; I can do a Zoom share.

gallacher commented 2 years ago

Hi @LeoHu1985. We'd be glad to work with you further on this. Have you joined our Slack group? If not, please do so here: https://app.smartsheet.com/b/form/e99b4d2da13e42518df4d3307c010f47. We can then work directly with you there to further troubleshoot this issue. Thanks.

LeoHu1985 commented 2 years ago

@gallacher thanks, I have just submitted the Slack group access request form. Looking forward to working with you directly.

gallacher commented 2 years ago

Hi @LeoHu1985, my apologies but we are experiencing a slight delay in processing our Slack access requests.

gallacher commented 2 years ago

@LeoHu1985, please send an email to karavi@dell.com to work with someone directly on this. Thanks!

spriya-m commented 2 years ago

PRs with fixes for this issue have been merged. Please use the csi-powerstore nightly image from Docker Hub.
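
A minimal sketch for pointing a running install at the nightly image (the daemonset and container names here are assumptions based on the pod names earlier in this thread; verify with kubectl get ds -n csi-powerstore):

# Image repo per Docker Hub (dellemc/csi-powerstore); the "nightly" tag is assumed
kubectl -n csi-powerstore set image daemonset/powerstore-node driver=dellemc/csi-powerstore:nightly
kubectl -n csi-powerstore rollout status daemonset/powerstore-node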