Huawei / eSDK_K8S_Plugin

Container-Storage-Interface (CSI) for Huawei storage
https://huawei.github.io/css-docs/
Apache License 2.0

Problem when a pod is restarting for accessing PVC with iSCSI backend #133

Open ccaillet1974 opened 1 year ago

ccaillet1974 commented 1 year ago

Hi all,

I updated a configuration on an Elasticsearch pod, and the pod is now restarting, but it is stalled in the Init:2 phase. The problem is due to a CSI mount failure.

The error logged in the huawei-csi-node log file on the affected worker node is:

2023-06-09 12:17:49.103282 3234490 [INFO]: [requestID:3981693433] Gonna run shell cmd "ls -l /dev/mapper/ | grep -w dm-2".
2023-06-09 12:17:49.110290 3234490 [INFO]: [requestID:3981693433] Shell cmd "ls -l /dev/mapper/ | grep -w dm-2" result:
lrwxrwxrwx 1 root root       7 Jun  9 13:51 36a8ffba1005d5c41024401da0000000e -> ../dm-2

2023-06-09 12:17:49.110490 3234490 [ERROR]: [requestID:3981693433] Can not get DMDevice by alias: dm-2
2023-06-09 12:17:49.110561 3234490 [ERROR]: [requestID:3981693433] Get DMDevice by alias:dm-2 failed. error: Can not get DMDevice by alias: dm-2
2023-06-09 12:17:49.110624 3234490 [ERROR]: [requestID:3981693433] check device: dm-2 is a partition device failed. error: Get DMDevice by alias:dm-2 failed. error: Can not get DMDevice by alias: dm-2
2023-06-09 12:17:49.110697 3234490 [ERROR]: [requestID:3981693433] Run task  of taskflow StageVolume error: check device: dm-2 is a partition device failed. error: Get DMDevice by alias:dm-2 failed. error: Can not get DMDevice by alias: dm-2
2023-06-09 12:17:49.110775 3234490 [ERROR]: [requestID:3981693433] Stage volume pvc-70952380-f9fe-42e2-b272-65ed270e8005 error: check device: dm-2 is a partition device failed. error: Get DMDevice by alias:dm-2 failed. error: Can not get DMDevice by alias: dm-2
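
What puzzles me is that the shell command itself finds the alias (36a8ffba1005d5c41024401da0000000e -> ../dm-2), yet the plugin still reports "Can not get DMDevice by alias: dm-2". To cross-check the device-mapper state on the node independently of the plugin, here is a minimal standalone Go sketch. It only relies on the standard device-mapper interfaces (the /dev/mapper symlinks and /sys/block/<dm>/dm/name); it is not the plugin's actual lookup code:

// dmcheck.go: cross-check how "dm-2" resolves, independent of the CSI plugin.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// aliasFromMapper walks /dev/mapper and returns the alias whose symlink
// points at the given dm device, i.e. what
// "ls -l /dev/mapper | grep -w dm-2" shows.
func aliasFromMapper(dm string) (string, error) {
	entries, err := os.ReadDir("/dev/mapper")
	if err != nil {
		return "", err
	}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join("/dev/mapper", e.Name()))
		if err != nil {
			continue // /dev/mapper/control is not a symlink
		}
		if filepath.Base(target) == dm {
			return e.Name(), nil
		}
	}
	return "", fmt.Errorf("no /dev/mapper alias resolves to %s", dm)
}

// nameFromSysfs reads the device-mapper name directly from sysfs, which is
// authoritative even when /dev/mapper holds stale entries.
func nameFromSysfs(dm string) (string, error) {
	b, err := os.ReadFile(filepath.Join("/sys/block", dm, "dm", "name"))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	alias, err := aliasFromMapper("dm-2")
	fmt.Println("alias via /dev/mapper:", alias, err)
	name, err := nameFromSysfs("dm-2")
	fmt.Println("name via sysfs:", name, err)
}

If the two outputs disagree, the /dev/mapper view on that node is stale, which might explain why the alias lookup fails even though the symlink shows up in the ls output above.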

Here is the backend definition:

storage: "oceanstor-san"
name: "oc6810-hdd-iscsi"
namespace: "kube-storage"
urls:
- "https://10.211.64.240:8088"
pools:
- "HDD001"
parameters:
  protocol: "iscsi"
  portals:
  - "10.144.69.5"
  - "10.144.69.6"
  - "10.144.69.7"
  - "10.144.69.8"
maxClientThreads: "30"
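
To rule out basic connectivity problems, here is a quick Go sketch that checks the node can open a TCP connection to each configured portal. Port 3260 is the iSCSI default and an assumption on my part, since the backend file does not specify a port:

// portalcheck.go: verify TCP reachability of the configured iSCSI portals.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	portals := []string{"10.144.69.5", "10.144.69.6", "10.144.69.7", "10.144.69.8"}
	for _, p := range portals {
		// 3260 is the iSCSI default port; adjust if the array uses another.
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(p, "3260"), 3*time.Second)
		if err != nil {
			fmt.Printf("%s: unreachable (%v)\n", p, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: ok\n", p)
	}
}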

And the associated StorageClass:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: oc6810-hdd-iscsi
provisioner: csi.huawei.com
parameters:
  backend: oc6810-hdd-iscsi
  allocType: thin
  volumeType: lun
  fsType: ext4
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
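
For reference, a claim against this class would look like the following client-go sketch (v0.27-era types, where the PVC spec still uses corev1.ResourceRequirements; the namespace, claim name, and size are placeholders, not values from my cluster):

// pvccheck.go: create a test PVC bound to the oc6810-hdd-iscsi class.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	sc := "oc6810-hdd-iscsi" // must match the StorageClass name above
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "es-data-test", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &sc,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"), // placeholder size
				},
			},
		},
	}
	if _, err := cs.CoreV1().PersistentVolumeClaims("default").Create(
		context.TODO(), pvc, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("created PVC es-data-test with StorageClass", sc)
}

Creating a fresh claim like this on the impacted node is a way to tell whether only the existing volume's staging is broken or all new iSCSI mounts fail there.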

Worker nodes are deployed with Debian 11.

The issue has been worked around by draining the impacted node.

I don't understand why it is stalled at this stage.

Any help would be appreciated

Regards