Azure / aksArc

# Welcome to the Azure Kubernetes Service enabled by Azure Arc (AKS Arc) repo

This is where the AKS Arc team will track features and issues with AKS Arc. We will monitor this repo in order to engage with our community and discuss questions, customer scenarios, or feature requests. Check out our projects tab to see the roadmap for AKS Arc!
MIT License

[BUG] After reboot AksHci can not reconnect to volume #186

Status: Closed (SantosVictorero closed this issue 2 years ago)

SantosVictorero commented 2 years ago

After a node reboot triggered by Windows Updates, the following error appears:

Cannot add 'X:\clusterstorage\volume1\xxx....vhdx'. The disk is already connected to the virtual machine 'xxx...'. (Virtual machine ID xxx...)]: Failed)

SantosVictorero commented 2 years ago

Also, the csi-akshcicsi-controller pod is crashing with a volume error:

PS X:\Deployments\SFS> kubectl describe pod csi-akshcicsi-controller-5c7d5b49-v82xg -n kube-system
Name:                 csi-akshcicsi-controller-5c7d5b49-v82xg
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 moc-la1klirontk/192.168.5.128
Start Time:           Thu, 12 May 2022 12:36:38 -0400
Labels:               app=csi-akshcicsi-controller
                      pod-template-hash=5c7d5b49
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   192.168.5.128
IPs:
  IP:           192.168.5.128
Controlled By:  ReplicaSet/csi-akshcicsi-controller-5c7d5b49
Containers:
  csi-provisioner:
    Container ID:  containerd://6a3740e4006d8186504218149ac78931f3c3a9716354074a95d615a42e0ab421
    Image:         mcr.microsoft.com/oss/kubernetes-csi/csi-provisioner:v2.2.2
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/csi-provisioner@sha256:9af3a83d8d6980c9bd774a7de0b5dd4fde39a49fa3e5cc6914316239c8d0ea57
    Port:
    Host Port:
    Args:
      --feature-gates=Topology=true
      --csi-address=$(ADDRESS)
      --v=5
      --timeout=15s
      --leader-election
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 12 May 2022 14:41:24 -0400
      Finished:     Thu, 12 May 2022 14:42:04 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 12 May 2022 14:35:45 -0400
      Finished:     Thu, 12 May 2022 14:35:57 -0400
    Ready:          False
    Restart Count:  24
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2m6v (ro)
  csi-attacher:
    Container ID:  containerd://695a2f425e1dcef1ec91b15cd5c16480bc214f39ed48122a64498c3c31cdd5c8
    Image:         mcr.microsoft.com/oss/kubernetes-csi/csi-attacher:v3.1.0
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/csi-attacher@sha256:d685e7294432c4308131d44781bb1e1f00ddd02d7a10f0d61caed5963ba7ae39
    Port:
    Host Port:
    Args:
      -v=5
      -csi-address=$(ADDRESS)
      -timeout=120s
      -leader-election
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 12 May 2022 14:38:22 -0400
      Finished:     Thu, 12 May 2022 14:39:17 -0400
    Ready:          False
    Restart Count:  26
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2m6v (ro)
  liveness-probe:
    Container ID:  containerd://4804341c7f93a624e69a842c3d22bace21b769a0b869fff50666d5e7613eb806
    Image:         mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.4.0
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/livenessprobe@sha256:3770688d4efa33f8f284cc2ef58d33efa4ff606147e0a8de20e67458e13fedc2
    Port:
    Host Port:
    Args:
      --csi-address=/csi/csi.sock
      --probe-timeout=3s
      --health-port=29602
      --v=5
    State:          Running
      Started:      Thu, 12 May 2022 12:37:15 -0400
    Ready:          True
    Restart Count:  0
    Environment:
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2m6v (ro)
  csi-resizer:
    Container ID:  containerd://fff4f9387a87d1194ef78bf7d3a1280f6a7cecccdf82a1f3a89af2293dcfec00
    Image:         mcr.microsoft.com/oss/kubernetes-csi/csi-resizer:v1.1.0
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/csi-resizer@sha256:7997e0f236bc291101433ccd43704cac27ad87d6bdbcf480487d0799514aedc1
    Port:
    Host Port:
    Args:
      -csi-address=$(ADDRESS)
      -v=5
      -leader-election
      -handle-volume-inuse-error=true
      -timeout=180s
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 12 May 2022 14:38:41 -0400
      Finished:     Thu, 12 May 2022 14:39:16 -0400
    Ready:          False
    Restart Count:  25
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2m6v (ro)
  akshcicsi:
    Container ID:  containerd://60325789af44ea745f2e31ae8fd0804254202c8ef49aa80af2194e65a196d08c
    Image:         ecpacr.azurecr.io/akshcicsi:v1.0.6
    Image ID:      ecpacr.azurecr.io/akshcicsi@sha256:621074034131465bc0702a7e3762c2303993fd071e37661934fdd5a17e00ca3f
    Ports:         29602/TCP, 29604/TCP
    Host Ports:    29602/TCP, 29604/TCP
    Args:
      --v=5
      --endpoint=$(CSI_ENDPOINT)
      --nodeid=$(KUBE_NODE_NAME)
      --alsologtostderr=true
      --metrics-address=0.0.0.0:29604
    State:          Running
      Started:      Thu, 12 May 2022 12:37:43 -0400
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:healthz/healthz delay=30s timeout=10s period=30s #success=1 #failure=5
    Environment:
      CSI_ENDPOINT:       unix:///csi/csi.sock
      WSSD_CONFIG_PATH:   /etc/azhci/cloudconfig/value
      WSSD_LOGIN_CONF:    /etc/azhci/wssdlogintoken/value
      WSSD_DEBUG_MODE:    <set to the key 'WSSD_DEBUG_MODE' of config map 'moc-config'>                Optional: false
      CLOUDAGENT_SERVER:  <set to the key 'AZURESTACKHCI_CLOUDAGENT_FQDN' of config map 'moc-config'>  Optional: false
    Mounts:
      /csi from socket-dir (rw)
      /etc/azhci/cloudconfig from cloudconfig (rw)
      /etc/azhci/wssdlogintoken from akshcicsiwssdlogintoken (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2m6v (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  cloudconfig:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  akshcicsiwssdlogintoken:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  akshcicsiwssdlogintoken
    Optional:    true
  kube-api-access-l2m6v:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly op=Exists
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                     From               Message
  ----     ------       ----                    ----               -------
  Normal   Scheduled    125m                    default-scheduler  Successfully assigned kube-system/csi-akshcicsi-controller-5c7d5b49-v82xg to moc-la1klirontk
  Warning  FailedMount  125m (x2 over 125m)     kubelet            MountVolume.SetUp failed for volume "kube-api-access-l2m6v" : failed to fetch token: serviceaccounts "csi-akshcicsi-controller-sa" is forbidden: User "system:node:moc-la1klirontk" cannot create resource "serviceaccounts/token" in API group "" in the namespace "kube-system": no relationship found between node 'moc-la1klirontk' and this object
  Normal   Created      125m                    kubelet            Created container csi-provisioner
  Normal   Started      125m                    kubelet            Started container csi-attacher
  Normal   Pulled       125m                    kubelet            Container image "mcr.microsoft.com/oss/kubernetes-csi/csi-attacher:v3.1.0" already present on machine
  Normal   Pulled       125m                    kubelet            Container image "mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.4.0" already present on machine
  Normal   Created      125m                    kubelet            Created container liveness-probe
  Normal   Started      125m                    kubelet            Started container liveness-probe
  Normal   Pulled       125m                    kubelet            Container image "ecpacr.azurecr.io/akshcicsi:v1.0.6" already present on machine
  Normal   Created      124m                    kubelet            Created container akshcicsi
  Normal   Started      124m                    kubelet            Started container akshcicsi
  Normal   Pulled       123m (x2 over 125m)     kubelet            Container image "mcr.microsoft.com/oss/kubernetes-csi/csi-resizer:v1.1.0" already present on machine
  Normal   Started      122m (x2 over 125m)     kubelet            Started container csi-resizer
  Warning  BackOff      113m (x3 over 122m)     kubelet            Back-off restarting failed container
  Warning  BackOff      85m (x80 over 122m)     kubelet            Back-off restarting failed container
  Normal   Pulled       70m (x13 over 125m)     kubelet            Container image "mcr.microsoft.com/oss/kubernetes-csi/csi-provisioner:v2.2.2" already present on machine
  Normal   Created      65m (x15 over 125m)     kubelet            Created container csi-attacher
  Normal   Created      40m (x17 over 125m)     kubelet            Created container csi-resizer
  Normal   Started      30m (x20 over 125m)     kubelet            Started container csi-provisioner
  Warning  BackOff      5m46s (x420 over 120m)  kubelet            Back-off restarting failed container
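
For anyone triaging a similar pod, a rough Python sketch like the one below can reduce the describe output to just the containers that are not ready. It assumes the pod is dumped as JSON (for example with `kubectl get pod csi-akshcicsi-controller-5c7d5b49-v82xg -n kube-system -o json`) and reads the standard `status.containerStatuses` fields of the Kubernetes Pod object. The function name `unhealthy_containers` and the `sample_pod` dict are hypothetical: the sample is hand-built to mirror the values reported above and is not real kubectl output.

```python
def unhealthy_containers(pod):
    """Return (name, state, reason, restart_count) for every container that is not ready.

    `pod` is a dict parsed from `kubectl get pod ... -o json`.
    """
    results = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        if cs.get("ready"):
            continue  # healthy container, nothing to report
        state = cs.get("state", {})
        # A container state carries a single key: "waiting", "running" or "terminated".
        kind, detail = next(iter(state.items()), ("unknown", {}))
        results.append(
            (cs["name"], kind, detail.get("reason", ""), cs.get("restartCount", 0))
        )
    return results


# Hypothetical sample, hand-built from the describe output in this issue.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "csi-provisioner", "ready": False, "restartCount": 24,
             "state": {"terminated": {"reason": "Error", "exitCode": 255}}},
            {"name": "csi-attacher", "ready": False, "restartCount": 26,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "liveness-probe", "ready": True, "restartCount": 0,
             "state": {"running": {}}},
            {"name": "csi-resizer", "ready": False, "restartCount": 25,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "akshcicsi", "ready": True, "restartCount": 0,
             "state": {"running": {}}},
        ]
    }
}

for name, kind, reason, restarts in unhealthy_containers(sample_pod):
    print(f"{name}: {kind} ({reason}), restarts={restarts}")
```

With real data, the dict would come from `json.load` on the saved kubectl output instead of `sample_pod`. Here the three sidecars (csi-provisioner, csi-attacher, csi-resizer) are the ones failing, while the akshcicsi driver container and the liveness probe stay up.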

SantosVictorero commented 2 years ago

I think I found the culprit of all my Aks Hci pod crash problems:

[image]

I will be replacing that RAID with high-speed SSDs.