Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] ECK (with PVC) operator node rollout: Unable to attach or mount volumes, code = DeadlineExceeded desc = context deadline exceeded, ... #3705

Open ondrejhlavacek opened 1 year ago

ondrejhlavacek commented 1 year ago

Describe the bug When rotating nodes with the ECK operator (a StatefulSet with PVCs) during a node pool upgrade, a new pod on a new node gets stuck in the Pending state. The disk is attached to the node correctly, but nothing happens after that. More details are in the screenshots.

I have performed the same operation on multiple clusters (in the same region) without any issues. The only difference with this cluster is that it uses a Disk Encryption Set with a key in another tenant. However, other pods with PVCs on this affected cluster rotated correctly, so I don't suspect the encryption.

To Reproduce Not sure

Expected behavior The StatefulSet starts new pods on the new nodes and attaches the volumes.

Screenshots Overview of affected pod events

image

Full text of some of the logs

MountVolume.MountDevice failed for volume "pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Unable to attach or mount volumes: unmounted volumes=[elasticsearch-data], unattached volumes=[elastic-internal-scripts elastic-internal-elasticsearch-config-local elastic-internal-unicast-hosts elastic-internal-remote-certificate-authorities elasticsearch-data elasticsearch-logs elastic-internal-http-certificates elastic-internal-secure-settings elastic-internal-transport-certificates elastic-internal-elasticsearch-bin-local tmp-volume elastic-internal-elasticsearch-config elastic-internal-probe-user elastic-internal-xpack-file-realm downward-api elastic-internal-elasticsearch-plugins-local]: timed out waiting for the condition
AttachVolume.Attach failed for volume "pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/5534680a-d596-4541-9833-7638bb4c7c9b/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a

Node events

E0609 12:10:52.436830 1470 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/disk.csi.azure.com^/subscriptions/5534680a-d596-4541-9833-7638bb4c7c9b/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a podName: nodeName:}" failed. No retries permitted until 2023-06-09 12:11:24.436766315 +0000 UTC m=+2551.006957642 (durationBeforeRetry 32s). Error: MountVolume.MountDevice failed for volume "pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a" (UniqueName: "kubernetes.io/csi/disk.csi.azure.com^/subscriptions/5534680a-d596-4541-9833-7638bb4c7c9b/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a") pod "connection-elasticsearch-es-default-2" (UID: "ef8492d3-8539-41cc-9393-73aa55ab6fac") : rpc error: code = DeadlineExceeded desc = context deadline exceeded

CSI driver logs

I0609 11:29:14.158140       1 main.go:130] set up prometheus server on [::]:29605
I0609 11:29:14.167388       1 azuredisk.go:172] 
DRIVER INFORMATION:
-------------------
Build Date: "2023-05-12T06:02:15Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.26.4
Git Commit: 80f3017aa65b951388eee0e446ba5d798d031985
Go Version: go1.20.4
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:
I0609 11:29:14.167406       1 azuredisk.go:175] driver userAgent: disk.csi.azure.com/v1.26.4
I0609 11:29:14.372803       1 azure_disk_utils.go:162] reading cloud config from secret kube-system/azure-cloud-provider
I0609 11:29:14.542440       1 azure_disk_utils.go:169] InitializeCloudFromSecret: failed to get cloud config from secret kube-system/azure-cloud-provider: failed to get secret kube-system/azure-cloud-provider: secrets "azure-cloud-provider" is forbidden: User "system:serviceaccount:kube-system:csi-azuredisk-node-sa" cannot get resource "secrets" in API group "" in the namespace "kube-system"
I0609 11:29:14.542461       1 azure_disk_utils.go:174] could not read cloud config from secret kube-system/azure-cloud-provider
I0609 11:29:14.542469       1 azure_disk_utils.go:177] AZURE_CREDENTIAL_FILE env var set as /etc/kubernetes/azure.json
I0609 11:29:14.542494       1 azure_disk_utils.go:192] read cloud config from file: /etc/kubernetes/azure.json successfully
I0609 11:29:14.542972       1 azure_auth.go:253] Using AzurePublicCloud environment
I0609 11:29:14.543013       1 azure_auth.go:104] azure: using managed identity extension to retrieve access token
I0609 11:29:14.543023       1 azure_auth.go:110] azure: using User Assigned MSI ID to retrieve access token
I0609 11:29:14.543055       1 azure_auth.go:121] azure: User Assigned MSI ID is client ID
I0609 11:29:14.543081       1 azure.go:773] Azure cloudprovider using try backoff: retries=6, exponent=1.500000, duration=5, jitter=1.000000
I0609 11:29:14.590386       1 azure_interfaceclient.go:74] Azure InterfacesClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.590405       1 azure_interfaceclient.go:77] Azure InterfacesClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.590423       1 azure_vmsizeclient.go:68] Azure VirtualMachineSizesClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.590429       1 azure_vmsizeclient.go:71] Azure VirtualMachineSizesClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.608278       1 azure_storageaccountclient.go:70] Azure StorageAccountClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.608297       1 azure_storageaccountclient.go:73] Azure StorageAccountClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.649775       1 azure_diskclient.go:68] Azure DisksClient using API version: 2022-03-02
I0609 11:29:14.668116       1 azure_vmclient.go:70] Azure VirtualMachine client (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668135       1 azure_vmclient.go:73] Azure VirtualMachine client (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668147       1 azure_vmssclient.go:70] Azure VirtualMachineScaleSetClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668153       1 azure_vmssclient.go:73] Azure VirtualMachineScaleSetClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668164       1 azure_vmssvmclient.go:75] Azure vmssVM client (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668174       1 azure_vmssvmclient.go:78] Azure vmssVM client (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668185       1 azure_routeclient.go:69] Azure RoutesClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668191       1 azure_routeclient.go:72] Azure RoutesClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668216       1 azure_subnetclient.go:70] Azure SubnetsClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668225       1 azure_subnetclient.go:73] Azure SubnetsClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668235       1 azure_routetableclient.go:69] Azure RouteTablesClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668241       1 azure_routetableclient.go:72] Azure RouteTablesClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668257       1 azure_loadbalancerclient.go:70] Azure LoadBalancersClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668264       1 azure_loadbalancerclient.go:73] Azure LoadBalancersClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668290       1 azure_securitygroupclient.go:70] Azure SecurityGroupsClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668297       1 azure_securitygroupclient.go:73] Azure SecurityGroupsClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668313       1 azure_publicipclient.go:74] Azure PublicIPAddressesClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668324       1 azure_publicipclient.go:77] Azure PublicIPAddressesClient (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668338       1 azure_blobclient.go:73] Azure BlobClient using API version: 2021-09-01
I0609 11:29:14.668363       1 azure_vmasclient.go:70] Azure AvailabilitySetsClient (read ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.668374       1 azure_vmasclient.go:73] Azure AvailabilitySetsClient  (write ops) using rate limit config: QPS=10, bucket=100
I0609 11:29:14.819919       1 azure.go:1004] attach/detach disk operation rate limit QPS: 6.000000, Bucket: 10
I0609 11:29:14.819952       1 skus.go:121] NewNodeInfo: Starting to populate node and disk sku information.
I0609 11:29:15.316932       1 mount_linux.go:284] Detected umount with safe 'not mounted' behavior
I0609 11:29:15.316991       1 driver.go:81] Enabling controller service capability: CREATE_DELETE_VOLUME
I0609 11:29:15.316999       1 driver.go:81] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0609 11:29:15.317004       1 driver.go:81] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0609 11:29:15.317009       1 driver.go:81] Enabling controller service capability: CLONE_VOLUME
I0609 11:29:15.317032       1 driver.go:81] Enabling controller service capability: EXPAND_VOLUME
I0609 11:29:15.317035       1 driver.go:81] Enabling controller service capability: SINGLE_NODE_MULTI_WRITER
I0609 11:29:15.317038       1 driver.go:100] Enabling volume access mode: SINGLE_NODE_WRITER
I0609 11:29:15.317041       1 driver.go:100] Enabling volume access mode: SINGLE_NODE_READER_ONLY
I0609 11:29:15.317043       1 driver.go:100] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER
I0609 11:29:15.317045       1 driver.go:100] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER
I0609 11:29:15.317047       1 driver.go:91] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0609 11:29:15.317050       1 driver.go:91] Enabling node service capability: EXPAND_VOLUME
I0609 11:29:15.317052       1 driver.go:91] Enabling node service capability: GET_VOLUME_STATS
I0609 11:29:15.317054       1 driver.go:91] Enabling node service capability: SINGLE_NODE_MULTI_WRITER
I0609 11:29:15.317191       1 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
I0609 11:29:15.711213       1 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo
I0609 11:29:15.711233       1 utils.go:78] GRPC request: {}
I0609 11:29:15.712855       1 utils.go:84] GRPC response: {"name":"disk.csi.azure.com","vendor_version":"v1.26.4"}
I0609 11:29:15.911067       1 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo
I0609 11:29:15.911086       1 utils.go:78] GRPC request: {}
I0609 11:29:15.911134       1 utils.go:84] GRPC response: {"name":"disk.csi.azure.com","vendor_version":"v1.26.4"}
I0609 11:29:16.581581       1 utils.go:77] GRPC call: /csi.v1.Node/NodeGetInfo
I0609 11:29:16.581598       1 utils.go:78] GRPC request: {}
I0609 11:29:16.581649       1 azure_zones.go:165] Availability zone is not enabled for the node, falling back to fault domain
I0609 11:29:16.581659       1 nodeserver.go:350] NodeGetInfo, nodeName: aks-eck-29741929-vmss000016, failureDomain: 0
I0609 11:29:16.581672       1 nodeserver.go:408] got a matching size in getMaxDataDiskCount, VM Size: STANDARD_D4AS_V5, MaxDataDiskCount: 8
I0609 11:29:16.581686       1 utils.go:84] GRPC response: {"accessible_topology":{"segments":{"topology.disk.csi.azure.com/zone":""}},"max_volumes_per_node":8,"node_id":"aks-eck-29741929-vmss000016"}
I0609 11:56:20.725828       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0609 11:56:20.725844       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/bd19a83de909b4e2cb0fdb6d877b4690d3d67f80d507d8142091e4b4d934b2ce/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadOnly","csi.storage.k8s.io/pv/name":"pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a","csi.storage.k8s.io/pvc/name":"elasticsearch-data-connection-elasticsearch-es-default-2","csi.storage.k8s.io/pvc/namespace":"connection-elasticsearch","kind":"Managed","networkAccessPolicy":"DenyAll","requestedsizegib":"256","storage.kubernetes.io/csiProvisionerIdentity":"1681298424929-8081-disk.csi.azure.com","storageaccounttype":"Premium_LRS"},"volume_id":"/subscriptions/5534680a-d596-4541-9833-7638bb4c7c9b/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a"}
E0609 11:58:20.733953       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 0. timed out waiting for the condition
I0609 11:58:21.242307       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0609 11:58:21.242320       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/bd19a83de909b4e2cb0fdb6d877b4690d3d67f80d507d8142091e4b4d934b2ce/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadOnly","csi.storage.k8s.io/pv/name":"pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a","csi.storage.k8s.io/pvc/name":"elasticsearch-data-connection-elasticsearch-es-default-2","csi.storage.k8s.io/pvc/namespace":"connection-elasticsearch","kind":"Managed","networkAccessPolicy":"DenyAll","requestedsizegib":"256","storage.kubernetes.io/csiProvisionerIdentity":"1681298424929-8081-disk.csi.azure.com","storageaccounttype":"Premium_LRS"},"volume_id":"/subscriptions/5534680a-d596-4541-9833-7638bb4c7c9b/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a"}
E0609 12:00:21.248943       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 0. timed out waiting for the condition
I0609 12:00:22.258908       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0609 12:00:22.258921       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/bd19a83de909b4e2cb0fdb6d877b4690d3d67f80d507d8142091e4b4d934b2ce/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadOnly","csi.storage.k8s.io/pv/name":"pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a","csi.storage.k8s.io/pvc/name":"elasticsearch-data-connection-elasticsearch-es-default-2","csi.storage.k8s.io/pvc/namespace":"connection-elasticsearch","kind":"Managed","networkAccessPolicy":"DenyAll","requestedsizegib":"256","storage.kubernetes.io/csiProvisionerIdentity":"1681298424929-8081-disk.csi.azure.com","storageaccounttype":"Premium_LRS"},"volume_id":"/subscriptions/5534680a-d596-4541-9833-7638bb4c7c9b/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a"}
E0609 12:02:22.266748       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 0. timed out waiting for the condition
...
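
The three NodeStageVolume attempts above each appear to fail almost exactly two minutes after the call starts. The following is a minimal sketch, not part of the driver, that measures this from the klog timestamps. The helper names `klog_time` and `stage_attempt_durations` are hypothetical, and the sample lines are the call and error lines copied from the log above (the long `GRPC request` lines are left out).

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, then HH:MM:SS.microseconds (no year is logged).
KLOG_RE = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s")


def klog_time(line):
    """Return the timestamp of a klog line as a datetime (year defaults to 1900)."""
    m = KLOG_RE.match(line)
    if m is None:
        raise ValueError("not a klog line: " + line[:40])
    return datetime.strptime(m.group(2) + " " + m.group(3), "%m%d %H:%M:%S.%f")


def stage_attempt_durations(lines):
    """Pair each NodeStageVolume call with the next GRPC error; return seconds."""
    durations = []
    started = None
    for line in lines:
        if "GRPC call: /csi.v1.Node/NodeStageVolume" in line:
            started = klog_time(line)
        elif "GRPC error:" in line and started is not None:
            durations.append((klog_time(line) - started).total_seconds())
            started = None
    return durations


ERR = "GRPC error: rpc error: code = Internal desc = failed to find disk on lun 0. timed out waiting for the condition"
sample = [
    "I0609 11:56:20.725828       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume",
    "E0609 11:58:20.733953       1 utils.go:82] " + ERR,
    "I0609 11:58:21.242307       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume",
    "E0609 12:00:21.248943       1 utils.go:82] " + ERR,
    "I0609 12:00:22.258908       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume",
    "E0609 12:02:22.266748       1 utils.go:82] " + ERR,
]

durations = stage_attempt_durations(sample)
print(durations)
```

Each duration comes out at roughly 120 seconds, which suggests every attempt ran into the same timeout while waiting for the disk to show up on LUN 0 rather than failing for a different reason each time.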

Mounts in the CSI driver container (the PVC is not mounted)

/ # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/410/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/409/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/408/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/558/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/558/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)
/dev/sda1 on /csi type ext4 (rw,relatime,discard,errors=remount-ro)
devtmpfs on /dev type devtmpfs (rw,relatime,size=8188968k,nr_inodes=2047242,mode=755,inode64)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
/dev/sda1 on /etc/kubernetes type ext4 (rw,relatime,discard,errors=remount-ro)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,discard,errors=remount-ro)
/dev/sda1 on /dev/termination-log type ext4 (rw,relatime,discard,errors=remount-ro)
/dev/sda1 on /etc/hostname type ext4 (rw,relatime,discard,errors=remount-ro)
/dev/sda1 on /etc/resolv.conf type ext4 (rw,relatime,discard,errors=remount-ro)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k,inode64)
/dev/sda1 on /var/lib/kubelet type ext4 (rw,relatime,discard,errors=remount-ro)
tmpfs on /var/lib/kubelet/pods/a882e135-5ddb-4610-bbed-647a7cdfad82/volumes/kubernetes.io~secret/kbc-connection-elasticsearch-cert type tmpfs (rw,relatime,size=12892984k,inode64)
tmpfs on /var/lib/kubelet/pods/58c0f6ab-b0ff-4327-97e4-621f635d67d5/volumes/kubernetes.io~projected/azure-ip-masq-agent-config-volume type tmpfs (rw,relatime,size=256000k,inode64)
tmpfs on /var/lib/kubelet/pods/e7247e85-35e0-46b8-a850-af93b61d7a91/volumes/kubernetes.io~projected/kube-api-access-f6ftr type tmpfs (rw,relatime,size=12892984k,inode64)
tmpfs on /var/lib/kubelet/pods/a882e135-5ddb-4610-bbed-647a7cdfad82/volumes/kubernetes.io~projected/kube-api-access-p889c type tmpfs (rw,relatime,size=12892984k,inode64)
tmpfs on /var/lib/kubelet/pods/1832abdd-1195-47e7-a2af-db30e79bae65/volumes/kubernetes.io~projected/kube-api-access-dq9jd type tmpfs (rw,relatime,size=409600k,inode64)
tmpfs on /var/lib/kubelet/pods/03deb167-d134-4993-8101-a7b9f806bf96/volumes/kubernetes.io~projected/kube-api-access-29n7x type tmpfs (rw,relatime,size=2097152k,inode64)
tmpfs on /var/lib/kubelet/pods/58c0f6ab-b0ff-4327-97e4-621f635d67d5/volumes/kubernetes.io~projected/kube-api-access-h2wxg type tmpfs (rw,relatime,size=256000k,inode64)
tmpfs on /var/lib/kubelet/pods/e2ee181c-e560-4b2a-82f6-3561efac6aeb/volumes/kubernetes.io~projected/kube-api-access-9q9r6 type tmpfs (rw,relatime,size=524288k,inode64)
tmpfs on /var/lib/kubelet/pods/5e05c92b-5f00-4b4b-b497-7ed2df94d938/volumes/kubernetes.io~projected/kube-api-access-248kt type tmpfs (rw,relatime,size=614400k,inode64)
tmpfs on /var/lib/kubelet/pods/6443054e-b3ef-4652-8f12-fc1fc83d7af1/volumes/kubernetes.io~projected/kube-api-access-g559w type tmpfs (rw,relatime,size=409600k,inode64)
sysfs on /sys/class/scsi_host type sysfs (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys/bus/scsi/devices type sysfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,size=409600k,inode64)
tmpfs on /var/lib/kubelet/pods/9cedbaa9-5ce9-461c-b13a-5fd6623ffe6f/volumes/kubernetes.io~projected/kube-api-access-cr86h type tmpfs (rw,relatime,size=12892984k,inode64)
/dev/sda1 on /var/lib/kubelet/pods/03deb167-d134-4993-8101-a7b9f806bf96/volume-subpaths/config/fluentd/2 type ext4 (rw,relatime,discard,errors=remount-ro)
/dev/sda1 on /var/lib/kubelet/pods/a882e135-5ddb-4610-bbed-647a7cdfad82/volume-subpaths/installinfo/agent/0 type ext4 (rw,relatime,discard,errors=remount-ro)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-xpack-file-realm type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-transport-certificates type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-remote-certificate-authorities type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-secure-settings type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-probe-user type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-http-certificates type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~secret/elastic-internal-elasticsearch-config type tmpfs (rw,relatime,size=10485760k,inode64)
tmpfs on /var/lib/kubelet/pods/ef8492d3-8539-41cc-9393-73aa55ab6fac/volumes/kubernetes.io~downward-api/downward-api type tmpfs (rw,relatime,size=10485760k,inode64)
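
Since the driver reports that it cannot find a disk on LUN 0, and `/sys/bus/scsi/devices` is mounted in the container per the output above, one way to see which SCSI addresses the node exposes is to list that directory. This is only a sketch under the assumption that device entries there are named `host:channel:target:lun`; the helper name `list_scsi_addresses` is hypothetical and this is not how the driver itself is known to do the lookup.

```python
import os
import re

# SCSI device entries in sysfs are named host:channel:target:lun, e.g. "1:0:0:0".
SCSI_ADDR_RE = re.compile(r"^(\d+):(\d+):(\d+):(\d+)$")


def list_scsi_addresses(root="/sys/bus/scsi/devices"):
    """Return sorted (host, channel, target, lun) tuples found under root."""
    found = []
    for name in os.listdir(root):
        m = SCSI_ADDR_RE.match(name)
        if m:  # skip non-device entries such as "host0" or "target1:0:0"
            found.append(tuple(int(g) for g in m.groups()))
    return sorted(found)


if __name__ == "__main__":
    # Only meaningful on a Linux node; do nothing elsewhere.
    if os.path.isdir("/sys/bus/scsi/devices"):
        for host, channel, target, lun in list_scsi_addresses():
            print(f"host={host} channel={channel} target={target} lun={lun}")
```

Other disks (for example the OS disk behind `/dev/sda1` above) have their own addresses too, so a LUN number alone does not identify the data disk; the host part of the address matters as well.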

Volume is attached to the node

image

The node is in a failed state (if I drain the node to move the pod to a new node, the same happens to the new node as well)

image

Node status. The node stays in the Running state until the pod is scheduled onto the instance. It then goes to Updating and ends up in Failed. This has happened to 4 nodes so far with this pod/StatefulSet. If I do not schedule the pod on a node, the node stays in the Running state.

image

Environment (please complete the following information):

andyzhangx commented 1 year ago

I found the following error; it looks like there is something wrong with your DiskEncryptionSet and Key Vault:

I0609 08:34:48.621070       1 azure_armclient.go:291] Received error in WaitForAsyncOperationCompletion: 'Code="StorageFailure/KeyVaultAccessTokenCannotBeAcquired" Message="Error while creating storage object https://xxx.z43.blob.storage.azure.net/bzrwqstxnzbf/abcd  Target: '/subscriptions/xxx/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a'." Target="36"'

E0609 08:34:48.621262       1 controllerserver.go:440] Attach volume /subscriptions/xxx/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a to instance aks-eck-29741929-vmss000010 failed with Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Code="StorageFailure/KeyVaultAccessTokenCannotBeAcquired" Message="Error while creating storage object https://xxx.z43.blob.storage.azure.net/bzrwqstxnzbf/abcd  Target: '/subscriptions/xxx/resourceGroups/mc_cloud-keboola-slsp_kbc-aks-logivyp3ismgw_westeurope/providers/Microsoft.Compute/disks/pvc-74258416-9c83-48f7-a4e8-f7f5745dfa8a'." Target="36"
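
For anyone scanning controller logs for the same failure, a small sketch like the one below could pull the `Code="..."` value and the `Retriable` flag out of such a line. The helper names are hypothetical, and the sample line is the one above with the volume path and message shortened to `...` for readability.

```python
import re

CODE_RE = re.compile(r'Code="([^"]+)"')
RETRIABLE_RE = re.compile(r"Retriable: (true|false)")


def error_code(log_line):
    """Return the first Code="..." value in the line, or None if absent."""
    m = CODE_RE.search(log_line)
    return m.group(1) if m else None


def is_retriable(log_line):
    """Return True/False from the 'Retriable:' field, or None if absent."""
    m = RETRIABLE_RE.search(log_line)
    return None if m is None else m.group(1) == "true"


# Shortened copy of the attach failure above ("..." marks text cut for this example).
line = (
    "E0609 08:34:48.621262       1 controllerserver.go:440] Attach volume ... "
    "to instance aks-eck-29741929-vmss000010 failed with Retriable: true, "
    "RetryAfter: 0s, HTTPStatusCode: -1, "
    'RawError: Code="StorageFailure/KeyVaultAccessTokenCannotBeAcquired" Message="..."'
)

category, _, reason = error_code(line).partition("/")
print(category, reason, is_retriable(line))
```

The code splits into a category (`StorageFailure`) and a reason (`KeyVaultAccessTokenCannotBeAcquired`), and the line is marked as retriable.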
ondrejhlavacek commented 1 year ago

That is interesting, all PVCs in the cluster are using CMK (via Disk Encryption Set) and other PVCs were moved without any issue. Where did you find this error?

andyzhangx commented 1 year ago

I found the error in the logs of the Azure Disk CSI driver controller, which is managed by AKS.

ghost commented 1 year ago

Action required from @Azure/aks-pm

olsenme commented 1 year ago

Did the key by any chance get deleted?

microsoft-github-policy-service[bot] commented 5 months ago

Action required from @Azure/aks-pm

microsoft-github-policy-service[bot] commented 4 months ago

Issue needing attention of @Azure/aks-leads
