[BUG] configMap volumeMount -- failed to create containerd container: rootpath on mountPath ... The system cannot find the path specified.

mbelt commented 2 years ago

Describe the bug After upgrading from 1.22.6 to 1.23.8 az aks upgrade ... --kubernetes-version 1.23.8 our Windows nodepool made the jump from dockerd to containerd.

Afterwards multiple containers fail to start, csi-drivers on windows nodepools are restarting with errors.

First symptom:

ContainerError while setting up mounted configMap.

...
volumeMounts:                                                                                                                                                                                         
 - mountPath: /config                                                                                                                                                                                  
   name: configs   
...
volumes:                                                                                                                                                                                                                                                                                                                                                                      
 - configMap:                                                                                                                                                                                            
     defaultMode: 420                                                                                                                                                                                    
     name: redacted.configs                                                                                                                                                                 
   name: configs

Warning  Failed   kubelet    Error: failed to create containerd container: rootpath on mountPath C:\Windows\TEMP\ctd-volume3189089798/620, volume \config: readlink C:\Windows\TEMP\ctd-volume3189089798/620: The system cannot find the path specified.

On the node, the directory C:\Windows\TEMP\ctd-volumectd-volume3189089798/620 does exist, a symlink to \\?\Volume{guid}, but its contents could not be listed.

Attempted to mitigate the issue by:

Removing all specs in namespace and redeploying.
Forcing pod to a different node.
Restarting nodepool VMSS.

No success

Second Symptom

Errors and restarts of csi-azurefile-node-win-xxxx and csi-azuredisk-node-win-xxxx

Excerpts from csi-azuredisk-node-win:

azure_disk_utils.go:166] InitializeCloudFromSecret: failed to get cloud config from secret kube-system/azure-cloud-provider: failed to get secret kube-system/azure-cloud-provider:
azure_disk_utils.go:171] could not read cloud config from secret kube-system/azure-cloud-provider                                                                                  
azure_auth.go:113] azure: User Assigned MSI ID is client ID. Resource ID parsing error: %+vparsing failed for 8704bcea-5513-4b7c-99f2-dd8e266cb897. Invalid resource Id format     
safe_mounter_windows.go:410] failed to connect to csi-proxy v1 with error: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified., will try with v1Bet

Full log

I0803 23:29:04.578169   16040 main.go:128] set up prometheus server on :29605

I0803 23:29:04.829625   16040 azuredisk.go:168]
DRIVER INFORMATION:
-------------------
Build Date: "2022-06-10T08:56:40Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.19.0
Git Commit: 538a8cec600350de55e42ec14ebbd8733bb4b498                                                                                                                                                             Go Version: go1.18.1
Platform: windows/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:                                                                                                                                                                                            
I0803 23:29:04.830161   16040 azuredisk.go:171] driver userAgent: disk.csi.azure.com/v1.19.0 gc/go1.18.1 (amd64-windows)                                                                                         
I0803 23:29:04.897621   16040 azure_disk_utils.go:159] reading cloud config from secret kube-system/azure-cloud-provider                                                                                         
I0803 23:29:05.011109   16040 azure_disk_utils.go:166] InitializeCloudFromSecret: failed to get cloud config from secret kube-system/azure-cloud-provider: failed to get secret kube-system/azure-cloud-provider:
I0803 23:29:05.011109   16040 azure_disk_utils.go:171] could not read cloud config from secret kube-system/azure-cloud-provider                                                                                  
I0803 23:29:05.011109   16040 azure_disk_utils.go:174] AZURE_CREDENTIAL_FILE env var set as C:\k\azure.json                                                                                                      
I0803 23:29:05.015493   16040 azure_disk_utils.go:189] read cloud config from file: C:\k\azure.json successfully                                                                                                 
I0803 23:29:05.016553   16040 azure_auth.go:245] Using AzurePublicCloud environment                                                                                                                              
I0803 23:29:05.017153   16040 azure_auth.go:96] azure: using managed identity extension to retrieve access token                                                                                                 
I0803 23:29:05.017153   16040 azure_auth.go:102] azure: using User Assigned MSI ID to retrieve access token                                                                                                      
I0803 23:29:05.017153   16040 azure_auth.go:113] azure: User Assigned MSI ID is client ID. Resource ID parsing error: %+vparsing failed for 8704bcea-5513-4b7c-99f2-dd8e266cb897. Invalid resource Id format     
I0803 23:29:05.018225   16040 azure_diskclient.go:68] Azure DisksClient using API version: 2021-04-01                                                                                                            
I0803 23:29:05.019870   16040 azure.go:993] attach/detach disk operation rate limit QPS: 6.000000, Bucket: 10                                                                                                    
I0803 23:29:05.020401   16040 safe_mounter_windows.go:410] failed to connect to csi-proxy v1 with error: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified., will try with v1Bet
I0803 23:29:05.048267   16040 safe_mounter_windows.go:415] using CSIProxyMounterV1beta, API Versions filesystem: v1beta1, disk: v1beta2, volume: v1beta2                                                         
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: CREATE_DELETE_VOLUME                                                                                                         
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME                                                                                                     
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: CREATE_DELETE_SNAPSHOT                                                                                                       
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: CLONE_VOLUME                                                                                                                 
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: EXPAND_VOLUME                                                                                                                
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: SINGLE_NODE_MULTI_WRITER                                                                                                     
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_WRITER                                                                                                                     
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_READER_ONLY                                                                                                                
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER                                                                                                              
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER                                                                                                               
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: STAGE_UNSTAGE_VOLUME                                                                                                               
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: EXPAND_VOLUME                                                                                                                      
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: GET_VOLUME_STATS                                                                                                                   
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: SINGLE_NODE_MULTI_WRITER                                                                                                           
I0803 23:29:05.051847   16040 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"C:\\\\csi\\\\csi.sock", Net:"unix"}                                                                       
I0803 23:29:05.401222   16040 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo                                                                                                                             
I0803 23:29:05.401222   16040 utils.go:78] GRPC request: {}                                                                                                                                                      
I0803 23:29:05.405892   16040 utils.go:84] GRPC response: {"name":"disk.csi.azure.com","vendor_version":"v1.19.0"}                                                                                               
I0803 23:29:06.034441   16040 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo                                                                                                                             
I0803 23:29:06.034441   16040 utils.go:78] GRPC request: {}                                                                                                                                                      
I0803 23:29:06.034441   16040 utils.go:84] GRPC response: {"name":"disk.csi.azure.com","vendor_version":"v1.19.0"}                                                                                               
I0803 23:30:36.594569   16040 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo                                                                                                                             
I0803 23:30:36.594569   16040 utils.go:78] GRPC request: {}                                                                                            
I0803 23:30:37.430621   16040 nodeserver.go:409] got a matching size in getMaxDataDiskCount, VM Size: STANDARD_D4A_V4, MaxDataDiskCount: 8 
I0803 23:30:37.430621   16040 utils.go:84] GRPC response: {"accessible_topology":{"segments":{"topology.disk.csi.azure.com/zone":"eastus2-2","topology.kubernetes.io/zone":"eastus2-2"}},"max_volumes_per_node":8
Stream closed EOF for kube-system/csi-azuredisk-node-win-pcgjh (azuredisk)

The csi-azurefile-node-win pod has the same errors. The csi-azurexxxx-node pods on the linux VMSS also have the error about failing to parse the resource ID, but none of the other errors.

Third symptom

An pod from a different deployment is failing to start after the upgrade.

Warning  Failed          51m (x4 over 53m)     kubelet            Error: failed to start containerd task "REDACT": hcs::System::CreateProcess REDACT: The system cannot find the file specified.: unknown    
...
Created container REDACT
...
Warning  Failed     34m (x2 over 35m)     kubelet     Error: failed to start containerd task "REDACT": hcs::System::Start REDACT: The virtual machine or container exited unexpectedly.: unknown

These errors all relate to storage driver on the node in some way, so I have grouped them in a single bug report.

Environment

Kubernetes version 1.23.8
Windows VMSS SKU Standard_D4a_v4
RBAC Enabled
VMSS and APIServer using user-assigned MSI.

andyzhangx commented 2 years ago

@mbelt

about the resourceID format issue, right format is like this (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/(.+?)/(.+?)/(.+)
about the Third symptom: could vmss instance reimage mitigate this issue?

MageshSrinivasulu commented 2 years ago

I am also facing a similar issue after upgrading to 1.23.5 from 1.22.11. Where windows pods are not getting started and runs into an infinite crash loop with the below errors

MountVolume.SetUp failed for volume "configuration" : failed to sync configmap cache: timed out waiting for the condition
Error: failed to start containerd task "sm": hcs::System::Start sm: The virtual machine or container exited unexpectedly.: unknown
kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name file.csi.azure.com not found in the list of registered CSI drivers

MageshSrinivasulu commented 2 years ago

@andyzhangx I tried re-imaging the vmss instances as well. But it's of no use.

mbelt commented 2 years ago

@mbelt

about the resourceID format issue, right format is like this (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/(.+?)/(.+?)/(.+)

about the Third symptom: could vmss instance reimage mitigate this issue?

@andyzhangx

The resourceID of the node MSI is formatted (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/Microsoft.ManagedIdentity/userAssignedIdentities/(.+) but more to the point, I am not providing any information about this identity, the driver is getting it from IMDS and the MSI was working as intended before upgrading.

I have since created a fresh cluster in a new resource group in a different region and started out with version 1.23.5 and this same error message occurs.

andyzhangx commented 2 years ago

@AbelHu do you have any insight about windows container creation failure on windows node with below error:

Warning  Failed          51m (x4 over 53m)     kubelet            Error: failed to start containerd task "REDACT": hcs::System::CreateProcess REDACT: The system cannot find the file specified.: unknown    
...
Created container REDACT
...
Warning  Failed     34m (x2 over 35m)     kubelet     Error: failed to start containerd task "REDACT": hcs::System::Start REDACT: The virtual machine or container exited unexpectedly.: unknown

AbelHu commented 2 years ago

@andyzhangx it does not contain enough info. Suggest collecting full nodes logs for advanced investigation. You may find the clue in kubelet logs

andyzhangx commented 2 years ago

@mbelt if you are using AKS managed CSI driver, the identity CSI driver using is Control plane identity. Have you changed the /etc/kubernetes/azure.json file on every agent node? The right way is bring your own identity for the while control plane, follow guide here: https://docs.microsoft.com/en-us/azure/aks/use-managed-identity#summary-of-managed-identities

For the first and third issues, it's all related to windows container creation failure on Windows node, pls file an azure support ticket.

mbelt commented 2 years ago

@andyzhangx Ack on one and three.

For the second issue, no I haven't touched /etc/kubernetes/azure.json. My mistake claiming it was using node identity. This issue is occurring on a fresh cluster with no k8s specs deployed and no changes made to the nodes.

mbelt commented 2 years ago

Kubelet logs contain one additional piece of information.

On nodes where the failing Pod is placed I am seeing: A virtual machine or container with the specified identifier does not exist.\" name=HcsOpenComputeSystem

containerd.err-20220809T161122.479.log:20084:time="2022-08-09T16:10:18.544895100Z" level=warning msg="cleanup warnings time=\"2022-08-09T16:10:18Z\" level=error msg=Span duration=3.489ms endTime=\"2022-08-09 16:10:18.4924774 +0000 GMT m=+0.008302101\" error=\"A virtual machine or container with the specified identifier does not exist.\" name=HcsOpenComputeSystem parentSpanID=5a4e85ad2b823f08 spanID=d7514362e2be6193 startTime=\"2022-08-09 16:10:18.4889884 +0000 GMT
m=+0.004813101\" traceID=b04266fb0a7eb50b6ebad34de484f304\ntime=\"2022-08-09T16:10:18Z\" level=info msg=Span duration=30.4596ms endTime=\"2022-08-09 16:10:18.5189078 +0000 GMT m=+0.034732601\" name=delete parentSpanID=0000000000000000 spanID=5a4e85ad2b823f08 startTime=\"2022-08-09 16:10:18.4884482 +0000 GMT m=+0.004273001\" traceID=b04266fb0a7eb50b6ebad34de484f304\n"

containerd.err-20220809T161122.479.log:20180:time="2022-08-09T16:10:19.943501700Z" level=error msg="CreateContainer within sandbox \"667d46e20c142094683b151d9f9b9e4b5b14612bb481237a167f16afd1d51b5b\" for &ContainerMetadata{Name:qas,Attempt:0,} failed" error="failed to create containerd container: rootpath on mountPath C:\\Windows\\TEMP\\ctd-volume2544681623\\247, volume \\config: readlink C:\\Windows\\TEMP\\ctd-volume2544681623\\247: The system cannot find the path
specified."

There are ~5x as many instances of the second error than the first, but the first does not occur on nodes where this pod is not assigned.

MageshSrinivasulu commented 2 years ago

@mbelt I came across this where some changes need to be done on docker file while using containerd https://github.com/containerd/containerd/issues/6300#issuecomment-988048374. Try it if it's of any help.

mbelt commented 2 years ago

@MageshSrinivasulu Confirmed the root cause of the container failing to start was containerd #5067.

The other error messages persist, but they must be unrelated.

Azure / AKS