Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/
1.97k stars 308 forks source link

[BUG] configMap volumeMount -- failed to create containerd container: rootpath on mountPath ... The system cannot find the path specified. #3117

Closed mbelt closed 2 years ago

mbelt commented 2 years ago

Describe the bug After upgrading from 1.22.6 to 1.23.8 az aks upgrade ... --kubernetes-version 1.23.8 our Windows nodepool made the jump from dockerd to containerd.

Afterwards multiple containers fail to start, csi-drivers on windows nodepools are restarting with errors.

First symptom:

ContainerError while setting up mounted configMap.

...
volumeMounts:                                                                                                                                                                                         
 - mountPath: /config                                                                                                                                                                                  
   name: configs   
...
volumes:                                                                                                                                                                                                                                                                                                                                                                      
 - configMap:                                                                                                                                                                                            
     defaultMode: 420                                                                                                                                                                                    
     name: redacted.configs                                                                                                                                                                 
   name: configs   
Warning  Failed   kubelet    Error: failed to create containerd container: rootpath on mountPath C:\Windows\TEMP\ctd-volume3189089798/620, volume \config: readlink C:\Windows\TEMP\ctd-volume3189089798/620: The system cannot find the path specified.

On the node, the directory C:\Windows\TEMP\ctd-volumectd-volume3189089798/620 does exist, a symlink to \\?\Volume{guid}, but its contents could not be listed. image

Attempted to mitigate the issue by:

  1. Removing all specs in namespace and redeploying.
  2. Forcing pod to a different node.
  3. Restarting nodepool VMSS.

No success

Second Symptom

Errors and restarts of csi-azurefile-node-win-xxxx and csi-azuredisk-node-win-xxxx

Excerpts from csi-azuredisk-node-win:

azure_disk_utils.go:166] InitializeCloudFromSecret: failed to get cloud config from secret kube-system/azure-cloud-provider: failed to get secret kube-system/azure-cloud-provider:
azure_disk_utils.go:171] could not read cloud config from secret kube-system/azure-cloud-provider                                                                                  
azure_auth.go:113] azure: User Assigned MSI ID is client ID. Resource ID parsing error: %+vparsing failed for 8704bcea-5513-4b7c-99f2-dd8e266cb897. Invalid resource Id format     
safe_mounter_windows.go:410] failed to connect to csi-proxy v1 with error: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified., will try with v1Bet

Full log

I0803 23:29:04.578169   16040 main.go:128] set up prometheus server on :29605

I0803 23:29:04.829625   16040 azuredisk.go:168]
DRIVER INFORMATION:
-------------------
Build Date: "2022-06-10T08:56:40Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.19.0
Git Commit: 538a8cec600350de55e42ec14ebbd8733bb4b498                                                                                                                                                             Go Version: go1.18.1
Platform: windows/amd64
Topology Key: topology.disk.csi.azure.com/zone

Streaming logs below:                                                                                                                                                                                            
I0803 23:29:04.830161   16040 azuredisk.go:171] driver userAgent: disk.csi.azure.com/v1.19.0 gc/go1.18.1 (amd64-windows)                                                                                         
I0803 23:29:04.897621   16040 azure_disk_utils.go:159] reading cloud config from secret kube-system/azure-cloud-provider                                                                                         
I0803 23:29:05.011109   16040 azure_disk_utils.go:166] InitializeCloudFromSecret: failed to get cloud config from secret kube-system/azure-cloud-provider: failed to get secret kube-system/azure-cloud-provider:
I0803 23:29:05.011109   16040 azure_disk_utils.go:171] could not read cloud config from secret kube-system/azure-cloud-provider                                                                                  
I0803 23:29:05.011109   16040 azure_disk_utils.go:174] AZURE_CREDENTIAL_FILE env var set as C:\k\azure.json                                                                                                      
I0803 23:29:05.015493   16040 azure_disk_utils.go:189] read cloud config from file: C:\k\azure.json successfully                                                                                                 
I0803 23:29:05.016553   16040 azure_auth.go:245] Using AzurePublicCloud environment                                                                                                                              
I0803 23:29:05.017153   16040 azure_auth.go:96] azure: using managed identity extension to retrieve access token                                                                                                 
I0803 23:29:05.017153   16040 azure_auth.go:102] azure: using User Assigned MSI ID to retrieve access token                                                                                                      
I0803 23:29:05.017153   16040 azure_auth.go:113] azure: User Assigned MSI ID is client ID. Resource ID parsing error: %+vparsing failed for 8704bcea-5513-4b7c-99f2-dd8e266cb897. Invalid resource Id format     
I0803 23:29:05.018225   16040 azure_diskclient.go:68] Azure DisksClient using API version: 2021-04-01                                                                                                            
I0803 23:29:05.019870   16040 azure.go:993] attach/detach disk operation rate limit QPS: 6.000000, Bucket: 10                                                                                                    
I0803 23:29:05.020401   16040 safe_mounter_windows.go:410] failed to connect to csi-proxy v1 with error: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified., will try with v1Bet
I0803 23:29:05.048267   16040 safe_mounter_windows.go:415] using CSIProxyMounterV1beta, API Versions filesystem: v1beta1, disk: v1beta2, volume: v1beta2                                                         
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: CREATE_DELETE_VOLUME                                                                                                         
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME                                                                                                     
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: CREATE_DELETE_SNAPSHOT                                                                                                       
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: CLONE_VOLUME                                                                                                                 
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: EXPAND_VOLUME                                                                                                                
I0803 23:29:05.048436   16040 driver.go:81] Enabling controller service capability: SINGLE_NODE_MULTI_WRITER                                                                                                     
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_WRITER                                                                                                                     
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_READER_ONLY                                                                                                                
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_SINGLE_WRITER                                                                                                              
I0803 23:29:05.048436   16040 driver.go:100] Enabling volume access mode: SINGLE_NODE_MULTI_WRITER                                                                                                               
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: STAGE_UNSTAGE_VOLUME                                                                                                               
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: EXPAND_VOLUME                                                                                                                      
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: GET_VOLUME_STATS                                                                                                                   
I0803 23:29:05.048436   16040 driver.go:91] Enabling node service capability: SINGLE_NODE_MULTI_WRITER                                                                                                           
I0803 23:29:05.051847   16040 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"C:\\\\csi\\\\csi.sock", Net:"unix"}                                                                       
I0803 23:29:05.401222   16040 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo                                                                                                                             
I0803 23:29:05.401222   16040 utils.go:78] GRPC request: {}                                                                                                                                                      
I0803 23:29:05.405892   16040 utils.go:84] GRPC response: {"name":"disk.csi.azure.com","vendor_version":"v1.19.0"}                                                                                               
I0803 23:29:06.034441   16040 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo                                                                                                                             
I0803 23:29:06.034441   16040 utils.go:78] GRPC request: {}                                                                                                                                                      
I0803 23:29:06.034441   16040 utils.go:84] GRPC response: {"name":"disk.csi.azure.com","vendor_version":"v1.19.0"}                                                                                               
I0803 23:30:36.594569   16040 utils.go:77] GRPC call: /csi.v1.Identity/GetPluginInfo                                                                                                                             
I0803 23:30:36.594569   16040 utils.go:78] GRPC request: {}                                                                                            
I0803 23:30:37.430621   16040 nodeserver.go:409] got a matching size in getMaxDataDiskCount, VM Size: STANDARD_D4A_V4, MaxDataDiskCount: 8 
I0803 23:30:37.430621   16040 utils.go:84] GRPC response: {"accessible_topology":{"segments":{"topology.disk.csi.azure.com/zone":"eastus2-2","topology.kubernetes.io/zone":"eastus2-2"}},"max_volumes_per_node":8
Stream closed EOF for kube-system/csi-azuredisk-node-win-pcgjh (azuredisk)

The csi-azurefile-node-win pod has the same errors. The csi-azurexxxx-node pods on the linux VMSS also have the error about failing to parse the resource ID, but none of the other errors.

Third symptom

An pod from a different deployment is failing to start after the upgrade.

Warning  Failed          51m (x4 over 53m)     kubelet            Error: failed to start containerd task "REDACT": hcs::System::CreateProcess REDACT: The system cannot find the file specified.: unknown    
...
Created container REDACT
...
Warning  Failed     34m (x2 over 35m)     kubelet     Error: failed to start containerd task "REDACT": hcs::System::Start REDACT: The virtual machine or container exited unexpectedly.: unknown                                         

These errors all relate to storage driver on the node in some way, so I have grouped them in a single bug report.

Environment

andyzhangx commented 2 years ago

@mbelt

MageshSrinivasulu commented 2 years ago

I am also facing a similar issue after upgrading to 1.23.5 from 1.22.11. Where windows pods are not getting started and runs into an infinite crash loop with the below errors

MageshSrinivasulu commented 2 years ago

@andyzhangx I tried re-imaging the vmss instances as well. But it's of no use.

mbelt commented 2 years ago

@mbelt

  • about the resourceID format issue, right format is like this (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/(.+?)/(.+?)/(.+)
  • about the Third symptom: could vmss instance reimage mitigate this issue?

@andyzhangx

The resourceID of the node MSI is formatted (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/Microsoft.ManagedIdentity/userAssignedIdentities/(.+) but more to the point, I am not providing any information about this identity, the driver is getting it from IMDS and the MSI was working as intended before upgrading.

I have since created a fresh cluster in a new resource group in a different region and started out with version 1.23.5 and this same error message occurs.

andyzhangx commented 2 years ago

@AbelHu do you have any insight about windows container creation failure on windows node with below error:

Warning  Failed          51m (x4 over 53m)     kubelet            Error: failed to start containerd task "REDACT": hcs::System::CreateProcess REDACT: The system cannot find the file specified.: unknown    
...
Created container REDACT
...
Warning  Failed     34m (x2 over 35m)     kubelet     Error: failed to start containerd task "REDACT": hcs::System::Start REDACT: The virtual machine or container exited unexpectedly.: unknown
AbelHu commented 2 years ago

@andyzhangx it does not contain enough info. Suggest collecting full nodes logs for advanced investigation. You may find the clue in kubelet logs

andyzhangx commented 2 years ago

@mbelt if you are using AKS managed CSI driver, the identity CSI driver using is Control plane identity. Have you changed the /etc/kubernetes/azure.json file on every agent node? The right way is bring your own identity for the while control plane, follow guide here: https://docs.microsoft.com/en-us/azure/aks/use-managed-identity#summary-of-managed-identities

For the first and third issues, it's all related to windows container creation failure on Windows node, pls file an azure support ticket.

mbelt commented 2 years ago

@andyzhangx Ack on one and three.

For the second issue, no I haven't touched /etc/kubernetes/azure.json. My mistake claiming it was using node identity. This issue is occurring on a fresh cluster with no k8s specs deployed and no changes made to the nodes.

mbelt commented 2 years ago

Kubelet logs contain one additional piece of information.

On nodes where the failing Pod is placed I am seeing: A virtual machine or container with the specified identifier does not exist.\" name=HcsOpenComputeSystem

containerd.err-20220809T161122.479.log:20084:time="2022-08-09T16:10:18.544895100Z" level=warning msg="cleanup warnings time=\"2022-08-09T16:10:18Z\" level=error msg=Span duration=3.489ms endTime=\"2022-08-09 16:10:18.4924774 +0000 GMT m=+0.008302101\" error=\"A virtual machine or container with the specified identifier does not exist.\" name=HcsOpenComputeSystem parentSpanID=5a4e85ad2b823f08 spanID=d7514362e2be6193 startTime=\"2022-08-09 16:10:18.4889884 +0000 GMT
m=+0.004813101\" traceID=b04266fb0a7eb50b6ebad34de484f304\ntime=\"2022-08-09T16:10:18Z\" level=info msg=Span duration=30.4596ms endTime=\"2022-08-09 16:10:18.5189078 +0000 GMT m=+0.034732601\" name=delete parentSpanID=0000000000000000 spanID=5a4e85ad2b823f08 startTime=\"2022-08-09 16:10:18.4884482 +0000 GMT m=+0.004273001\" traceID=b04266fb0a7eb50b6ebad34de484f304\n"

containerd.err-20220809T161122.479.log:20180:time="2022-08-09T16:10:19.943501700Z" level=error msg="CreateContainer within sandbox \"667d46e20c142094683b151d9f9b9e4b5b14612bb481237a167f16afd1d51b5b\" for &ContainerMetadata{Name:qas,Attempt:0,} failed" error="failed to create containerd container: rootpath on mountPath C:\\Windows\\TEMP\\ctd-volume2544681623\\247, volume \\config: readlink C:\\Windows\\TEMP\\ctd-volume2544681623\\247: The system cannot find the path
specified."

There are ~5x as many instances of the second error than the first, but the first does not occur on nodes where this pod is not assigned.

MageshSrinivasulu commented 2 years ago

@mbelt I came across this where some changes need to be done on docker file while using containerd https://github.com/containerd/containerd/issues/6300#issuecomment-988048374. Try it if it's of any help.

mbelt commented 2 years ago

@MageshSrinivasulu Confirmed the root cause of the container failing to start was containerd #5067.

The other error messages persist, but they must be unrelated.