I am also facing a similar issue after upgrading from 1.22.11 to 1.23.5, where Windows pods are not starting and run into an infinite crash loop with the errors below.
@andyzhangx I tried re-imaging the VMSS instances as well, but it was of no use.
@mbelt
- about the resourceID format issue, the right format is like this: (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/(.+?)/(.+?)/(.+) (see the sketch below for what each group captures)
- about the Third symptom: could a vmss instance reimage mitigate this issue?
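For reference, the pattern above is a generic Azure resource ID matcher with five capture groups. A minimal Go sketch, using a made-up subscription ID and disk name purely as placeholders, shows what each group captures:

```go
package main

import (
	"fmt"
	"regexp"
)

// The pattern quoted above: case-insensitive, with capture groups for
// subscription, resource group, provider namespace, resource type and name.
var resourceIDRe = regexp.MustCompile(`(?i)subscriptions/(.+)/resourceGroups/(.+)/providers/(.+?)/(.+?)/(.+)`)

func main() {
	// Placeholder resource ID; the subscription and names are made up.
	id := "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-example/providers/Microsoft.Compute/disks/pvc-disk-example"

	m := resourceIDRe.FindStringSubmatch(id)
	if m == nil {
		fmt.Println("resource ID does not match the expected format")
		return
	}
	fmt.Println("subscription:  ", m[1])
	fmt.Println("resource group:", m[2])
	fmt.Println("provider:      ", m[3])
	fmt.Println("resource type: ", m[4])
	fmt.Println("resource name: ", m[5])
}
```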
@andyzhangx
The resourceID of the node MSI is formatted (?i)subscriptions/(.+)/resourceGroups/(.+)/providers/Microsoft.ManagedIdentity/userAssignedIdentities/(.+), but more to the point, I am not providing any information about this identity: the driver is getting it from IMDS, and the MSI was working as intended before upgrading.
I have since created a fresh cluster in a new resource group in a different region, started out with version 1.23.5, and this same error message occurs.
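Since the identity information comes from IMDS rather than from anything in the driver configuration, one way to see exactly what the node hands out is to query the instance metadata service directly. Below is a minimal Go sketch of such a call; the endpoint and api-version are the documented Azure IMDS ones, but this is only a diagnostic illustration, not the driver's actual code path.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// Minimal sketch: ask Azure IMDS for a managed-identity token from a pod or
// the node itself. Diagnostic illustration only, not the CSI driver's code.
func main() {
	req, err := http.NewRequest("GET",
		"http://169.254.169.254/metadata/identity/oauth2/token"+
			"?api-version=2018-02-01&resource=https://management.azure.com/", nil)
	if err != nil {
		panic(err)
	}
	// IMDS requires this header, and the request must not go through a proxy.
	req.Header.Set("Metadata", "true")

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // JSON with access_token, client_id, etc.
}
```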
@AbelHu do you have any insight into the Windows container creation failure on the Windows node with the error below:
Warning Failed 51m (x4 over 53m) kubelet Error: failed to start containerd task "REDACT": hcs::System::CreateProcess REDACT: The system cannot find the file specified.: unknown
...
Created container REDACT
...
Warning Failed 34m (x2 over 35m) kubelet Error: failed to start containerd task "REDACT": hcs::System::Start REDACT: The virtual machine or container exited unexpectedly.: unknown
@andyzhangx it does not contain enough info. I suggest collecting the full node logs for advanced investigation. You may find a clue in the kubelet logs.
@mbelt if you are using the AKS managed CSI driver, the identity the CSI driver uses is the control plane identity. Have you changed the /etc/kubernetes/azure.json file on every agent node? The right way is to bring your own identity for the whole control plane; follow the guide here: https://docs.microsoft.com/en-us/azure/aks/use-managed-identity#summary-of-managed-identities
For the first and third issues, they are both related to Windows container creation failure on the Windows node; please file an Azure support ticket.
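On the azure.json question above: a quick way to confirm which identity settings the agent nodes actually carry is to decode the identity-related fields of /etc/kubernetes/azure.json. The sketch below does only that; the field names are the usual cloud-provider-azure ones and should be treated as an assumption here, not as something confirmed in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Subset of the cloud provider config; field names are assumed from
// cloud-provider-azure and shown for illustration only.
type azureConfig struct {
	TenantID                    string `json:"tenantId"`
	SubscriptionID              string `json:"subscriptionId"`
	AADClientID                 string `json:"aadClientId"`
	UseManagedIdentityExtension bool   `json:"useManagedIdentityExtension"`
	UserAssignedIdentityID      string `json:"userAssignedIdentityID"`
}

func main() {
	data, err := os.ReadFile("/etc/kubernetes/azure.json")
	if err != nil {
		panic(err)
	}
	var cfg azureConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("useManagedIdentityExtension: %v\n", cfg.UseManagedIdentityExtension)
	fmt.Printf("userAssignedIdentityID:      %s\n", cfg.UserAssignedIdentityID)
	fmt.Printf("aadClientId:                 %s\n", cfg.AADClientID)
	fmt.Printf("subscriptionId:              %s\n", cfg.SubscriptionID)
}
```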
@andyzhangx Ack on one and three.
For the second issue, no, I haven't touched /etc/kubernetes/azure.json. My mistake claiming it was using the node identity. This issue is occurring on a fresh cluster with no k8s specs deployed and no changes made to the nodes.
The kubelet logs contain one additional piece of information.
On nodes where the failing pod is placed, I am seeing:
A virtual machine or container with the specified identifier does not exist.\" name=HcsOpenComputeSystem
containerd.err-20220809T161122.479.log:20084:time="2022-08-09T16:10:18.544895100Z" level=warning msg="cleanup warnings time=\"2022-08-09T16:10:18Z\" level=error msg=Span duration=3.489ms endTime=\"2022-08-09 16:10:18.4924774 +0000 GMT m=+0.008302101\" error=\"A virtual machine or container with the specified identifier does not exist.\" name=HcsOpenComputeSystem parentSpanID=5a4e85ad2b823f08 spanID=d7514362e2be6193 startTime=\"2022-08-09 16:10:18.4889884 +0000 GMT m=+0.004813101\" traceID=b04266fb0a7eb50b6ebad34de484f304\ntime=\"2022-08-09T16:10:18Z\" level=info msg=Span duration=30.4596ms endTime=\"2022-08-09 16:10:18.5189078 +0000 GMT m=+0.034732601\" name=delete parentSpanID=0000000000000000 spanID=5a4e85ad2b823f08 startTime=\"2022-08-09 16:10:18.4884482 +0000 GMT m=+0.004273001\" traceID=b04266fb0a7eb50b6ebad34de484f304\n"
containerd.err-20220809T161122.479.log:20180:time="2022-08-09T16:10:19.943501700Z" level=error msg="CreateContainer within sandbox \"667d46e20c142094683b151d9f9b9e4b5b14612bb481237a167f16afd1d51b5b\" for &ContainerMetadata{Name:qas,Attempt:0,} failed" error="failed to create containerd container: rootpath on mountPath C:\\Windows\\TEMP\\ctd-volume2544681623\\247, volume \\config: readlink C:\\Windows\\TEMP\\ctd-volume2544681623\\247: The system cannot find the path specified."
There are roughly 5x as many instances of the second error as the first, but the first only occurs on nodes where this pod is assigned.
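To make that second error more concrete: the message suggests containerd is resolving the per-container temp volume directory (a junction pointing at a \\?\Volume{guid}\ target) and the resolution fails. A rough Go sketch of the same kind of check, run directly on the affected Windows node against the path from the log above, looks like this; it is only a diagnostic illustration, not containerd's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Diagnostic sketch: resolve the temp volume directory that containerd
// complained about and try to list it. Path taken from the log excerpt above.
func main() {
	dir := `C:\Windows\TEMP\ctd-volume2544681623\247`

	target, err := os.Readlink(dir)
	fmt.Printf("Readlink:     target=%q err=%v\n", target, err)

	resolved, err := filepath.EvalSymlinks(dir)
	fmt.Printf("EvalSymlinks: resolved=%q err=%v\n", resolved, err)

	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Printf("ReadDir failed: %v\n", err)
		return
	}
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}
```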
@mbelt I came across this, where some changes need to be made to the Dockerfile when using containerd: https://github.com/containerd/containerd/issues/6300#issuecomment-988048374. Try it and see if it helps.
@MageshSrinivasulu Confirmed the root cause of the container failing to start was containerd #5067.
The other error messages persist, but they must be unrelated.
Describe the bug
After upgrading from 1.22.6 to 1.23.8 with
az aks upgrade ... --kubernetes-version 1.23.8
our Windows nodepool made the jump from dockerd to containerd. Afterwards, multiple containers fail to start, and the CSI drivers on the Windows nodepools are restarting with errors.
First symptom:
ContainerError while setting up mounted configMap.
On the node, the directory C:\Windows\TEMP\ctd-volumectd-volume3189089798/620 does exist, a symlink to \\?\Volume{guid}, but its contents could not be listed. Attempted to mitigate the issue by:
No success
Second symptom
Errors and restarts of csi-azurefile-node-win-xxxx and csi-azuredisk-node-win-xxxx
Excerpts from csi-azuredisk-node-win:
Full log
The csi-azurefile-node-win pod has the same errors. The csi-azurexxxx-node pods on the Linux VMSS also have the error about failing to parse the resource ID, but none of the other errors.
Third symptom
A pod from a different deployment is failing to start after the upgrade.
These errors all relate to the storage driver on the node in some way, so I have grouped them into a single bug report.
Environment