Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] Frequent Pod restarts when using Windows Containers with ephemeral disks #3922

Open lippertmarkus opened 10 months ago

lippertmarkus commented 10 months ago

Describe the bug

When using ephemeral OS disks for the Windows nodes in our cluster, we experience sporadic Pod restarts, especially when the node is under high load (e.g. multiple pods starting at the same time). The same pods run without issues on nodes with managed OS disks.

On nodes with ephemeral OS disks, containers terminate with Exit Code: -1073741510 and the events show:

Events:
  Type     Reason     Age                From     Message
  ----     ------     ----               ----     -------
  Warning  Unhealthy  57m (x5 over 92m)  kubelet  Startup probe failed: Starting the CLR failed with HRESULT 8007000e.\r

  Warning  Unhealthy  43m (x5 over 129m)   kubelet  Readiness probe failed:
  Warning  Unhealthy  42m (x31 over 104m)  kubelet  (combined from similar events): Liveness probe failed: Starting the CLR failed with HRESULT 80004005.\r

  Warning  Unhealthy  37m (x30 over 129m)  kubelet  Liveness probe failed: Starting the CLR failed with HRESULT 80004005.\r
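
For anyone reading the codes above: the exit code is a signed 32-bit rendering of a Windows NTSTATUS value, and the probe messages carry HRESULT values. The following is a minimal sketch of how they can be decoded in a shell. The helper names `to_hex32` and `describe_code` are made up for illustration, and the lookup table only covers the three codes seen in this issue. 8007000e is E_OUTOFMEMORY, which would be consistent with memory pressure on the node, although nothing in this thread confirms a root cause.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for decoding the codes reported in this issue.

# Convert a signed 32-bit exit code (as shown by kubectl) to unsigned hex.
to_hex32() {
  printf '0x%08X\n' "$(( $1 & 0xFFFFFFFF ))"
}

# Map a hex code (with or without a 0x prefix) to its well-known name.
# Only the three codes from this issue are listed.
describe_code() {
  code=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  code=${code#0X}
  case "$code" in
    C000013A) echo "STATUS_CONTROL_C_EXIT" ;;
    8007000E) echo "E_OUTOFMEMORY" ;;
    80004005) echo "E_FAIL" ;;
    *)        echo "unknown" ;;
  esac
}

to_hex32 -1073741510                      # 0xC000013A
describe_code "$(to_hex32 -1073741510)"   # STATUS_CONTROL_C_EXIT
describe_code 8007000e                    # E_OUTOFMEMORY
describe_code 80004005                    # E_FAIL
```

STATUS_CONTROL_C_EXIT means the process was terminated from outside (Ctrl+C or an equivalent console close), which fits a container being killed after its probes failed rather than crashing on its own.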

To Reproduce

Steps to reproduce the behavior:

  1. Set up an AKS cluster with both Linux and Windows nodes and ephemeral OS disks
  2. Deploy multiple pods in parallel with higher CPU/memory resource demands to utilize 70+% of the node capacity
  3. Observe pod restarts with exit codes and events like those described above
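
Step 1 could look roughly like the sketch below, which adds a Windows node pool with an ephemeral OS disk to an existing cluster. The function name, pool name, VM size, and node count are assumptions chosen for illustration and are not the values from the affected cluster. The VM size has to be one that supports ephemeral OS disks.

```shell
#!/usr/bin/env bash
# Hypothetical helper: adds a Windows node pool that uses an ephemeral OS disk.
# Arguments: resource group, cluster name.
# The pool name, VM size, and node count below are illustrative assumptions.
add_ephemeral_windows_pool() {
  az aks nodepool add \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name npwin \
    --os-type Windows \
    --node-osdisk-type Ephemeral \
    --node-vm-size Standard_D8s_v3 \
    --node-count 2
}
```

Calling `add_ephemeral_windows_pool my-rg my-aks` (both names are placeholders) creates the pool. Changing `--node-osdisk-type` to `Managed` in a second pool gives a comparison group for the same workload.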

It might be related to our specific workload; I can give you access to the container images for troubleshooting.

Expected behavior

Pods should run stably.

Screenshots

Pods on nodes with ephemeral OS disk: image

Pods on nodes with managed OS disk: image

shurick81 commented 10 months ago

Hi @lippertmarkus, we are experiencing a similar issue. Have you checked whether the nodes are restarting as well? In our case, we found that some nodes crash:

az monitor activity-log list \
  --subscription "Sub1" \
  --resource-id /subscriptions/a8549e6e-fbc2-4ab2-b2c0-6fd39f8e6ffe/resourceGroups/aks-smb-multiple-connection-issue-common-01/providers/Microsoft.Compute/virtualMachineScaleSets/akswin00 \
  --offset 2d \
  --query "[?properties.title == 'Machine crashed'].{eventTimestamp: eventTimestamp, resourceId: resourceId}" \
  --max-events 1000

shurick81 commented 10 months ago

We have an ongoing support case with Microsoft for our issue, but I'm not sure whether it is related to yours. By the way, I wonder whether this could also be related to in-place Windows patching on these machines. Or are these machines never patched, only reimaged?

microsoft-github-policy-service[bot] commented 6 months ago

Action required from @Azure/aks-pm

microsoft-github-policy-service[bot] commented 5 months ago

Issue needing attention of @Azure/aks-leads
