Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Windows Nodes Don't Currently Support Out of Memory Eviction (OOMKILL) #2820

Closed ojfw20 closed 2 weeks ago

ojfw20 commented 2 years ago

What happened: Pods fail to start on a Windows nodepool after resource demands increase past node capacity. Pods on a Windows node also start paging to disk when the node runs out of memory.

What you expected to happen: OOMKill feature triggers scheduling of pods on a node with free memory, allowing pods to start as expected.

How to reproduce it (as minimally and precisely as possible): Overallocate a Windows node with pods, then have the pods request more memory than the node can provide.
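
The reproduction step above can be sketched as a small script that generates a Deployment whose memory requests are low (so the scheduler packs many pods onto one node) while the limits are high (so the pods together can exceed node memory). This is only an illustration: the deployment name, the image `example.azurecr.io/memory-hog:latest`, and the node size are hypothetical and not taken from the original report.

```python
import json


def build_overcommit_deployment(name, image, replicas, request_mib, limit_mib):
    """Return a Deployment manifest (as a dict) pinned to Windows nodes.

    Low requests let many pods schedule onto one node; high limits let
    them jointly ask for more memory than the node can provide.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Well-known node label that selects Windows nodes.
                    "nodeSelector": {"kubernetes.io/os": "windows"},
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "resources": {
                                "requests": {"memory": f"{request_mib}Mi"},
                                "limits": {"memory": f"{limit_mib}Mi"},
                            },
                        }
                    ],
                },
            },
        },
    }


def worst_case_demand_mib(manifest):
    """Sum of memory limits across all replicas, in MiB.

    Assumes every limit is written with the 'Mi' suffix, as produced by
    build_overcommit_deployment.
    """
    containers = manifest["spec"]["template"]["spec"]["containers"]
    per_pod = sum(int(c["resources"]["limits"]["memory"][:-2]) for c in containers)
    return per_pod * manifest["spec"]["replicas"]


if __name__ == "__main__":
    # Hypothetical image and sizes, chosen only for illustration.
    manifest = build_overcommit_deployment(
        "memhog", "example.azurecr.io/memory-hog:latest", 20, 128, 2048
    )
    node_allocatable_mib = 12 * 1024  # hypothetical node size
    print("worst-case demand (MiB):", worst_case_demand_mib(manifest))
    print("exceeds node:", worst_case_demand_mib(manifest) > node_allocatable_mib)
    print(json.dumps(manifest, indent=2))
```

The JSON it prints can be passed to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.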

Anything else we need to know?: We have been informed that OOMKill is not supported on Windows nodes. This seems to be a gaping hole in the feasibility of using Windows nodepools for any sort of elastic scalability. We would like to see OOMKill supported on Windows nodepools.

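https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#kubelet-compatibility agrees, and states that:

- The (Windows) kubelet does not take OOM eviction actions
- Eviction by using --enforce-node-allocable is not implemented
- Eviction by using --eviction-hard and --eviction-soft are not implemented

Environment:

- Kubernetes version (use `kubectl version`): 1.21.7
- Size of cluster (how many worker nodes are in the cluster?): 4 Windows Nodes, 2 Linux Nodes
- General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.):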

ghost commented 2 years ago

Hi ojfw20, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 2 years ago

Triage required from @Azure/aks-pm

ojfw20 commented 2 years ago

/sig windows

ghost commented 2 years ago

@immuzz, @justindavies would you be able to assist?

Issue Details
**What happened**: Pods fail to start on a Windows nodepool after resource demands increase past node capacity. Pods on a Windows node also start paging to disk when the node runs out of memory.

**What you expected to happen**: OOMKill feature triggers scheduling of pods on a node with free memory, allowing pods to start as expected.

**How to reproduce it (as minimally and precisely as possible)**: Overallocate a Windows node with pods, then have the pods request more memory than the node can provide.

**Anything else we need to know?**: We have been informed that OOMKill is not supported on Windows nodes. This seems to be a gaping hole in the feasibility of using Windows nodepools for any sort of elastic scalability. We would like to see OOMKill supported on Windows nodepools.

https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#kubelet-compatibility agrees, and states that:

- The (Windows) kubelet does not take OOM eviction actions
- Eviction by using --enforce-node-allocable is not implemented
- Eviction by using --eviction-hard and --eviction-soft are not implemented

**Environment**:

- Kubernetes version (use `kubectl version`): 1.21.7
- Size of cluster (how many worker nodes are in the cluster?): 4 Windows Nodes, 2 Linux Nodes
- General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.):

Author: ojfw20
Assignees: -
Labels: `feature-request`, `triage`, `windows`, `action-required`
Milestone: -
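
For readers unfamiliar with the `--eviction-hard` flag quoted in the issue details: on nodes where it is implemented, the kubelet is given thresholds such as `memory.available<500Mi` or `memory.available<10%` and starts evicting pods once the signal drops below the threshold. The sketch below only illustrates that comparison. It is not the kubelet's code, the function names are made up for this example, and it handles just the `Ki`/`Mi`/`Gi` and `%` forms.

```python
import re

# Binary suffixes handled by this sketch (the kubelet accepts more forms).
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_threshold(expr):
    """Split e.g. 'memory.available<500Mi' into (signal, kind, value).

    kind is 'bytes' (value in bytes) or 'percent' (value as a fraction).
    """
    m = re.fullmatch(r"([a-z.]+)<(\d+(?:\.\d+)?)(%|Ki|Mi|Gi)", expr)
    if m is None:
        raise ValueError(f"unsupported threshold expression: {expr!r}")
    signal, number, unit = m.groups()
    if unit == "%":
        return signal, "percent", float(number) / 100.0
    return signal, "bytes", float(number) * _SUFFIXES[unit]


def threshold_crossed(expr, available_bytes, capacity_bytes):
    """Return True when the available amount is below the threshold."""
    _signal, kind, value = parse_threshold(expr)
    limit = value * capacity_bytes if kind == "percent" else value
    return available_bytes < limit


if __name__ == "__main__":
    capacity = 8 * 1024**3  # hypothetical 8 GiB node
    for free_mib in (600, 400):
        crossed = threshold_crossed(
            "memory.available<500Mi", free_mib * 1024**2, capacity
        )
        print(f"{free_mib} MiB free -> eviction threshold crossed: {crossed}")
```

Without such a threshold being enforced, nothing reacts when available memory runs low, which matches the paging-to-disk behaviour described in the report.
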
ojfw20 commented 2 years ago

Hi! Any update on this?

AbelHu commented 2 years ago

I think that this needs upstream support. cc @allyford

ojfw20 commented 1 year ago

bump @allyford

AbelHu commented 1 year ago

Reference https://github.com/kubernetes/kubernetes/issues/119184

allyford commented 7 months ago

Reference kubernetes/kubernetes#119184

Based on the update here, creating a separate feature request specifically for adding the new kubelet parameters from upstream into AKS. See #4068

allyford commented 2 weeks ago

Closing this issue. Upstream investigations of node conditions that lead to evictions can be found here: https://github.com/kubernetes/kubernetes/issues/119184

Now that upstream supports memory-based eviction for Windows, using #4068 for tracking.