Closed jalawala closed 2 years ago
I’ve run into this as well and I think narrowed it down to the number of DaemonSets I had running. For a t4g.micro instance, the max pod limit is only 4. So, there actually wasn’t enough room on the nodes to run the workloads. Are you running any DaemonSets?
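For context, the max-pods limit on EKS with the AWS VPC CNI is derived from the instance's ENI and IP-per-ENI limits. A minimal sketch of that calculation (the instance figures below are the published t4g.micro limits):

```python
# Max pods on EKS with the AWS VPC CNI:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# One IP per ENI is reserved for the ENI itself; the +2 accounts for
# host-networking pods (e.g. kube-proxy, aws-node) that don't consume ENI IPs.

def max_pods(enis: int, ips_per_eni: int) -> int:
    return enis * (ips_per_eni - 1) + 2

# t4g.micro supports 2 ENIs with 2 IPv4 addresses each
print(max_pods(2, 2))  # -> 4
```

Once a couple of DaemonSet pods land on the node, almost none of those 4 slots are left for workloads.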
It looks like our maxpods calculations are not factoring in daemonsets. We need to fix this.
I bet this is what I was running into with #930. I’ll close that one in favor of this one.
It looks like static/mirror Pods also need to be taken into consideration. I frequently run into OutOfPods errors that I think come from Pods being scheduled to a given node and then entering this state because static pods take precedence.
I have recreated the original issue. The t4g.micro node is terminated because pods fail to schedule on it due to insufficient memory (the pod requests 1000Mi, while t4g.micro allocatable is only 558Mi).
The issue is caused by an incorrect resource calculation in the binpacking algorithm.
There is potentially a problem with the maxPods calculation as well. I will fix the incorrect resource calculation first and run further experiments to confirm whether the maxPods calculation is also wrong.
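The failing check can be illustrated with a minimal bin-packing feasibility sketch (the numbers come from the comment above; the function itself is a hypothetical illustration, not Karpenter's actual code):

```python
def schedulable(allocatable_mi: int, daemonset_mi: int, pod_request_mi: int) -> bool:
    """Feasibility check: the pod fits only if allocatable memory,
    minus what DaemonSet pods already consume, covers its request."""
    return pod_request_mi <= allocatable_mi - daemonset_mi

# t4g.micro: 558Mi allocatable vs a 1000Mi pod request -- the pod can
# never fit, even before accounting for any DaemonSet overhead.
print(schedulable(558, 0, 1000))  # -> False
```

If the binpacking code overestimates allocatable capacity (or ignores the DaemonSet term), it will keep selecting instance types the pod can never actually land on.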
Created a separate issue to track the DaemonSet problem mentioned above (#1084)
Issue is resolved by https://github.com/aws/karpenter/pull/1080. Coming in 0.5.4
Version
Karpenter: v0.5.3
Kubernetes: v1.21.5
Expected Behavior
Karpenter should wait long enough for nodes to become Ready before considering them unhealthy and removing them. In particular, it should account for the slow start-up of small instance types.
Actual Behavior
This issue is observed with smaller instance types such as t4g.micro, which take longer to reach the Ready state.
T0 = pod is Pending:
Line 4: arm64-6bcdd8f45-95j9v 0/1 Pending 0 5s ip-192-168-77-153.ec2.internal
T1 = pod is terminated by Karpenter on the assumption that something is wrong with the node, but the node is simply not Ready yet:
Line 5: arm64-6bcdd8f45-95j9v 0/1 Terminating 0 52s ip-192-168-77-153.ec2.internal
T2 = node actually becomes Ready:
Line 161: ip-192-168-77-153.ec2.internal Ready 67s v1.21.5-eks-bc4871b t4g.micro arm64 spot
Since the pod has been terminated and the node is now empty, the TTL applies and the node is removed. Karpenter then launches a new instance, but the same issue repeats forever and the pod never gets scheduled. The attached file captures this behaviour in detail, including logs.
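The timeline above can be reduced to a simple grace-period check (purely illustrative logic, not Karpenter's implementation): a node should only be flagged as unhealthy once it has been NotReady for longer than its expected startup time.

```python
def is_unhealthy(not_ready_seconds: float, startup_grace_seconds: float) -> bool:
    # Only flag a node after it has exceeded its startup grace period.
    return not_ready_seconds > startup_grace_seconds

# From the timeline above: the pod was terminated at ~52s, but the
# t4g.micro node only became Ready at ~67s. A grace period shorter than
# the node's real startup time produces the termination loop described.
print(is_unhealthy(52, 70))  # -> False: still within grace, don't terminate
print(is_unhealthy(52, 45))  # -> True: too-short grace flags a slow-but-healthy node
```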
Steps to Reproduce the Problem
I used the below config for arm64, with nodeSelector `kubernetes.io/arch: arm64` in the pod spec.
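For completeness, a minimal Deployment manifest along these lines (only the `nodeSelector` and the 1000Mi memory request come from this report; the name, labels, and image are hypothetical stand-ins):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: arm64                # hypothetical name, matching the pod prefix in the logs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: arm64
  template:
    metadata:
      labels:
        app: arm64
    spec:
      nodeSelector:
        kubernetes.io/arch: arm64
      containers:
        - name: app          # hypothetical container
          image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
          resources:
            requests:
              memory: 1000Mi  # the request size mentioned above
```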
Karpenter selected t4g.micro, where this issue is observed. Detailed logs are attached below.
Resource Specs and Logs
karpenter.txt