msvechla opened this issue 2 weeks ago
It looks like this problem is related to https://karpenter.sh/v1.0/troubleshooting/#maxpods-is-greater-than-the-nodes-supported-pod-density.
I'll point out that some of the language there needs to be updated: for example, I believe "NodePods" in Solution 2 was meant to read "NodePools", and the pod density section now directs to the EC2NodeClass kubelet config section, since that config was moved there from NodePools in v1.
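For anyone following the v1 docs, the kubelet overrides now live on the EC2NodeClass rather than the NodePool; setting an explicit maxPods looks roughly like this (the value shown is illustrative):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: example
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  # kubelet configuration moved here from the NodePool in v1
  kubelet:
    maxPods: 110   # illustrative; use a value that matches your instance types / CNI settings
  role: <node-role>
```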
Please share an update if the problem persists after updating the kubelet spec or enabling prefix delegation.
I'm not quite sure what you mean. I posted my kubelet spec / the entire EC2NodeClass in the original post above. We are not specifying any `maxPods`, as is mentioned in the troubleshooting guide, so it must mean karpenter is setting an incorrect value.
Or did I misunderstand something?
We are not using prefix delegation, and according to the docs it should also not be required.
Can you share what exactly we should update in the kubelet config?
It is also weird that karpenter sets a different pod capacity for different nodes of the same instance type in the cluster, so to me this still looks like a bug.
We are encountering a similar problem that began with the upgrade to v1.0.0. We have noticed an excessive number of pods being scheduled on t3.small/t3a.small instances. Our kubelet configuration does not specify any maxPods settings either.
We're also seeing this issue after upgrading to v1.0.0. Around 10% of new nodes have wildly high allocatable pods (eg 205 for a c6a.2xlarge), whereas mostly the calculations are correct (ie 44 for a c6a.2xlarge, as we have RESERVED_ENIS=1 in the karpenter controller).
We've had to hardcode maxPods:44 in our EC2NodeClass to prevent hundreds of pods getting stuck in FailedCreatePodSandBox status.
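For what it's worth, the 58 / 44 figures line up with the usual ENI-based limit (assuming no prefix delegation, and assuming I have the c6a.2xlarge ENI numbers right: 4 ENIs with 15 IPv4 addresses each), so 44 is indeed the expected value with one ENI reserved:

```
maxPods = ENIs × (IPv4 addresses per ENI − 1) + 2
c6a.2xlarge:            4 × (15 − 1) + 2 = 58
with RESERVED_ENIS=1:   3 × (15 − 1) + 2 = 44
```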
I can confirm that the affected nodes have an incorrect maxPods value in the `# Karpenter Generated NodeConfig` section of the instance user-data. (So AL2023 / kubelet is doing what it's told, and the problem is in karpenter's maxPods calculations.)
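For reference, that generated section of the user-data is a nodeadm NodeConfig, roughly like this (abridged and from memory, so treat the exact field names as approximate); the affected nodes simply have the wrong number under maxPods:

```yaml
# Karpenter Generated NodeConfig
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  kubelet:
    config:
      maxPods: 205   # should be 44 in our case; kubelet faithfully applies whatever is here
```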
I can reproduce this issue on multiple AWS accounts / EKS clusters / Regions / Instance families, and it affects both AL2 and AL2023 AMI families, and both with and without RESERVED_ENIS set in the karpenter controller.
It appears to be related to the presence or absence of a `kubelet` stanza in the EC2NodeClass...
Reproduction Steps: Create a deployment with 50 replicas, with node anti-affinity (a sketch is included after the NodeClass below), in a nodepool which uses the following EC2NodeClass...
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: iharris
spec:
amiSelectorTerms:
- alias: al2023@latest
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
encrypted: true
kmsKeyID: <redacted>
volumeSize: 150Gi
volumeType: gp3
role: karpenter-node-role.<redacted>
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: staging
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: staging
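(For completeness, the deployment used to spread the pods is nothing special; something along these lines, with the names, labels and toleration here being specific to my test nodepool:)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iharris-spread
spec:
  replicas: 50
  selector:
    matchLabels:
      app: iharris-spread
  template:
    metadata:
      labels:
        app: iharris-spread
    spec:
      nodeSelector:
        role: iharris
      tolerations:
        - key: iharris
          operator: Equal
          value: "true"
          effect: NoSchedule
      # anti-affinity on hostname forces one pod per node, so 50 replicas -> 50 nodes
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: iharris-spread
              topologyKey: kubernetes.io/hostname
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
```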
All 50 nodes have the correct `.status.allocatable.pods` - yay!
Change the EC2NodeClass to...
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: iharris
spec:
amiSelectorTerms:
- alias: al2023@latest
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
encrypted: true
kmsKeyID: <redacted>
volumeSize: 150Gi
volumeType: gp3
kubelet:
imageGCLowThresholdPercent: 65
role: karpenter-node-role.<redacted>
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: staging
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: staging
Around 5-10% of the 50 nodes have an incorrect `.status.allocatable.pods` - boo!
(Nothing special about `imageGCLowThresholdPercent`; it seems to be the presence of `spec.kubelet` that triggers the behaviour.)
I think we need that `bug` label back, sorry!
Can you share your NodePool? Do you have the `compatibility.karpenter.sh/v1beta1-kubelet-conversion` annotation set on the nodepool?
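(For reference, if it were set, it would appear under the NodePool's metadata along these lines; the value shown is only a placeholder, the real annotation carries the old v1beta1 kubelet block as JSON:)

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example
  annotations:
    # placeholder value; the real annotation holds the pre-v1 kubelet configuration as JSON
    compatibility.karpenter.sh/v1beta1-kubelet-conversion: '{"maxPods":110}'
```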
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
annotations:
karpenter.sh/nodepool-hash: "15612137669406834936"
karpenter.sh/nodepool-hash-version: v3
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"karpenter.sh/v1","kind":"NodePool","metadata":{"annotations":{},"name":"iharris"},"spec":{"disruption":{"budgets":[{"nodes":"100%"}],"consolidateAfter":"1m","consolidationPolicy":"WhenEmptyOrUnderutilized"},"limits":{"cpu":"500","memory":"2000Gi"},"template":{"metadata":{"labels":{"role":"iharris"}},"spec":{"expireAfter":"1h","nodeClassRef":{"group":"karpenter.k8s.aws","kind":"EC2NodeClass","name":"iharris"},"requirements":[{"key":"karpenter.k8s.aws/instance-category","operator":"In","values":["c","m","r"]},{"key":"karpenter.k8s.aws/instance-generation","operator":"In","values":["5","6"]},{"key":"karpenter.k8s.aws/instance-cpu","operator":"Gt","values":["7"]},{"key":"kubernetes.io/os","operator":"In","values":["linux"]},{"key":"kubernetes.io/arch","operator":"In","values":["amd64"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["on-demand"]}],"taints":[{"effect":"NoSchedule","key":"iharris","value":"true"}]}}}}
creationTimestamp: "2024-08-29T15:29:45Z"
generation: 4
name: iharris
resourceVersion: "864235779"
uid: <redacted>
spec:
disruption:
budgets:
- nodes: 100%
consolidateAfter: 1m
consolidationPolicy: WhenEmptyOrUnderutilized
limits:
cpu: "500"
memory: 2000Gi
template:
metadata:
labels:
role: iharris
spec:
expireAfter: 1h
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: iharris
requirements:
- key: karpenter.k8s.aws/instance-category
operator: In
values:
- c
- m
- r
- key: karpenter.k8s.aws/instance-generation
operator: In
values:
- "5"
- "6"
- key: karpenter.k8s.aws/instance-cpu
operator: Gt
values:
- "7"
- key: kubernetes.io/os
operator: In
values:
- linux
- key: kubernetes.io/arch
operator: In
values:
- amd64
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
taints:
- effect: NoSchedule
key: iharris
value: "true"
status:
conditions:
- lastTransitionTime: "2024-08-29T15:29:45Z"
message: ""
reason: NodeClassReady
status: "True"
type: NodeClassReady
- lastTransitionTime: "2024-08-29T15:29:45Z"
message: ""
reason: Ready
status: "True"
type: Ready
- lastTransitionTime: "2024-08-29T15:29:45Z"
message: ""
reason: ValidationSucceeded
status: "True"
type: ValidationSucceeded
resources:
cpu: "0"
ephemeral-storage: "0"
memory: "0"
nodes: "0"
pods: "0"
That's a new nodepool, created to test this issue. The old nodepools that were upgraded from v0.35.7 have eg a `compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"name":"default"}'` annotation, but none have `compatibility.karpenter.sh/v1beta1-kubelet-conversion` annotations.
Can you provide all of the NodePools and EC2NodeClasses in the cluster?
Sure thing, here's the `-oyaml` from the cluster I'm currently testing in: issue-6890-resources.txt. I've reproduced the issue in both the pre-upgrade `default` and the post-upgrade `iharris` ec2nc/nodepools.
Could it be related to https://github.com/aws/karpenter-provider-aws/pull/6167 which was included in v0.37.0? It mentions data races, and to me this looks like a data race, as nodes of the exact same instance type have different values assigned. As part of the v1 upgrade we also updated from v0.36.2 to the latest v0.37.x.
EDIT: It's probably unrelated, as our clusters on v0.37.x have not shown this issue so far, only clusters on v1.x.
Something else I noticed: the NodeClaim of the affected nodes has the correct value in `.status.capacity.pods`; it is just the matching Node that has an incorrect value for `.status.capacity.pods`.
@iharris-luno what instance types have been affected in your case? Also r7a.medium and m7a.medium?
We've seen the issue in c6a.2xlarge and r5a.2xlarge instances.
Good spot on the NodeClaim vs Node versions of `.status.capacity.pods`. However, it doesn't seem that the NodeClaims are always correct... I just found a NodeClaim with an incorrect `.status.capacity.pods: 205`.
@iharris-luno I used your configuration and I was not able to replicate the issue. Do you think you can share the nodes and nodeclaims that were impacted by the issue?
Hi,
I have the same issue with a t3.small instance:
nodeClaim.status.allocatable:
Cpu: 1930m
Ephemeral - Storage: 35Gi
Memory: 1418Mi
Pods: 11
node.status.allocatable:
cpu: 1930m
ephemeral-storage: 37569620724
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1483068Ki
pods: 8
I'm using version 1.0.1, but I tested with version 1.0.2 too.
Regards
I've just spun up 2000 c6a.2xlarge nodes in batches of 50, and not one of them had an incorrect NodeClaim. (If I'd realised how rare they were, compared to incorrect Nodes, I'd have grabbed the yaml of the one I found previously!) Plenty of incorrect nodes though (225 / 2000), so here's one of them and its associated nodeclaim...
node-1.zip
Saw these values on an r7a.medium:
node.status
Allocatable:
cpu: 940m
ephemeral-storage: 95551679124
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7467144Ki
pods: 58
nodeclaim.status
Allocatable:
Cpu: 940m
Ephemeral - Storage: 89Gi
Memory: 7134Mi
Pods: 8
vpc.amazonaws.com/pod-eni: 4
Description
Observed Behavior:
Since we upgraded to Karpenter v1 we observed incorrect kubelet `maxPods` settings for multiple nodes. We initially only noticed the issue with m7a.medium instances, however today we also had a case with an r7a.medium instance. The issue becomes visible when multiple pods on a node in the cluster are stuck in initializing with:

Checking the node, it immediately becomes obvious that too many pods have been scheduled on it, and the node is running out of IP addresses.
In the example with m7a.medium we observed multiple nodes in the same cluster (all m7a.medium) with a different `status.capacity.pods` specified. We observed nodes with 8, 58 and 29 `maxPods` in the cluster.

According to https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/eni-max-pods.txt#L518 the correct number should be 8. So the nodes which had a higher number specified ran into the issue mentioned above.

Logging into the nodes and checking the kubelet config revealed the following:
So it appears that the correct value is specified in `/etc/kubernetes/kubelet/config.json` but overwritten in `/etc/kubernetes/kubelet/config.json.d/00-nodeadm.conf`.
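In other words (illustrative values based on the 8 / 58 figures above; as far as I understand, kubelet merges the drop-ins from config.json.d over the base config, so the drop-in value is the one that takes effect):

```
/etc/kubernetes/kubelet/config.json                     "maxPods": 8    <- correct for m7a.medium
/etc/kubernetes/kubelet/config.json.d/00-nodeadm.conf   "maxPods": 58   <- incorrect, and this is the value applied
```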
We use AL2023 and we do not specify any value for `podsPerCore` in our karpenter resources or similar.

As we had different nodes of the same instance type with varying values, this could also be some kind of race condition or similar.
Expected Behavior:
Calculated `maxPods` matches the value in https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/eni-max-pods.txt

Reproduction Steps (Please include YAML):

Used EC2NodeClass
Versions:
Chart Version: v1.0.1
Kubernetes Version (`kubectl version`): v1.29.6-eks-db838b0

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment