aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Incorrect Number of maxPods / Node Pods Capacity #6890


msvechla commented 2 weeks ago

Description

Observed Behavior:

Since upgrading to Karpenter v1 we have observed incorrect kubelet maxPods settings on multiple nodes. We initially only noticed the issue on m7a.medium instances, but today we also had a case with an r7a.medium instance.

The issue becomes visible when multiple pods on a node in the cluster are stuck in initializing with:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "850cdbed09a9f986b2370c7409fb9e5ee782846056ec7466fb13e863f6e225ad": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Checking the node, it immediately becomes obvious that too many pods have been scheduled on it, and the node is running out of IP addresses.

In the m7a.medium case we observed multiple nodes in the same cluster (all m7a.medium) with different status.capacity.pods values.

We saw nodes with 8, 58, and 29 maxPods in the cluster.

According to https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/eni-max-pods.txt#L518 the correct number should be 8. So the nodes which had a higher number specified ran into the issue mentioned above.
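For reference, that figure matches the standard ENI-based calculation (a sketch; assumes the 2 ENIs with 4 IPv4 addresses each that the EC2 documentation lists for m7a.medium):

# maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2
echo $(( 2 * (4 - 1) + 2 ))   # m7a.medium -> 8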

Logging into the nodes and checking the kubelet config revealed the following:

[root@ip]# cat /etc/kubernetes/kubelet/config.json.d/00-nodeadm.conf |grep maxPods
    "maxPods": 29,
[root@ip]# cat /etc/kubernetes/kubelet/config.json |grep maxPods
  "maxPods": 8,

So it appears that the correct value is specified in /etc/kubernetes/kubelet/config.json but overridden by /etc/kubernetes/kubelet/config.json.d/00-nodeadm.conf.
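To double-check which value the kubelet actually applied after merging the drop-in directory, the kubelet configz endpoint can be read through the API server proxy (a sketch; the node name is a placeholder and jq is assumed to be available):

# Print the effective (merged) maxPods as the kubelet sees it
NODE=ip-10-0-0-1.eu-central-1.compute.internal
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" | jq '.kubeletconfig.maxPods'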

We use AL2023 and do not specify any value for podsPerCore (or anything similar) in our Karpenter resources.

Since different nodes of the same instance type ended up with different values, this could also be some kind of race condition.

Expected Behavior:

The calculated maxPods matches the value in https://github.com/awslabs/amazon-eks-ami/blob/main/templates/shared/runtime/eni-max-pods.txt

Reproduction Steps (Please include YAML):

Used EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@v20240807
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 30Gi
      volumeType: gp3
  detailedMonitoring: true
  instanceProfile: karpenter
  kubelet:
    kubeReserved:
      cpu: 200m
      ephemeral-storage: 1Gi
      memory: 200Mi
    systemReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  securityGroupSelectorTerms:
  - name: karpenter
  subnetSelectorTerms:
  - tags:
      Name: private
  userData: |
    #!/bin/bash

    # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes
    # https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1955
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    echo "$(jq ".shutdownGracePeriod=\"400s\"" /etc/kubernetes/kubelet/config.json)" > /etc/kubernetes/kubelet/config.json
    echo "$(jq ".shutdownGracePeriodCriticalPods=\"100s\"" /etc/kubernetes/kubelet/config.json)" > /etc/kubernetes/kubelet/config.json
    systemctl restart kubelet

Versions:

rschalo commented 2 weeks ago

It looks like this problem is related to https://karpenter.sh/v1.0/troubleshooting/#maxpods-is-greater-than-the-nodes-supported-pod-density.

I'll point out that some of the language there needs to be updated: for example, I believe "NodePods" in Solution 2 was meant to say "NodePools", and the pod density section should now point to the EC2NodeClass kubelet config section, since that configuration moved there from NodePools in v1.

Please share an update if the problem persists after updating the kubelet spec or enabling prefix delegation.
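For reference, a minimal sketch of pinning maxPods explicitly in the EC2NodeClass kubelet block, per the troubleshooting guide (the nodeclass name and the value 8 are placeholders; pick the value for your instance types from eni-max-pods.txt):

# Merge-patch the EC2NodeClass so Karpenter and the node agree on maxPods
kubectl patch ec2nodeclass default --type merge \
  -p '{"spec":{"kubelet":{"maxPods":8}}}'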

msvechla commented 2 weeks ago

I'm not quite sure what you mean. I posted my kubelet spec / the entire EC2NodeClass in the original post above. We are not specifying any maxPods, as mentioned in the troubleshooting guide, so Karpenter itself must be setting an incorrect value.

Or did I misunderstand something?

We are not using prefix delegation, and according to the docs it should also not be required.

Can you share what exactly we should update in the kubelet config?

It is also weird that karpenter sets a different pod capacity for different nodes of the same instance type in the cluster, so to me this still looks like a bug.

waihong commented 2 weeks ago

We are encountering a similar problem that began with the upgrade to v1.0.0. We have noticed an excessive number of pods being scheduled on t3.small/t3a.small instances. Our kubelet configuration does not specify any maxPods settings either.

iharris-luno commented 2 weeks ago

We're also seeing this issue after upgrading to v1.0.0. Around 10% of new nodes have wildly high allocatable pods (e.g. 205 for a c6a.2xlarge), whereas the calculations are mostly correct (i.e. 44 for a c6a.2xlarge, as we have RESERVED_ENIS=1 in the karpenter controller). We've had to hardcode maxPods: 44 in our EC2NodeClass to prevent hundreds of pods getting stuck in FailedCreatePodSandBox status.

I can confirm that the affected nodes have an incorrect maxPods value in the # Karpenter Generated NodeConfig section of the instance user data, so AL2023 / the kubelet is doing what it's told and the problem is in Karpenter's maxPods calculation.

I can reproduce this issue on multiple AWS accounts / EKS clusters / regions / instance families. It affects both the AL2 and AL2023 AMI families, both with and without RESERVED_ENIS set in the karpenter controller.
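If anyone wants to check their own instances, a rough sketch of pulling the rendered user data (assumes the AWS CLI; the instance ID is a placeholder taken from an affected Node's .spec.providerID):

# Decode the instance user data and look at the maxPods Karpenter passed to nodeadm
aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute userData \
  --query 'UserData.Value' \
  --output text | base64 -d | grep -i maxPods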

iharris-luno commented 2 weeks ago

It appears to be related to the presence or absence of a kubelet stanza in the EC2NodeClass...

Reproduction Steps: Create a deployment with 50 replicas, with node anti-affinity, in a nodepool which uses the following EC2NodeClass...

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: iharris
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      kmsKeyID: <redacted>
      volumeSize: 150Gi
      volumeType: gp3
  role: karpenter-node-role.<redacted>
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging

All 50 nodes have the correct .status.allocatable.pods - yay!

Change the EC2NodeClass to...

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: iharris
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      kmsKeyID: <redacted>
      volumeSize: 150Gi
      volumeType: gp3
  kubelet:
    imageGCLowThresholdPercent: 65
  role: karpenter-node-role.<redacted>
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: staging

Around 5-10% of the 50 nodes have an incorrect .status.allocatable.pods - boo! (Nothing special about imageGCLowThresholdPercent; it seems to be the presence of spec.kubelet that triggers the behaviour.)

I think we need that bug label back, sorry!

engedaam commented 2 weeks ago

Can you share your NodePool? Do you have the compatibility.karpenter.sh/v1beta1-kubelet-conversion annotation set on the NodePool?

iharris-luno commented 2 weeks ago
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "15612137669406834936"
    karpenter.sh/nodepool-hash-version: v3
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.sh/v1","kind":"NodePool","metadata":{"annotations":{},"name":"iharris"},"spec":{"disruption":{"budgets":[{"nodes":"100%"}],"consolidateAfter":"1m","consolidationPolicy":"WhenEmptyOrUnderutilized"},"limits":{"cpu":"500","memory":"2000Gi"},"template":{"metadata":{"labels":{"role":"iharris"}},"spec":{"expireAfter":"1h","nodeClassRef":{"group":"karpenter.k8s.aws","kind":"EC2NodeClass","name":"iharris"},"requirements":[{"key":"karpenter.k8s.aws/instance-category","operator":"In","values":["c","m","r"]},{"key":"karpenter.k8s.aws/instance-generation","operator":"In","values":["5","6"]},{"key":"karpenter.k8s.aws/instance-cpu","operator":"Gt","values":["7"]},{"key":"kubernetes.io/os","operator":"In","values":["linux"]},{"key":"kubernetes.io/arch","operator":"In","values":["amd64"]},{"key":"karpenter.sh/capacity-type","operator":"In","values":["on-demand"]}],"taints":[{"effect":"NoSchedule","key":"iharris","value":"true"}]}}}}
  creationTimestamp: "2024-08-29T15:29:45Z"
  generation: 4
  name: iharris
  resourceVersion: "864235779"
  uid: <redacted>
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidateAfter: 1m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "500"
    memory: 2000Gi
  template:
    metadata:
      labels:
        role: iharris
    spec:
      expireAfter: 1h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: iharris
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-generation
        operator: In
        values:
        - "5"
        - "6"
      - key: karpenter.k8s.aws/instance-cpu
        operator: Gt
        values:
        - "7"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      taints:
      - effect: NoSchedule
        key: iharris
        value: "true"
status:
  conditions:
  - lastTransitionTime: "2024-08-29T15:29:45Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-08-29T15:29:45Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-29T15:29:45Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "0"
    ephemeral-storage: "0"
    memory: "0"
    nodes: "0"
    pods: "0"

That's a new nodepool, created to test this issue. The old nodepools that were upgraded from v0.35.7 have, for example, a compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"name":"default"}' annotation, but none have the compatibility.karpenter.sh/v1beta1-kubelet-conversion annotation.

engedaam commented 2 weeks ago

Can you provide all of the NodePools and EC2NodeClasses in the cluster?

iharris-luno commented 2 weeks ago

Sure thing, here's the -oyaml from the cluster I'm currently testing in: issue-6890-resources.txt. I've reproduced the issue in both the pre-upgrade default and the post-upgrade iharris ec2nc/nodepools.

msvechla commented 1 week ago

Could it be related to https://github.com/aws/karpenter-provider-aws/pull/6167, which was included in v0.37.0? It mentions data races, and to me this looks like a data race, as nodes of the exact same instance type have different values assigned. As part of the v1 upgrade we also updated from v0.36.2 to the latest v0.37.x.

EDIT: It's probably unrelated, as our clusters on v0.37.x have not shown this issue so far, only clusters on v1.x.

msvechla commented 1 week ago

Something else I noticed:

The NodeClaims of the affected nodes have the correct value in .status.capacity.pods; only the matching Nodes have an incorrect value for .status.capacity.pods.
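A quick sketch of how the mismatch can be spotted across the cluster (assumes each NodeClaim records its Node in .status.nodeName):

# Compare the pod capacity on every NodeClaim with its matching Node
for nc in $(kubectl get nodeclaims -o name); do
  node=$(kubectl get "$nc" -o jsonpath='{.status.nodeName}')
  [ -z "$node" ] && continue
  nc_pods=$(kubectl get "$nc" -o jsonpath='{.status.capacity.pods}')
  node_pods=$(kubectl get node "$node" -o jsonpath='{.status.capacity.pods}')
  [ "$nc_pods" != "$node_pods" ] && echo "mismatch: $nc pods=$nc_pods, node/$node pods=$node_pods"
done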

@iharris-luno what instance types have been affected in your case? Also r7a.medium and m7a.medium?

iharris-luno commented 1 week ago

We've seen the issue on c6a.2xlarge and r5a.2xlarge instances. Good spot on the NodeClaim vs Node versions of .status.capacity.pods. However, it doesn't seem that the NodeClaims are always correct... I just found a NodeClaim with an incorrect .status.capacity.pods of 205.

engedaam commented 5 days ago

@iharris-luno I used your configuration and was not able to replicate the issue. Do you think you can share the Nodes and NodeClaims that were impacted by the issue?

caiohasouza commented 4 days ago

Hi,

I have the same issue with a t3.small instance:

nodeClaim.status.allocatable:
    Cpu:                  1930m
    Ephemeral - Storage:  35Gi
    Memory:               1418Mi
    Pods:                 11
node.status.allocatable:
    cpu:                1930m
    ephemeral-storage:  37569620724
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             1483068Ki
    pods:               8

I'm using version 1.0.1, but I tested with version 1.0.2 too.

Regards

iharris-luno commented 4 days ago

I've just spun up 2000 c6a.2xlarge nodes in batches of 50, and not one of them had an incorrect NodeClaim. (If I'd realised how rare they were compared to incorrect Nodes, I'd have grabbed the YAML of the one I found previously!) Plenty of incorrect Nodes though (225/2000), so here's one of them and its associated NodeClaim: node-1.zip
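In case it helps others, one way to spot the outliers at a glance (a sketch; it just lists each node's instance type next to its registered pod capacity, so values like 205 on a c6a.2xlarge stand out):

# Group nodes by instance type and show the pod capacity each one registered
kubectl get nodes -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,PODS:.status.capacity.pods' | sort -k2,2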

k24dizzle commented 3 days ago

Saw these values on an r7a.medium:

node.status

Allocatable:
  cpu:                940m
  ephemeral-storage:  95551679124
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             7467144Ki
  pods:               58

nodeclaim.status

  Allocatable:
    Cpu:                        940m
    Ephemeral - Storage:        89Gi
    Memory:                     7134Mi
    Pods:                       8
    vpc.amazonaws.com/pod-eni:  4