aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter 1.0.6 - failed to assign an IP address to container #7318

Open genseb13011 opened 1 week ago

genseb13011 commented 1 week ago

Description

Observed Behavior:

Pods are stuck in "ContainerCreating" status with the error below:

Warning FailedCreatePodSandBox 2m47s (x1642 over 6h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "71ef22cdaf65163adca4c97ed66df6a7cdcdcbe7c011d0ff62a77648cba5b46b": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Expected Behavior:

Karpenter should detect that the maximum number of allowed IPs has been reached on the node and provision a new one.

Reproduction Steps:

Versions:

Other information:

engedaam commented 1 week ago

Can you provide your NodePool and EC2NodeClass configurations? Also, have you followed our troubleshooting guide for this issue? https://karpenter.sh/docs/troubleshooting/#cni-is-unable-to-allocate-ips-to-pods
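For reference, one common mitigation for this class of problem is to cap the pod count per node so the scheduler can never place more pods than the CNI has IPs for. A minimal sketch against the v1 `EC2NodeClass` API, assuming the `spec.kubelet.maxPods` field; the value shown is illustrative, not a recommendation:

```yaml
# Sketch only: cap pods per node so scheduling cannot outrun the
# instance type's ENI IP capacity.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  kubelet:
    # Illustrative value; derive the real one from the instance type's
    # ENI limits: maxPods = ENIs * (IPv4 addresses per ENI - 1) + 2.
    maxPods: 110
```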

genseb13011 commented 1 week ago

Thanks for your answer.

Please find the NodePool and EC2NodeClass configurations below.

NodePool configuration:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: knodes-app-spot-nodepool
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 2m
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata: {}
    spec:
      expireAfter: 720h0m0s
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-hypervisor
        operator: In
        values:
        - nitro
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
        - amd64
      - key: nodegroup
        operator: In
        values:
        - knodes-app
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - r
        - x
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-west-1c
      - key: karpenter.k8s.aws/instance-cpu
        operator: Gt
        values:
        - "2"
      - key: karpenter.k8s.aws/instance-cpu
        operator: Lt
        values:
        - "33"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
  weight: 100

EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      encrypted: true
      kmsKeyID: arn:aws:kms:xxxxx:xxxxxxxxx:key/xxxxxxxxxxxxxxxxxx
      volumeSize: 200Gi
      volumeType: gp3
  instanceProfile: nodes-karpenter-NodeInstanceProfile
  kubelet:
    evictionHard:
      memory.available: 2%
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    evictionSoft:
      memory.available: 3%
    evictionSoftGracePeriod:
      memory.available: 2m0s
    podsPerCore: 12
    systemReserved:
      memory: 300Mi
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  securityGroupSelectorTerms:
  - id: sg-xxxxxxxxxxxxxxxxx
  subnetSelectorTerms:
  - id: subnet-xxxxxxxxxxxxxxx
  - id: subnet-xxxxxxxxxxxxxxx
  - id: subnet-xxxxxxxxxxxxxxx

N.B.: I've added "podsPerCore" to "fix" the issue temporarily.

Yes, I've read the troubleshooting section, but:

Thanks again

genseb13011 commented 1 week ago

I confirm that we don't use "Security Groups per Pod" feature.

Another thing to mention is that, when the issue last occurred:

so the number of pods and the number of IPs were not aligned (I don't know if this behaviour is "normal").
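As an aside, a pod/IP mismatch on its own is often expected: the VPC CNI maintains a warm pool of pre-attached secondary IPs, governed by the `WARM_ENI_TARGET`, `WARM_IP_TARGET`, and `MINIMUM_IP_TARGET` environment variables on the `aws-node` DaemonSet, so the attached IP count normally exceeds the running pod count. A hedged sketch of tuning that pool (the values are examples only, not recommendations):

```yaml
# Sketch: warm-pool tuning on the aws-node DaemonSet in kube-system.
# WARM_IP_TARGET / MINIMUM_IP_TARGET are standard VPC CNI settings;
# the values below are illustrative.
spec:
  template:
    spec:
      containers:
      - name: aws-node
        env:
        - name: WARM_IP_TARGET     # spare IPs kept ready per node
          value: "5"
        - name: MINIMUM_IP_TARGET  # floor on total IPs held per node
          value: "30"
```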

Seb.

genseb13011 commented 1 week ago

I'm adding more information about my issue:

Even with my "podsPerCore: 12" setting, I'm still facing the issue.

The instance type is "r5b.8xlarge" (240 IPs max.)

Same behaviour: only 157 pods are assigned to it, yet it holds 232 secondary IPs plus 8 "primary" ones.
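As a back-of-the-envelope check: EKS's standard ENI-based pod limit is lower than the raw IP count, because each ENI's primary IP is unusable for pods, while two host-network pods are added back. Assuming r5b.8xlarge exposes 8 ENIs with 30 IPv4 addresses each:

```
maxPods = ENIs * (IPv4 per ENI - 1) + 2
        = 8 * (30 - 1) + 2
        = 234
```

The observed 232 secondary IPs match 8 * 29 exactly, i.e. every usable secondary IP appears to be attached even though far fewer pods are actually running.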