aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

amd series node infinite regenerative status #6449

Open jhyoonzi opened 2 weeks ago

jhyoonzi commented 2 weeks ago

Description

Observed Behavior: I added t3.large and t3a.large to the Karpenter NodePool.

On t3.large, pods created by job.batch come up normally and the node stays healthy.

On t3a.large, a pod created by job.batch briefly enters the Running state and then terminates, the node enters the SchedulingDisabled state, and this cycle repeats indefinitely.

Expected Behavior:

Reproduction Steps (Please include YAML):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: karpenter-spot-test
spec:
  template:
    metadata:
      labels:
        role: ops
        provision: karpenter
    spec:
      nodeClassRef:
        name: karpenter-spot-test
        apiVersion: karpenter.sh/v1beta1
        kind: EC2NodeClass

      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: [
              "t3.large",
              "t3a.large",
              "t4g.large",
              "m5.large",
              "m5a.large",
            ] 
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: "Gt"
          values: ["2"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
  limits:
    cpu: "30000"
    memory: 90000Gi

---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: karpenter-spot-test
spec:
  amiFamily: AL2
  role: KarpenterNodeRole
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        volumeSize: 50Gi
        volumeType: gp3
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: karpenter-spot-test
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: karpenter-spot-test
  tags:
    Name: karpenter.sh/karpenter-spot-test
    karpenter.sh/discovery: karpenter-spot-test
    alpha.eksctl.io/nodegroup-name: karpenter-spot-test

Versions:

jigisha620 commented 1 week ago

I tried to reproduce this on my end and did not run into any issues. What kind of pod are you trying to run? Does it have any resource requirements that can't be satisfied by t3a.large? Can you share the Karpenter controller logs from when this happened? Can you also share what the created NodeClaim looks like?
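
One way to narrow this down is a minimal Job pinned to t3a.large, so that any failure can be attributed to that instance type rather than to the workload. The manifest below is only a sketch and is not taken from the report: the Job name, image, command, and resource requests are placeholder assumptions. Only the `provision: karpenter` label and the instance type come from the NodePool posted above.

```yaml
# Hypothetical reproduction Job; name, image, command, and requests are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: t3a-repro
spec:
  backoffLimit: 0            # do not retry, so a single failure stays visible
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        # Well-known label; restricts scheduling to t3a.large nodes only.
        node.kubernetes.io/instance-type: t3a.large
        # Label set by the NodePool template in the report.
        provision: karpenter
      containers:
        - name: sleep
          image: busybox:1.36
          command: ["sh", "-c", "sleep 600"]
          resources:
            requests:
              # Small requests chosen to fit comfortably on a t3a.large (2 vCPU, 8 GiB).
              cpu: 500m
              memory: 512Mi
```

If this Job also exits early and the node is cordoned again, the pod's own resource requirements are unlikely to be the cause, and the controller logs and NodeClaim become the more useful evidence.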