aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh

Karpenter is always disrupting nodes via drift after upgrading to 0.37.3 from 0.32.10 #7049

Open · ariretiarno opened this issue 2 days ago

ariretiarno commented 2 days ago

Description

Observed Behavior: Karpenter keeps disrupting nodes via drift, even though the disruption configuration sets expireAfter: Never and consolidateAfter: Never. This started after I upgraded from 0.32.10 to 0.37.3.

It is happening to every node in the cluster.

Logs


{"level":"INFO","time":"2024-09-20T09:31:25.580Z","logger":"controller","message":"disrupting via drift replace, terminating 1 nodes (1 pods) ip-172-31-125-79.ap-southeast-1.compute.internal/t3a.xlarge/on-demand and replacing with on-demand node from types t3a.xlarge","commit":"378e8b1","controller":"disruption","command-id":"a49138f2-51e9-4bb1-95b1-1973aa9d694f"}

Expected Behavior: Karpenter should not disrupt nodes when I have defined:

disruption:
    expireAfter: Never
    consolidateAfter: Never
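
Note: in the v1beta1 API, drift is a disruption mechanism separate from expiration and consolidation, so expireAfter: Never and consolidateAfter: Never do not turn it off. As a workaround sketch (based on the v0.37 disruption-budget behavior, not a confirmed fix for this bug), a zero-node budget blocks all voluntary disruption, drift included:

disruption:
  expireAfter: Never
  consolidateAfter: Never
  budgets:
  - nodes: "0"   # "0" prevents Karpenter from voluntarily disrupting any node, including for drift

Alternatively, on 0.37.x drift can be disabled outright via the Drift feature gate (Helm value settings.featureGates.drift=false); it is enabled by default.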

Reproduction Steps (Please include YAML): NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: jenkins-master
spec:
  disruption:
    expireAfter: Never
    consolidateAfter: Never
  template:
    metadata: {}
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - t
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - t3a.xlarge
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: evermos.com/serviceClass
        operator: In
        values:
        - jenkins-master
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "2"
        - "4"
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: jenkins-master
        value: "true"

NodeClass

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      volumeSize: 50Gi
      volumeType: gp3
  role: KarpenterNodeRole-evermos-dev
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: evermos-dev
  subnetSelectorTerms:
  - tags:
      Name: private-a-dev
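
One more thing worth ruling out: with amiFamily: AL2 and no amiSelectorTerms, this EC2NodeClass floats on the latest AL2 AMI, and each new AMI release drifts (and replaces) existing nodes. If that is the trigger, pinning the AMI stops it; a minimal sketch (the AMI ID below is a placeholder, not a recommendation):

spec:
  amiFamily: AL2
  amiSelectorTerms:
  - id: ami-0123456789abcdef0   # placeholder ID; pin to a specific, known-good AMI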

Versions:

rschalo commented 3 hours ago

How often are you seeing drift occur? Also, do you have Karpenter logs from when drift occurred? You should see something generated from: https://github.com/kubernetes-sigs/karpenter/blob/v0.37.3/pkg/controllers/nodeclaim/disruption/drift.go#L82
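
That code path also sets a Drifted status condition on the NodeClaim, so the condition itself is another place to look (kubectl get nodeclaim <name> -o yaml). An illustrative sketch of the shape; the exact reason string varies by version and by what drifted:

status:
  conditions:
  - type: Drifted
    status: "True"
    reason: NodePoolDrifted   # illustrative reason; the real value names what changed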