aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter always replaces all nodes at 11:50 PM (UTC) #6325

Open ariretiarno opened 1 month ago

ariretiarno commented 1 month ago

Description

Observed Behavior: Karpenter always replaces all of the nodes in the NodePool at 11:50 PM (UTC), which takes my app down. Even after changing the NodePool disruption settings to consolidationPolicy: WhenEmpty and consolidateAfter: Never, Karpenter still replaces all of the nodes.

Expected Behavior: Karpenter does not replace all of the nodes at once.

Reproduction Steps (Please include YAML):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "11054168974092351872"
    karpenter.sh/nodepool-hash-version: v1
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: arm-node
spec:
  disruption:
    consolidateAfter: Never
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      kubelet:
        evictionHard:
          memory.available: 2%
          nodefs.available: 10%
          nodefs.inodesFree: 10%
        evictionMaxPodGracePeriod: 1200
        evictionSoft:
          memory.available: 5%
          nodefs.available: 15%
          nodefs.inodesFree: 15%
        evictionSoftGracePeriod:
          memory.available: 1m
          nodefs.available: 1m30s
          nodefs.inodesFree: 2m
        maxPods: 60
      nodeClassRef:
        name: default-30g
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - m
        - r
        - t
      - key: node.kubernetes.io/instance-type
        operator: NotIn
        values:
        - t4g.medium
        - r6g.xlarge
        - t4g.micro
        - t4g.small
        - t4g.nano
        - c7g.xlarge
        - m7i-flex.xlarge
        - m7i-flex.large
        - m7i.large
        - m7i.xlarge
        - r7i.large
        - r7i.xlarge
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      - key: evermos.com/serviceClass
        operator: In
        values:
        - arm-node
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "4"
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: karpenter.k8s.aws/instance-generation
        operator: NotIn
        values:
        - "7"
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: arm
        value: "true"

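If the goal is simply to stop Karpenter from rolling nodes in that window, a disruption budget on this NodePool can block voluntary disruptions around 11:50 PM UTC. This is a minimal sketch, assuming a Karpenter version that supports spec.disruption.budgets (v0.34+); the schedule and duration values are illustrative, and budgets do not block involuntary events such as Spot interruptions:

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: Never
    expireAfter: Never
    budgets:
    # Allow zero voluntary disruptions (consolidation, drift, expiration)
    # between 11:30 PM and 12:30 AM UTC.
    - nodes: "0"
      schedule: "30 23 * * *"
      duration: 1h
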
EC2NodeClass

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "17085163873928377628"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v1
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: default-30g
spec:
  amiFamily: AL2
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      encrypted: true
      volumeSize: 30Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-evermos-prod
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: evermos-prod
  subnetSelectorTerms:
  - tags:
      Name: Zone-A

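One common reason an entire NodePool rolls at roughly the same time is drift: with amiFamily: AL2 and no amiSelectorTerms, the EC2NodeClass resolves to the latest EKS-optimized AL2 AMI, and a new AMI release marks every node as drifted. If the logs point at drift, pinning the AMI in the EC2NodeClass is one way to control when that roll happens. A sketch, with a placeholder AMI ID:

  amiSelectorTerms:
  # Hypothetical AMI ID: pin to a specific EKS-optimized AL2 arm64 AMI
  # and bump it deliberately instead of tracking the latest release.
  - id: ami-0123456789abcdef0
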
Logs

(screenshot of Karpenter launch logs)

Versions:

jonathan-innis commented 1 month ago

Do you have logs that you can share from around when the terminations happen, showing what Karpenter is "marking the node as" when it is rolling them? Are the nodes getting marked as drifted? Are they expiring? Tough to tell from what you shared above since those are only the launch logs; a more verbose, longer log dump would help out a lot here.

github-actions[bot] commented 1 week ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.