aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Nodes are stuck in `Instance is terminating` for more than 5 minutes #7019

Open gaelayo opened 1 week ago

gaelayo commented 1 week ago

Description

Observed Behavior: The instance takes more than 5 minutes to terminate. Not sure if this is a bug or to be expected, but this sounds quite long (especially since we run SpotToSpotConsolidation, which leads to a lot of volatility in our pods).

Expected Behavior: The node should be deleted quickly.

Reproduction Steps (Please include YAML): Karpenter deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "16"
    meta.helm.sh/release-name: karpenter
    meta.helm.sh/release-namespace: karpenter
  creationTimestamp: "2023-05-09T13:21:52Z"
  generation: 16
  labels:
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/version: 1.0.1
    helm.sh/chart: karpenter-1.0.1
  name: karpenter
  namespace: karpenter
  resourceVersion: "612913263"
  uid: ee206f21-41e2-4481-a159-b94f80842e68
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: karpenter
      app.kubernetes.io/name: karpenter
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: karpenter
        app.kubernetes.io/name: karpenter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: DoesNotExist
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: karpenter
                app.kubernetes.io/name: karpenter
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: KUBERNETES_MIN_VERSION
          value: 1.19.0-0
        - name: KARPENTER_SERVICE
          value: karpenter
        - name: WEBHOOK_PORT
          value: "8443"
        - name: WEBHOOK_METRICS_PORT
          value: "8001"
        - name: DISABLE_WEBHOOK
          value: "false"
        - name: LOG_LEVEL
          value: debug
        - name: METRICS_PORT
          value: "8080"
        - name: HEALTH_PROBE_PORT
          value: "8081"
        - name: SYSTEM_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MEMORY_LIMIT
          valueFrom:
            resourceFieldRef:
              containerName: controller
              divisor: "0"
              resource: limits.memory
        - name: FEATURE_GATES
          value: SpotToSpotConsolidation=true
        - name: BATCH_MAX_DURATION
          value: 10s
        - name: BATCH_IDLE_DURATION
          value: 1s
        - name: CLUSTER_NAME
          value: cluster-name
        - name: VM_MEMORY_OVERHEAD_PERCENT
          value: "0.075"
        - name: INTERRUPTION_QUEUE
          value: Karpenter-cluster-name
        - name: RESERVED_ENIS
          value: "0"
        image: public.ecr.aws/karpenter/controller:1.0.1@sha256:fc54495b35dfeac6459ead173dd8452ca5d572d90e559f09536a494d2795abe6
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        name: controller
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 8001
          name: webhook-metrics
          protocol: TCP
        - containerPort: 8443
          name: https-webhook
          protocol: TCP
        - containerPort: 8081
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources:
          limits:
            cpu: 250m
            memory: 1700Mi
          requests:
            cpu: 250m
            memory: 1700Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65532
      serviceAccount: karpenter
      serviceAccountName: karpenter
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: karpenter
            app.kubernetes.io/name: karpenter
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2023-07-05T11:27:38Z"
    lastUpdateTime: "2023-07-05T11:27:38Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-09-05T11:27:26Z"
    lastUpdateTime: "2024-09-16T14:00:08Z"
    message: ReplicaSet "karpenter-5cc6587844" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 16
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

When deleting a node, it quickly shows `Instance is terminating`, but then it takes more than 5 minutes to actually delete the node.

  Normal   DisruptionTerminating  5m38s                karpenter        Disrupting Node: Empty/Delete
  Warning  FailedDraining         5m38s                karpenter        Failed to drain node, 9 pods are waiting to be evicted
  Warning  InstanceTerminating    5m34s                karpenter        Instance is terminating
  Normal   NodeNotReady           4m54s                node-controller  Node ip-10-2-18-42.eu-west-1.compute.internal status is now: NodeNotReady
  Normal   DisruptionBlocked      89s (x3 over 5m38s)  karpenter        Cannot disrupt Node: state node is marked for deletion

AFAIK the 9 pods that are waiting to be evicted are DaemonSet pods (such as prometheus, the nvidia gpu plugin, nvidia NFD, GPF, and also aws-node, ebs-csi, ...).

Quite rapidly, the only 5 pods remaining on the node are aws-node, ebs-csi-node, kube-proxy, monitoring-prometheus-node-exporter, and node-problem-detector.

I enabled debug logging on Karpenter, but I cannot see anything related to the node except the following lines:

karpenter-5cc6587844-cmqzj {"level":"INFO","time":"2024-09-16T14:05:11.982Z","logger":"controller","caller":"termination/controller.go:103","message":"tainted node","commit":"62a726c","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-2-18-42.eu-west-1.compute.internal"},"namespace":"","name":"ip-10-2-18-42.eu-west-1.compute.internal","reconcileID":"6f51e1fb-2538-4fd3-b623-9614a1edd79b","taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}

<5 minutes later...>

karpenter-5cc6587844-cmqzj {"level":"INFO","time":"2024-09-16T14:11:25.451Z","logger":"controller","caller":"termination/controller.go:160","message":"deleted node","commit":"62a726c","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-2-18-42.eu-west-1.compute.internal"},"namespace":"","name":"ip-10-2-18-42.eu-west-1.compute.internal","reconcileID":"67cf99ae-4b57-479d-82d0-32412a5a174e"}
karpenter-5cc6587844-cmqzj {"level":"INFO","time":"2024-09-16T14:11:25.789Z","logger":"controller","caller":"termination/controller.go:79","message":"deleted nodeclaim","commit":"62a726c","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"private-provisioner-gpu-sphwd"},"namespace":"","name":"private-provisioner-gpu-sphwd","reconcileID":"de40ce45-d302-46aa-8165-734e2bfcdfbd","Node":{"name":"ip-10-2-18-42.eu-west-1.compute.internal"},"provider-id":"xxx"}

Versions:

njtran commented 5 days ago

Do you have PDBs or do-not-disrupt pods set up? Those would slow down the rate at which Karpenter can drain your node. In v1, we also wait for the instance to be fully terminated before removing the node/nodeclaim, so you might be seeing that as well. That ensures that all applications are cleaned up before we go ahead and deregister the node from the cluster.
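
For reference, a pod that opts out of disruption carries the karpenter.sh/do-not-disrupt annotation. A minimal sketch of what to look for (the pod name, namespace, and image below are placeholders, not taken from this cluster):

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # placeholder name
  namespace: default           # placeholder namespace
  annotations:
    # Karpenter will not voluntarily disrupt a node while this pod is running on it
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
  - name: app
    image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
    command: ["sleep", "infinity"]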

You may be interested in setting terminationGracePeriod (NodePool.Spec.Template.Spec.TerminationGracePeriod) to put a timeout on how long Karpenter can spend draining a node before it is forcibly cleaned up, as in the sketch below: https://karpenter.sh/docs/concepts/disruption/#terminationgraceperiod
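
A minimal sketch of a NodePool with that field set (the NodePool name, EC2NodeClass name, and the 30-minute value are illustrative assumptions, not taken from this cluster):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                  # placeholder NodePool name
spec:
  template:
    spec:
      # Upper bound on how long Karpenter will keep draining this NodePool's nodes
      # before force-terminating them
      terminationGracePeriod: 30m
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # placeholder EC2NodeClass name
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]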

gaelayo commented 4 days ago

I do not have PDBs on the 5 pods that remain on the node while it is "stuck" (aws-node, ebs-csi-node, kube-proxy, monitoring-prometheus-node-exporter, node-problem-detector). They are all from DaemonSets. However, some of these pods are in the priority class system-node-critical, if this matters.

Thank you for the link to TerminationGracePeriod, this may be what I end up using, though I would prefer to understand why some pods seem to be blocking draining. I am not even sure that "blocking draining" is the right phrase, because when I look at the logs, I see:

  Warning  FailedDraining       64s  karpenter  Failed to drain node, 9 pods are waiting to be evicted
  Warning  InstanceTerminating  51s  karpenter  Instance is terminating

As I understand it, this means that all pods were evicted after 13s and the instance was already terminating. Writing this, I wonder if this is just an issue with AWS taking too long to terminate the instance? I'll try to monitor the status of the AWS instance alongside Karpenter to see if the instance is really placed in a terminating state on the AWS side of things.