aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

after some time Karpenter removes node that has terminationGracePeriodSeconds=12hr and safe-to-evict: "false" #6394

Open qxmips opened 1 week ago

qxmips commented 1 week ago

Description

Observed Behavior: We use a KEDA-scaled deployment that can process messages for up to 12 hours, so the pod has terminationGracePeriodSeconds: 43200. To avoid node disruption by Karpenter, the pod also carries the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation.

While the pod was running, Karpenter could not disrupt it, as expected (karpenter DisruptionBlocked: Cannot disrupt Node). However, when the autoscaler later scaled the deployment down, the pod entered the Terminating state while continuing to work, thanks to the termination grace period. (All good so far.)

Karpenter then reported karpenter: FailedDraining: Failed to drain node, 5 pods are waiting to be evicted, but the node remained Ready and the pod continued running, waiting either to shut down gracefully after completing its job or for terminationGracePeriodSeconds to elapse.

After approximately 6-7 hours, however, Karpenter forcefully removed the node, leaving the pod orphaned with the message "Found orphaned Pod assigned to the Node, deleting.", which was not expected.

Expected Behavior: Karpenter should respect a long terminationGracePeriodSeconds. I didn't find any documented timeouts that could be configured to mitigate this behavior.

Reproduction Steps (Please include YAML):

```
kubectl create ns test
kubectl scale deploy test --replicas=0 -n test

kubectl get po -n test
# test-bf8f7bb5f-dvlx2   1/1     Terminating   0          3m10s
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: test
spec:
  progressDeadlineSeconds: 600
  replicas: 0
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        karpenter.sh/do-not-disrupt: "true"
      labels:
        app: test
    spec:
      containers:
        - name: test
          image: busybox
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh", "-c", 'trap "echo Termination signal received, sleeping for $PRETEND_WORKING seconds; sleep $PRETEND_WORKING; exit" TERM; echo Running; while true; do sleep 10; done']
          env:
            - name: PRETEND_WORKING
              value: "21600"
          resources:
            requests:
              cpu: "6"
              memory: 13000Mi
      nodeSelector:
        kubernetes.io/os: linux
        karpenter.k8s.aws/instance-family: c6a
      restartPolicy: Always
      terminationGracePeriodSeconds: 43200 #6hrs , 43200 - 12hrs
```
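The container entrypoint relies on a shell trap to keep working after SIGTERM instead of dying. A minimal standalone sketch of that pattern (the `status` variable is illustrative, and PRETEND_WORKING is shortened to 0 so the demo returns immediately):

```shell
# Sketch of the trap-based graceful-shutdown pattern from the manifest above.
PRETEND_WORKING=0
status=running
trap 'status=drained' TERM   # on SIGTERM, mark work as draining instead of exiting
kill -TERM $$                # simulate the kubelet delivering SIGTERM to the pod
sleep "$PRETEND_WORKING"     # stand-in for finishing the in-flight job
echo "$status"
```

This is exactly why the pod keeps running in the Terminating state: the kubelet sends SIGTERM, the trap fires, and the process only exits once its work (the long sleep in the real manifest) completes or terminationGracePeriodSeconds elapses.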
```yaml
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: al2023
spec:
  amiFamily: AL2023
  role: <>
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 150Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: <>
    - tags:
        zone: private
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: <>
    - tags:
        aws:eks:cluster-name: <>
  tags:
    karpenter.sh/discovery: <>
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-general-amd64
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["8", "16", "32"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
```
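As an aside, and worth verifying against the docs for your Karpenter version: the newer v1 NodePool API adds a node-level `terminationGracePeriod` field under `spec.template.spec`, which sets an explicit upper bound on how long Karpenter will wait for a draining node before forcefully terminating it. A sketch of what that might look like for this use case:

```yaml
# Sketch only (v1 API, not the v1beta1 resources above): terminationGracePeriod
# caps how long a draining node may stay alive before forceful termination.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-general-amd64
spec:
  template:
    spec:
      terminationGracePeriod: 12h   # allow up to 12h for pods to finish draining
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```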

In CloudTrail you will see that the node is terminated by Karpenter after some time.

And in the EKS control-plane logs in CloudWatch:

```
03:49:10.074952 garbagecollector.go:533] "Processing item" item="[coordination.k8s.io/v1/Lease, namespace: kube-node-lease, name: ip-10-11-130-134.ec2.internal, uid: af195770-d0ea-49a1-8a1b-ebdc27b8d49b]" virtual=false
```

When a node is deleted or becomes unresponsive, its associated Lease object may be cleaned up by the garbage collector. The processing of this Lease object indicates that the garbage collector is handling the cleanup of resources related to the node ip-10-11-130-134.ec2.internal.

```
03:49:10.076220      10 node_tree.go:79] "Removed node in listed group from NodeTree" node="ip-10-11-130-134.ec2.internal" zone="us-east-1:\x00:us-east-1c"

03:49:14.040141      11 node_lifecycle_controller.go:685] "Controller observed a Node deletion" node="ip-10-11-130-134.ec2.internal"
03:49:14.040171      11 controller_utils.go:173] "Recording event message for node" event="Removing Node ip-10-11-130-134.ec2.internal from Controller" node="ip-10-11-130-134.ec2.internal"
03:50:00.249132      11 gc_controller.go:154] failed to get node ip-10-11-130-134.ec2.internal : node "ip-10-11-130-134.ec2.internal" not found
03:50:00.357520      11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="prod/hdr-service-app-c9cdb8dbf-w2hr2" node="ip-10-11-130-134.ec2.internal"
03:50:00.357529      11 gc_controller.go:335] "PodGC is force deleting Pod" pod="prod/hdr-service-app-c9cdb8dbf-w2hr2"
03:55:09.629854      11 node_status_updater.go:91] "Could not update node status. Failed to find node in NodeInformer cache" node="ip-10-11-130-134.ec2.internal" err="node \"ip-10-11-130-134.ec2.internal\" not found"
```

Versions:

jigisha620 commented 1 week ago

Do you happen to have a more complete set of Karpenter controller logs from when this happened?

qxmips commented 1 week ago

> Do you happen to have a more complete set of Karpenter controller logs from when this happened?

Not much info, except for the messages showing that the node got deleted:

```
22:18:57.334 disrupting via consolidation delete, terminating 1 nodes (1 pods) ip-10-11-130-134.ec2.internal/m5a.2xlarge/on-demand
...
03:49:10.814 deleted node
```


jigisha620 commented 1 week ago

From the logs it seems like the pod that was orphaned is prod/hdr-service-app-c9cdb8dbf-w2hr2. Just wanted to confirm that the pod spec you shared is the same for this pod, since the deployment in it is called test.

qxmips commented 1 week ago

> From the logs it seems like the pod that was orphaned is prod/hdr-service-app-c9cdb8dbf-w2hr2. Just wanted to confirm that the pod spec you shared is the same for this pod, since the deployment in it is called test.

Yeah, sorry. The attached log was from the original occurrence of the issue with a production service, but prod/hdr-service-app-c9cdb8dbf-w2hr2 had the same problem. Here are the logs from the reproduction test:

```
I0622 04:08:07.894293      10 node_tree.go:79] "Removed node in listed group from NodeTree" node="ip-10-11-57-209.ec2.internal" zone="us-east-1:\x00:us-east-1a"
E0622 04:08:15.200303      11 gc_controller.go:154] failed to get node ip-10-11-57-209.ec2.internal : node "ip-10-11-57-209.ec2.internal" not found
I0622 04:09:15.225784      11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="kube-system/aws-node-d47hp" node="ip-10-11-57-209.ec2.internal"
I0622 04:09:15.281401      11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="test/test-5ccdb7cd7f-dm9bq" node="ip-10-11-57-209.ec2.internal"
I0622 04:09:15.303497      11 gc_controller.go:246] "Found orphaned Pod assigned to the Node, deleting." pod="kube-system/ebs-csi-node-xt68r" node="ip-10-11-57-209.ec2.internal"
```
jigisha620 commented 1 week ago

> `terminationGracePeriodSeconds: 43200 #6hrs , 43200 - 12hrs`

The deployment you shared has this. Is there a reason the comment says 6 hours? Was terminationGracePeriodSeconds set to 6 hours or 12?

qxmips commented 1 week ago

> `terminationGracePeriodSeconds: 43200 #6hrs , 43200 - 12hrs`
>
> The deployment you shared has this. Is there a reason the comment says 6 hours? Was terminationGracePeriodSeconds set to 6 hours or 12?

terminationGracePeriodSeconds is set to 12 hours (43200 seconds). We need it because, on rare occasions, the pod can't be interrupted for up to 12 hours.

jigisha620 commented 1 week ago

I believe at this point it would make sense to go over the cluster audit logs to see if there's something that indicates what went wrong. Do you mind opening a support ticket to facilitate this?

qxmips commented 6 days ago

> I believe at this point it would make sense to go over the cluster audit logs to see if there's something that indicates what went wrong. Do you mind opening a support ticket to facilitate this?

Sorry, what kind of support ticket do you mean? We don't have AWS Premium Support. I believe this behavior can be reproduced on any cluster.