aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Disruption not pre-spinning? #6362

Closed · drawnwren closed this issue 2 months ago

drawnwren commented 3 months ago

Description

Observed Behavior:

We have a nodepool of GPU nodes that runs only one pod, with expireAfter: 3h (to try to get placed back onto spot if we end up on an on-demand node). We're seeing Karpenter take the node down every 3 hours and then schedule a new pod afterwards, so the first node fully terminates before the new node is started. We're also having trouble documenting this behavior reliably; we can see it when watching the pods with watch -n 5 kubectl get all. Is there an easier way to double-check what we're seeing?

Expected Behavior:

We expect Karpenter to "Pre-spin any replacement nodes needed as calculated in Step (2), and wait for them to become ready" every 3 hours.

Reproduction Steps (Please include YAML):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g5-family
spec:
  template:
    metadata:
      labels:
        xoul-ai: single-medium-gpu
    spec:
      nodeClassRef:
        name: g5-family
      # Define your provisioner spec here
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['g5.xlarge']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: 'NoSchedule'
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 3h
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g5-family
  annotations:
    kubernetes.io/description: 'Medium, single GPU instance for running background models'
spec:
  amiFamily: AL2
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 150Gi
        volumeType: gp3
        encrypted: true
  role: 'KarpenterNodeRole-{{ .Values.settings.clusterName }}'
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: '{{ .Values.settings.clusterName }}'
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: '{{ .Values.settings.clusterName }}'

Versions:

jmdeal commented 3 months ago

Just to clarify what Karpenter's pre-spin does: Karpenter will launch a replacement node and wait for it to become ready before beginning to drain the disrupted node. However, when Karpenter drains a node, it does not launch a new pod on the replacement node and wait for that pod to become ready before evicting the previous pod. The pre-spin exists to make this downtime as short as possible, but there will still be some downtime. Karpenter does respect PDBs, which should be configured to ensure high availability. We might be on the same page already, but I wanted to clarify since you mentioned you can see this by watching your pods, where I would expect temporary downtime.
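
As a rough way to confirm whether a replacement is actually pre-spun, watching Karpenter's NodeClaims and the controller logs during an expiration should show the new NodeClaim becoming ready before the old node starts draining. This is only a sketch; the karpenter namespace and deployment name below are assumptions, so adjust them to wherever Karpenter is installed in your cluster.

# A new NodeClaim should appear and reach Ready before the expired node drains.
kubectl get nodeclaims,nodes -o wide -w

# The controller logs record disruption decisions and replacement launches.
kubectl -n karpenter logs deploy/karpenter -f | grep -i disrupt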

If the replacement node isn't initialized before we begin terminating the original node, could you share your logs and Karpenter version?

drawnwren commented 3 months ago

We didn't have PDBs configured on our pods. I've added them now, but we did have HPAs w/ min: 1. Would the lack of a PDB explain our issue?

drawnwren commented 3 months ago

And yes, the behavior we were seeing was that a new pod would not be scheduled until the old pod had completely terminated.

jmdeal commented 3 months ago

Yep, that's expected behavior. Karpenter doesn't actually create any new pods when it terminates a node; it evicts all pods running on the node using the Eviction API. Once those pods are evicted, whatever is responsible for managing their lifetime may create new pods in response (e.g. the ReplicaSet controller). The Eviction API does respect PDBs, though, so creating a PDB with minAvailable: 1 may meet your requirements.
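
For reference, here's a minimal sketch of such a PDB. The name and the app: background-model selector are hypothetical; use whatever labels your Deployment's pods actually carry.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-medium-gpu-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: background-model # hypothetical pod label; match your Deployment's pod labels

Keep in mind that with a single replica, minAvailable: 1 leaves zero allowed disruptions, so the eviction (and therefore the node drain) will be blocked until a second ready replica exists.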

github-actions[bot] commented 3 months ago

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.