aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Pause/Resume Karpenter nodes for cost-savings #7010

Closed. JohnPolansky closed this issue 3 weeks ago

JohnPolansky commented 1 month ago

Description

What problem are you trying to solve? We currently use Karpenter to manage around 15 clusters, but only 5 of them really need to run 24/7. The rest are used for development and testing, and we are trying to reduce costs by shutting them down when they are not needed, for example outside business hours. It is possible to scale workloads to --replicas 0 so that Karpenter eventually removes the nodes that are no longer required, but that means updating dozens of Deployments/StatefulSets, and scaling back up requires knowing the right replica count for each one. What we are looking for is a way to simply tell Karpenter to "pause" and shut down all nodes until we "resume" it.
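For context, the manual workaround today looks roughly like this (the namespace, workload name, and replica counts below are just placeholders):

# Scale every workload in a namespace to zero, repeated per namespace and workload type
kubectl scale deployment --all --replicas=0 -n <namespace>
kubectl scale statefulset --all --replicas=0 -n <namespace>

# Resuming means knowing and restoring the original replica count for each workload
kubectl scale deployment <name> --replicas=<original-count> -n <namespace>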

As an example for AWS node-groups you could use a command like:

aws eks update-nodegroup-config \
  --cluster-name <cluster-name> \
  --nodegroup-name <nodegroup-name> \
  --scaling-config minSize=0,maxSize=3,desiredSize=0

This causes the AWS node-group to shut down all of its nodes, and the Deployments/StatefulSets that should be running go into a Pending state. Is there a Karpenter feature that does something similar? Removing all the nodes eliminates roughly 90% of the cluster cost, so this seems like it could be a valuable feature.

Having a scheduled feature for this would be great too, but even a manual command we could build a cronjob around would be handy.

How important is this feature to you?

dcherniv commented 1 month ago

Not natively, but you can do something like this:

{{- if .Values.scaledown.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: scale
  namespace: karpenter
data:
  scale.sh: |
    #!/bin/sh

    NODEPOOL_PREFIX={{ .Values.scaledown.nodepoolPrefix }}

    if [ "$1" = "down" ] ; then
      echo "Patching nodepools to scale down to 0"
      echo "Only considering pools named default for scale down"
      for i in $(kubectl get nodepools --no-headers -o NAME | grep $NODEPOOL_PREFIX) ; do
        kubectl patch $i --type merge --patch '{"spec": {"limits": {"cpu": "0"}}}'
      done
      kubectl delete nodeclaims --all &
      echo "Waiting for claims to be deleted... Sleeping for 300 seconds"
      sleep 300
      echo "Removing straggler pods that block node deletions"

      kubectl get pods --no-headers -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name | grep -v karpenter |\
      while read -r namespace pod; do
        kubectl delete pod --grace-period=0 --force -n "$namespace" "$pod"
      done
    fi

    if [ "$1" = "up" ] ; then
      for i in $(kubectl get nodepools --no-headers -o NAME | grep $NODEPOOL_PREFIX) ; do
        #TODO: figure out how to templatize the upper limit.
        #      its always 1000 in dev but in prod, its different
        echo "Patching nodepools to scale up to 1000"
        kubectl patch $i --type merge --patch '{"spec": {"limits": {"cpu": "{{ .Values.scaledown.originalNodepoolSize}}"}}}'
      done
      echo "Waiting for claims to be created... Sleeping for 300 seconds"
      sleep 300
      echo "Deleting all pods to force fair scheduling"
      kubectl get pods --no-headers -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name | grep -v karpenter |\
      while read -r namespace pod; do
        kubectl delete pod --grace-period=0 --force -n "$namespace" "$pod"
      done
    fi
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-nodepools
  namespace: karpenter
spec:
  schedule: "{{ .Values.scaledown.cronjob.downSchedule }}"
  {{- if .Values.scaledown.cronjob.timeZone }}
  timeZone: "{{ .Values.scaledown.cronjob.timeZone}}"
  {{- end }}
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: system-node-critical
          serviceAccountName: karpenter
          tolerations:
            - key: CriticalAddonsOnly
              operator: Exists
            - key: "arch"
              operator: "Equal"
              value: "arm64"
              effect: "NoSchedule"
          volumes:
          - name: config
            configMap:
              name: scale
              defaultMode: 0777

          containers:
          - name: kubectl
            image: {{ .Values.scaledown.cronjob.image }}
            imagePullPolicy: IfNotPresent

            volumeMounts:
            - name: config
              mountPath: "/scripts"

            command:
            - /bin/sh
            - -c
            - /scripts/scale.sh down

          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-nodepools
  namespace: karpenter
spec:
  schedule: "{{ .Values.scaledown.cronjob.upSchedule }}"
  {{- if .Values.scaledown.cronjob.timeZone }}
  timeZone: "{{ .Values.scaledown.cronjob.timeZone }}"
  {{- end }}
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: system-node-critical
          serviceAccountName: karpenter
          tolerations:
            - key: CriticalAddonsOnly
              operator: Exists
            - key: "arch"
              operator: "Equal"
              value: "arm64"
              effect: "NoSchedule"
          volumes:
          - name: config
            configMap:
              name: scale
              defaultMode: 0777

          containers:
          - name: kubectl
            image: {{ .Values.scaledown.cronjob.image }}
            imagePullPolicy: IfNotPresent

            volumeMounts:
            - name: config
              mountPath: "/scripts"

            command:
            - /bin/sh
            - -c
            - /scripts/scale.sh up

          restartPolicy: OnFailure
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: karpenter-pod-admin
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: karpenter-pod-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: karpenter-pod-admin
subjects:
- kind: ServiceAccount
  name: karpenter
  namespace: {{ .Release.Namespace }}
{{- end }}
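The values this template references (scaledown.enabled, scaledown.nodepoolPrefix, scaledown.originalNodepoolSize, the cronjob schedules and image) aren't shown here; wiring them up could look something like the following, where the chart path, release name, schedules, and image are purely examples:

helm upgrade --install karpenter-scaledown ./my-chart \
  --namespace karpenter \
  --set scaledown.enabled=true \
  --set scaledown.nodepoolPrefix=default \
  --set scaledown.originalNodepoolSize=1000 \
  --set scaledown.cronjob.downSchedule="0 19 * * 1-5" \
  --set scaledown.cronjob.upSchedule="0 7 * * 1-5" \
  --set scaledown.cronjob.timeZone="UTC" \
  --set scaledown.cronjob.image=bitnami/kubectl:latest
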
njtran commented 1 month ago

Could you just delete and re-apply your NodePools? If all of your NodePools are deleted or deleting, Karpenter won't have anywhere to launch capacity, and eventually all of the NodeClaims will be garbage collected, drained, and deleted. You wouldn't have to scale down your deployments, and you could simply spin up your NodePools again when you're ready to re-enable compute provisioning.
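For example, assuming your NodePool manifests live in source control (the file name here is just an example):

# Pause: remove the NodePools so Karpenter has nowhere to launch new capacity
kubectl delete nodepools --all

# Resume: re-apply the original manifests and Karpenter starts provisioning again
kubectl apply -f nodepools.yaml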

JohnPolansky commented 1 month ago

@njtran Hmm, I did try deleting the nodepools in one attempt, but I didn't think that would have enough impact on its own and I didn't wait long. I could give it a try and see how that works. By "spin up your nodepools" to re-enable, I assume you mean running kubectl apply -f nodepool.yaml to re-apply them?

@dcherniv Thanks for your detailed option. I need to take some time to work out how it works and try it out, but it looks promising.

JohnPolansky commented 3 weeks ago

Hey all, first I wanted to say thanks for the various ideas; they were great. In the end we went with a broadly similar solution that so far appears to be working very well for us. I can't say it will work for everyone, because it assumes our particular setup and the way we use a node-group alongside Karpenter.

In our case we use an EKS node-group whose AWS auto-scaling group is set to 5 nodes; it hosts core services like Karpenter, CoreDNS, etc. Karpenter is then responsible for turning up nodes for everything else.

Our solution was this (a rough consolidated sketch follows the list):

  1. aws-cli to set the scaling-group to 0 (effectively killing Karpenter so it can't turn up new nodes)
  2. Then we wait for those scaling-group nodes to terminate, by checking a label attached to them
  3. Then we loop through the remaining nodes:
  4. kubectl drain "${NODE}" --force --ignore-daemonsets --delete-emptydir-data --disable-eviction --grace-period 0
  5. kubectl patch node "${NODE}" -p '{"metadata":{"finalizers":[]}}' --type=merge
  6. kubectl delete node "${NODE}"
  7. At this point the nodes are deleted from Kubernetes, but the EC2 instances are still running in AWS
  8. aws ec2 terminate-instances --no-cli-pager --profile "${SCRIPT_AWS_ACCOUNT_NAME}" --region "${SCRIPT_AWS_REGION}" --instance-ids "$(aws ec2 describe-instances --profile "${SCRIPT_AWS_ACCOUNT_NAME}" --region "${SCRIPT_AWS_REGION}" --filters Name=tag:Name,Values="${NODE}" --query 'Reservations[].Instances[].InstanceId' --output text)"
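Putting the pieces together, the shutdown side looks roughly like the sketch below. The cluster/node-group names and the node-group label selector are placeholders, and the --profile/--region flags from our real script are omitted:

#!/usr/bin/env bash
set -eu

CLUSTER="<cluster-name>"
NODEGROUP="<core-nodegroup-name>"

# 1. Scale the core node-group (where Karpenter itself runs) down to zero
aws eks update-nodegroup-config \
  --cluster-name "$CLUSTER" --nodegroup-name "$NODEGROUP" \
  --scaling-config minSize=0,maxSize=1,desiredSize=0

# 2. Wait for the node-group nodes to disappear (this label selector is an assumption)
while kubectl get nodes -l eks.amazonaws.com/nodegroup="$NODEGROUP" --no-headers 2>/dev/null | grep -q . ; do
  sleep 30
done

# 3-8. Drain and delete the remaining (Karpenter-managed) nodes, then terminate their EC2 instances
for NODE in $(kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name); do
  kubectl drain "${NODE}" --force --ignore-daemonsets --delete-emptydir-data --disable-eviction --grace-period 0
  kubectl patch node "${NODE}" -p '{"metadata":{"finalizers":[]}}' --type=merge
  kubectl delete node "${NODE}"
  INSTANCE_ID=$(aws ec2 describe-instances \
    --filters Name=tag:Name,Values="${NODE}" \
    --query 'Reservations[].Instances[].InstanceId' --output text)
  aws ec2 terminate-instances --no-cli-pager --instance-ids "${INSTANCE_ID}"
done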

Then, to resume (see the sketch after this list), we simply:

  1. aws-cli to set the node-group back to 5
  2. The nodes come up and karpenter starts automatically
  3. Karpenter then sees all the PENDING pods and starts them up
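And the matching resume, with the same placeholder names:

# Scale the core node-group back up; Karpenter restarts and schedules the Pending pods
aws eks update-nodegroup-config \
  --cluster-name "$CLUSTER" --nodegroup-name "$NODEGROUP" \
  --scaling-config minSize=5,maxSize=5,desiredSize=5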

For us this solution appears to be working well, but I do want to stress two things.

  1. This solution doesn't allow for a safe shutdown of pods; it effectively kills them immediately, and some applications may not respond well to that (a gentler drain variant is sketched below).
  2. As mentioned above, this all relies on our scaling-group setup for Karpenter; other setups may not be able to use it directly.
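If graceful termination matters more than speed, a gentler variant of the drain step (not what we run, just an option) would honor pod grace periods and PodDisruptionBudgets:

# Gentler drain: respects terminationGracePeriodSeconds and PDBs, at the cost of a slower shutdown
kubectl drain "${NODE}" --ignore-daemonsets --delete-emptydir-data --timeout=10m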

Hope this helps someone and thanks!