aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

How to tear down cluster? (Can Karpenter use an ASG, or be a TF provider?) #5727

Open benjimin opened 4 months ago

benjimin commented 4 months ago

Description

What problem are you trying to solve?

If using infra-as-code (e.g. terraform/opentofu) to provision a cluster (e.g. EKS) that uses Karpenter, how should the cluster be torn down without leaving orphaned node instances?

Would it be possible to configure Karpenter to attach all of its instances to an existing ASG, so that terraform can remove the entire cluster (including the node instances) by deleting that ASG, which would automatically cause AWS to terminate and clean up those instances?
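For context, the mechanics of that proposal in AWS CLI terms would be roughly as follows; this is only a sketch of the idea, not existing Karpenter behaviour, and the group name is hypothetical (the ASG's max size would also need headroom for every attached instance):

```sh
# After Karpenter launches an instance, attach it to a cluster-scoped ASG.
aws autoscaling attach-instances \
  --auto-scaling-group-name "karpenter-${CLUSTER_NAME}" \
  --instance-ids "$INSTANCE_ID"

# Later, deleting that ASG with --force-delete terminates everything still in it,
# which is what terraform effectively does when the ASG resource is destroyed.
aws autoscaling delete-auto-scaling-group \
  --auto-scaling-group-name "karpenter-${CLUSTER_NAME}" --force-delete
```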

Alternatively, is there interest in maintaining a Karpenter TF provider that configures Karpenter to tag its EC2 instances distinctively enough that the provider can clean up the instances (along with associated volumes, etc.) when TF deprovisions the cluster? (Note that the existing AWS Karpenter terraform module does not perform any deprovisioning of instances.)

At a minimum, I think the docs should explicitly discuss the cluster teardown process and the options for achieving it in an infra-as-code context.

How important is this feature to you?

Our previous setup used cluster-autoscaler with ASGs. Since the ASGs were provisioned by terraform (e.g. in the same module that provisions the EKS cluster or control plane), each ASG gets deleted when the cluster gets deleted, and so the instances get cleaned up automatically.

This setup meant we could use terraform, in CI pipelines, to dynamically spin up a new cluster (e.g. for testing) and to tear it down again cleanly. We did not need a separate tool (such as a lambda, as suggested in #1134) to ensure the cleanup. Frankly, it would be difficult to implement such a lambda reliably and cleanly (including ensuring the lambda is not deprovisioned just when it is needed to perform its task).

This feature (either an ASG or a TF provider) could also help with bootstrapping a cluster (i.e. ensuring an initial node is available for running Karpenter itself). It is difficult to imagine how to clean up a cluster without it, as Karpenter presumably has no internal means to deprovision the last node, the one Karpenter itself runs on. (It is also inconvenient and fault-prone if cluster deprovisioning expects k8s API operations, such as drains or Karpenter config updates, to occur in coordination with cloud API operations such as EKS deprovisioning.)

A TF provider would perhaps be the superior solution: it would do the most to simplify the cluster bootstrapping instructions (whereas coordinating setup of an ASG would be fiddly), closely mirror the bootstrapping of flux (another core k8s infra-as-code component), avoid any race condition (of the karpenter pod being killed between launching an instance and attaching it to an ASG), and be fairly easy to implement and distribute.


jonathan-innis commented 4 months ago

When you are orchestrating the teardown of a cluster, what's the order of operations that you are executing to get it fully spun down? For newer versions of Karpenter, NodeClaims are attached to the instance launch path, which means that when the NodeClaims are deleted, they won't be fully removed until the instance is terminating. If you have your cluster wait for the deletion of the NodeClaims before it is spun down, that should avoid leaking instances.
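For example, a minimal sketch of that wait with kubectl, assuming the v1beta1 resource names and that NodeClaims are owned by (and cascade-deleted with) their NodePools:

```sh
# Delete the NodePools, which cascade-deletes their NodeClaims, then block
# until every NodeClaim is gone, i.e. its EC2 instance is terminating.
kubectl delete nodepools --all
kubectl wait --for=delete nodeclaims --all --timeout=15m
```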

My assumption is that you are deleting the Karpenter installation and the cluster simultaneously and there is currently no blocking operation against the cluster for Karpenter's deletion like there would be with MNG/ASGs.

jonathan-innis commented 4 months ago

I'd also agree that we should probably add this to the "Delete the Cluster" section in the "Getting Started Guide" to make it clearer what needs to be done to avoid orphaned instances here.

Realistically, I think that the teardown process should look like:

  1. Delete all the CRDs associated with Karpenter on your cluster; this should cascade-delete the NodePools and NodeClaims that you have present, which will cause the Nodes to be drained gracefully and the instances to be terminated once the nodes have finished draining.
  2. Delete the Karpenter installation.
  3. Delete the cluster.

As long as you delete the CRDs while the Karpenter installation is still running and before you start terminating the cluster, this should avoid leaking any EC2 instances into your account.
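A rough shell sketch of that ordering; the CRD names assume a recent v1beta1 install, and the Helm release/namespace and the final destroy command are assumptions to adapt to your own setup:

```sh
# 1. Delete the Karpenter CRDs; this cascade-deletes NodePools/NodeClaims,
#    drains the nodes, and blocks until the instances are being terminated.
kubectl delete crd nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws

# 2. Remove the Karpenter installation (release and namespace names assumed).
helm uninstall karpenter --namespace karpenter

# 3. Tear down the cluster itself, e.g. via your IaC tooling.
terraform destroy
```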

jonathan-innis commented 4 months ago

Maybe also take a look at https://github.com/aws-samples/karpenter-blueprints or https://aws-ia.github.io/terraform-aws-eks-blueprints/patterns/karpenter/ and see if either of those provides enough substance to stand up a Karpenter installation (and deprovision it) using a TF provider.

benjimin commented 4 months ago

The way terraform/opentofu is used is that you define (in a universal configuration language) what infrastructure you want to exist (e.g. an EKS cluster, along with subnets, load balancers, security controls, etc.). When you "apply" this definition, terraform creates those resources (if they do not already exist). To deprovision those cloud resources you either update the definition (to omit the cluster) and re-apply, or run a terraform destroy (equivalent to applying a blank definition). Note that terraform will only decommission remote infrastructure tracked in the same terraform state, i.e. resources it recognises as having been deployed by that configuration.
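To illustrate the workflow (a sketch; the directory layout is hypothetical):

```sh
# Create whatever is declared in ./infra (EKS cluster, subnets, etc.) if absent.
terraform -chdir=infra apply

# Remove everything tracked in this configuration's state. Instances launched
# by Karpenter are not in that state, so they are neither drained nor terminated.
terraform -chdir=infra destroy
```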

Commonly, this definition lives in a git repo, with terraform invoked from CI. It is also common for the terraform definition to specify that fluxCD be installed in the cluster, so that flux will then sync k8s resources (such as namespaces, deployments, services, etc.) to the cluster from yaml manifests in another git repo. (Incidentally, flux support for deletion of CRDs is very limited.) This is how infrastructure as code is generally achieved in the context of kubernetes, and it is cloud-provider agnostic.

@jonathan-innis both of those sample patterns/blueprints do not use terraform to clean up the EC2 instances created by karpenter. One of them advises CRD deletions like you did, the other advises deleting all workloads. There are two basic problems with these approaches:

  1. They require k8s API operations (CRD or workload deletions, drains) to be sequenced in coordination with the cloud-side deprovisioning, which does not fit the terraform destroy workflow and is fault-prone in CI.
  2. They rely on the karpenter pod (and the node it runs on) surviving long enough to finish terminating every other instance, even though karpenter has no means to deprovision that final node itself.

For cluster-autoscaler this was never a problem, because the instances always belonged to cluster-specific ASGs (so deprovisioning those ASGs would automatically clean up the instances too). Without ASGs, the obvious solution is a Karpenter TF provider, because a provider is terraform's mechanism for encapsulating a cleanup routine and sequencing it to run when the user deprovisions the corresponding configuration resource (in this case the resource representing the karpenter installation and the cluster control plane it depends on).
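A rough sketch of the cleanup such a provider would encapsulate on delete; the tag keys here are assumptions (Karpenter's instance tags vary by version), and this is illustrative rather than an existing implementation:

```sh
# Find instances that Karpenter launched for this cluster (tag keys assumed)
# and terminate them, so nothing is orphaned once the control plane is gone.
ids=$(aws ec2 describe-instances \
  --filters "Name=tag:karpenter.sh/nodepool,Values=*" \
            "Name=tag:kubernetes.io/cluster/${CLUSTER_NAME},Values=owned" \
            "Name=instance-state-name,Values=pending,running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
[ -n "$ids" ] && aws ec2 terminate-instances --instance-ids $ids
```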

I'm also unclear how the first/last nodes are supposed to be managed (when bootstrapping and tearing down a cluster with karpenter nodes). Previously we would terraform an ASG with a minimum size of 1, which guaranteed at least one instance on which the scheduler could place the cluster-autoscaler pod, etc. Those karpenter blueprints appear to rely on also terraforming an MNG for the karpenter pod to initially run on. I think a karpenter TF provider would simplify bootstrapping (it's counter-intuitive to require an explicit MNG config in terraform when the point of karpenter is to eschew MNGs and ASGs) and help keep terraform configs robust against rare cloud outages (by avoiding duplication that must be kept in sync between an MNG config and the core karpenter config).