aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

EKS in-place upgrades without data loss or data transfer using Karpenter #7300

Open guibirow opened 2 weeks ago

guibirow commented 2 weeks ago

Description

What problem are you trying to solve?

I currently have several stateful applications and databases running on EC2 instances, using local instance storage with fast NVMe SSDs (i3en, i4g) to store hundreds of TBs of data. When we need to handle upgrades, we can simply upgrade the processes and packages inside the node and consider it up to date.

I am considering moving these workloads into EKS and managing the nodes with Karpenter. One of our requirements is to stay current with the latest EKS versions. The default upgrade mechanism on EKS creates new nodes instead of upgrading them in place and reusing them. That means our EKS clusters would be rotated multiple times a year, which is a disruptive and sensitive operation for stateful applications: each upgrade of each cluster would require moving hundreds of TB of data, and moving that much data slowly enough to avoid impacting cluster performance can take days.

We are evaluating options to mitigate node rotation by upgrading nodes in place without losing the data in the node's ephemeral storage. Unfortunately, EBS is not an option for this setup.
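
For reference, this is roughly how the setup described above maps onto Karpenter today. It is a minimal sketch only: the names, discovery tags, and IAM role are placeholders, and the AMI alias is just an example. The EC2NodeClass exposes the local NVMe disks via `instanceStorePolicy: RAID0`, and the NodePool is constrained to the i3en/i4g families:

```yaml
# Sketch with placeholder names/tags, not our production config.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: stateful-nvme
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # example alias; any AMI selector works
  role: KarpenterNodeRole-my-cluster    # placeholder IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Combine the local NVMe instance-store disks into a single RAID0 volume
  # that the kubelet and pods use for ephemeral storage.
  instanceStorePolicy: RAID0
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful-nvme
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: stateful-nvme
      requirements:
        # Restrict to the storage-dense instance families we run today.
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["i3en", "i4g"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```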

Is there any way to handle in-place node upgrades with Karpenter?

How important is this feature to you?

Not having this handled by Karpenter would require us to implement hacky workarounds to achieve in-place upgrades, which could cause inconsistency across nodes. We would prefer to avoid that, but I am keen to hear your approaches if they solve the problem.
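
One workaround we are considering is to stop Karpenter from ever rotating these nodes on its own, and handle upgrades out of band. This is only a sketch under assumptions (field values are placeholders): the EC2NodeClass would pin `amiSelectorTerms` to an explicit AMI ID so new AMI releases do not mark nodes as drifted, and the NodePool would disable drift-driven disruption and node expiry:

```yaml
# Sketch of a possible mitigation, not a recommendation. Assumes the
# EC2NodeClass pins amiSelectorTerms to an explicit AMI ID (id: ami-...)
# instead of an alias, so AMI releases never cause drift.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful-nvme
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: Never
    # Forbid voluntary disruption of drifted nodes, so an AMI or EKS version
    # bump never causes Karpenter to replace these nodes by itself.
    budgets:
      - reasons: ["Drifted"]
        nodes: "0"
  template:
    spec:
      # Do not rotate nodes on a timer either.
      expireAfter: Never
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: stateful-nvme
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["i3en", "i4g"]
```

The obvious downside is that kubelet/AMI upgrades then become our responsibility (SSM, config management, or similar), which is exactly the kind of out-of-band process we were hoping to avoid.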

engedaam commented 1 week ago

Would it be possible to close this issue as a duplicate of https://github.com/kubernetes-sigs/karpenter/issues/1123? It would be useful to have this write-up in that issue.