aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0
6.68k stars 934 forks source link

fix: default to multizone karpenter deployment #6354

Closed rschalo closed 3 months ago

rschalo commented 3 months ago

Fixes #N/A

Description Change default behavior so that both replicas cannot schedule to the same availability zone.

How was this change tested?

Does this change impact docs?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify[bot] commented 3 months ago

Deploy Preview for karpenter-docs-prod canceled.

Name Link
Latest commit 5525628c4ac860dcfba53f91b63712d5b9d5d1a8
Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/6669ed525494310008768b23
coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 9488069911

Details


Files with Coverage Reduction New Missed Lines %
pkg/providers/amifamily/ami.go 1 90.56%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 9475149132: -0.01%
Covered Lines: 5550
Relevant Lines: 6730

💛 - Coveralls
jmdeal commented 3 months ago

I'm on board with changing this to be the default, but should we surface a configuration option in the helm chart to override this default? If we did this from the start I'd defer to user demand, but this would now be a breaking change for customers who may have infra already set up relying on this being a ScheduleAnyway TSC.

rschalo commented 3 months ago

So, my thinking is that if leave this as is and don't surface an option to override, there are two kinds of cx that would break. The first are those that would already break given a zonal outage, which would not different behavior with DoNotSchedule as the default. The other case, is that Karpenter is running two replicas on the same node, the leader pod goes unhealthy, loses leadership, and then there is not a second node because of the DoNotSchedule TSC.

This PR changes the value which can be changed by cx, setting this by default requires a chart default update that we can revisit later but I think changing the values and making this recommendation is sufficient for now.