Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Karpenter support for cluster autoscaling #2712

Open alexisbel1 opened 2 years ago

alexisbel1 commented 2 years ago

Karpenter is an open-source node provisioning project built for Kubernetes. Its goal is to improve the efficiency and cost of running workloads on Kubernetes clusters. Karpenter works by:

Karpenter has many advantages over cluster autoscaler. One prerequisite would be that AKS can manage multiple instance types without defining multiple node pools.

Currently, the only cloud provider that supports Karpenter is AWS.

It would be awesome to have AKS support it.

ghost commented 2 years ago

Hi alexisbel1, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 2 years ago

Triage required from @Azure/aks-pm

ghost commented 2 years ago

Action required from @Azure/aks-pm

aido123 commented 2 years ago

+1

AstritCepele commented 2 years ago

+1

nahum-litvin-hs commented 2 years ago

+1 would also love to see this.

apton-sooraj commented 2 years ago

+1

tarun-asthana commented 2 years ago

+1

trash-anger commented 2 years ago

+1

laport-n commented 2 years ago

+1 !

palma21 commented 2 years ago

> Karpenter is an open-source node provisioning project built for Kubernetes.

Looking at the project, it doesn't seem generic enough to be used across all of Kubernetes; it seems to only work with AWS.

> One prerequisite would be that AKS can manage multiple instance types without defining multiple node pools.

Unfortunately, VMSS today only supports one type, but there is work being done by the VMSS team to allow for this.

All the bullet points you mentioned are in scope for cluster autoscaler, but you mentioned Karpenter has many advantages over CA. Could you be a bit more specific on those? What things would you like to accomplish on AKS?

alexisbel1 commented 2 years ago

> All the bullet points you mentioned are in scope for cluster autoscaler, but you mentioned Karpenter has many advantages over CA. Could you be a bit more specific on those? What things would you like to accomplish on AKS?

The main advantage over CA is the ability to provision new VM types based on workload requirements (resources, taints...). CA will only scale VMs of the same type up and down within a VMSS (that's why it would require allowing multiple VM types in the same node group). If the VM type does not match the workload requirements (e.g. GPU), the pod won't be able to start.
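To make the limitation concrete, here is a minimal sketch in Python. It is not how cluster autoscaler is implemented; the `VmType` and `Pod` classes, the `cpu-only-8` size name, and the pod names are all hypothetical. It only shows that adding more nodes of a single VM type never helps a pod whose requirements that type cannot meet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmType:
    # Hypothetical VM size; not a real Azure SKU.
    name: str
    cpus: int
    gpus: int = 0


@dataclass(frozen=True)
class Pod:
    name: str
    cpus: int
    gpus: int = 0


def fits(vm: VmType, pod: Pod) -> bool:
    """A pod fits if an empty node of this VM type covers its requests."""
    return vm.cpus >= pod.cpus and vm.gpus >= pod.gpus


def scale_up_same_type(pool_vm: VmType, pending: list[Pod]) -> tuple[list[Pod], list[Pod]]:
    """Split pending pods into those a new node of the pool's type could
    host and those that stay pending however many nodes are added."""
    schedulable = [p for p in pending if fits(pool_vm, p)]
    stuck = [p for p in pending if not fits(pool_vm, p)]
    return schedulable, stuck


# Illustrative data only.
pool = VmType("cpu-only-8", cpus=8)
pending = [Pod("web", cpus=2), Pod("train", cpus=4, gpus=1)]
schedulable, stuck = scale_up_same_type(pool, pending)
print([p.name for p in schedulable], [p.name for p in stuck])
```

With a homogeneous pool, the GPU pod stays pending until someone defines a separate GPU node pool in advance.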

guettli commented 2 years ago

Cluster API supports several providers: https://cluster-api.sigs.k8s.io/

wasabii commented 2 years ago

For some context for others here:

Karpenter is new, and it was written by Amazon. However, it's intended that additional cloud providers be added to it, just like cluster autoscaler. The source is open, and it's waiting for engineers to contribute.

It would ideally be the task of the Azure/AKS teams to provide the necessary resources to implement the Azure provider.

What differentiates it from the cluster auto scaler is it has no concept of Node Groups. There is no need to allocate a classification of a Node Group up front. Instead, it examines the requirements, and expects the cloud provider to be able to allocate exactly what it needs, from the smorgasbord of offerings the cloud provider might have.

That means if you need a machine with 32GB of ram, it'll go make one for you. If you need a spot instance it'll go make one for you. If it needs a node on AZ 2, it'll go make one for you. It doesn't require you to define all of the possible classes up front. Or it can consult the cloud provider for the most cost effective option that meets the requirements at the moment.
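A rough sketch of that selection step, in Python, might look like the following. This is not Karpenter's code or its cloud provider API; the `Offering` and `Requirements` types, the size names, and the prices are all made up for illustration. The idea is simply to filter the provider's catalog by what the workload needs and take the cheapest match, with no node groups defined up front.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offering:
    # One purchasable VM option. Names and prices are hypothetical.
    name: str
    cpus: int
    memory_gib: int
    zone: str
    spot: bool
    hourly_price: float


@dataclass(frozen=True)
class Requirements:
    cpus: int
    memory_gib: int
    zone: Optional[str] = None   # None means any zone
    spot: Optional[bool] = None  # None means spot or on-demand


def cheapest_fit(catalog: list[Offering], req: Requirements) -> Optional[Offering]:
    """Return the cheapest offering that satisfies every requirement, or None."""
    candidates = [
        o for o in catalog
        if o.cpus >= req.cpus
        and o.memory_gib >= req.memory_gib
        and (req.zone is None or o.zone == req.zone)
        and (req.spot is None or o.spot == req.spot)
    ]
    return min(candidates, key=lambda o: o.hourly_price, default=None)


# Illustrative catalog; a real provider would expose far more combinations.
CATALOG = [
    Offering("general-4", 4, 16, "1", False, 0.20),
    Offering("general-8", 8, 32, "1", False, 0.40),
    Offering("general-8", 8, 32, "2", False, 0.40),
    Offering("general-8-spot", 8, 32, "2", True, 0.10),
    Offering("memory-8", 8, 64, "2", False, 0.55),
]

any_32gib = cheapest_fit(CATALOG, Requirements(cpus=2, memory_gib=32))
on_demand_zone2 = cheapest_fit(
    CATALOG, Requirements(cpus=2, memory_gib=32, zone="2", spot=False)
)
print(any_32gib, on_demand_zone2)
```

The requirements drive the choice here, which is the inverse of picking from a fixed list of predefined node groups.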

This does present some architectural challenges as to how this would be surfaced in AKS. Would it just go and create VMs one by one? Would it still use a VMSS, but require arbitrary resource request support within a VMSS? How will network topology be defined in the former? Etc.

But it is a much more extensible approach than the way CA is built, at least for cloud providers. Azure obviously has hundreds of VM families, series, sizes, disk capabilities, operating systems, etc., and the Cartesian product of them all is massive.

ellistarn commented 2 years ago

👋 I lead the Karpenter project. We'd love to collaborate on additional cloud providers and have done our best to factor out a simple and extensible cloud provider API to minimize the effort for other providers to adopt. If you're interested in chatting about the project, feel free to join in at our working group.

dkbhadeshiya commented 1 year ago

This would really be an interesting feature to support Azure/AKS

seyal84 commented 1 year ago

Interesting, and following this for the future. Cannot wait to test this in AKS whenever this feature is supported.

markthebault commented 1 year ago

+1

JungBin-Eom commented 1 year ago

+1 Really interesting feature 👍

vishal-swarankar-sdl commented 1 year ago

+1 for it. happy to collaborate

denisp13 commented 1 year ago

+1 looking forward to having Karpenter onboarded to AKS

stackErr-NameNotResolved commented 1 year ago

Would be very helpful

jenciso commented 1 year ago

+1 It would be amazing to have karpenter in AKS

tbrigley commented 1 year ago

Having recently implemented Karpenter on all of our non-prod EKS clusters and also moved to spot instances, we are seeing significant improvements in orchestration and cost for these often spiky workloads. Node counts and cost are down, and there has been no downside over 3 months now. We deploy a few nodes for Karpenter and set affinity, then let Karpenter do the rest using the Karpenter Helm recipes and some node pool definitions. Karpenter's methods of determining node size and aligning nodes with pods for best behavior are really unseen before on kube (imho, debates welcome!). The same logical handling of spot VMs/nodes would be incredibly useful for AKS. It doesn't really compare to other CA because it takes out so much guesswork, and it could be used alongside CA if done properly. It's under an Apache 2.0 license. Hard +1.

vinshetty commented 1 year ago

+1 This would really be an interesting feature to support Azure/AKS

ppodevlabs commented 1 year ago

+1, this would be a really nice improvement for AKS

philwelz commented 1 year ago

+1 👍

eb-koddi commented 1 year ago

+1 After using Karpenter with AWS EKS, it is extremely painful to manage cluster autoscaling on Azure...

poblahblahblah commented 1 year ago

+1

It would be a massive improvement for AKS if AKS had Karpenter support.

SerdarYalcin commented 1 year ago

+1 It would be amazing to use Karpenter on Azure.

wesctl commented 1 year ago

Check out spot.io Ocean. It does bin packing on Azure w/ additional features on top.

mattstep commented 1 year ago

Big +1 for Karpenter on AKS

gmslabs commented 1 year ago

Hard +1

saurabhvagrawal commented 1 year ago

+1

sharninder commented 1 year ago

Hard +1

damoshushu commented 1 year ago

+1

bplasmeijer commented 1 year ago

Yes, I agree.

Provisioning nodes that meet the requirements of the pods should save a lot of overprovisioned CPUs and keep the planet healthy.

It's always a balance between small and large node pool SKUs (a balanced workload should do better). Yes, multiple node pools, with user node pools scaled to 0, help a bit. But assigning pods to nodes using node affinity cannot solve the user story below.

Example: a 16-CPU SKU, with pods that each request 4 CPUs:

{[POD-4][POD-4][POD-4][POD-4]}{[POD-4][SPARE-12CPU]}

Waste of 12CPU.
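The arithmetic behind that example can be sketched in Python as below. This is a simplified illustration under stated assumptions: only CPU is counted (no memory, system-reserved capacity, or daemonsets), and the list of available node sizes `[4, 8, 16]` is hypothetical. It compares a single fixed SKU against picking the smallest size that fits the leftover pods.

```python
import math


def fixed_sku_waste(pod_cpus: int, pod_count: int, sku_cpus: int) -> tuple[int, int]:
    """Nodes needed and spare CPUs when every node uses the same SKU."""
    pods_per_node = sku_cpus // pod_cpus
    if pods_per_node == 0:
        raise ValueError("pod does not fit on this SKU")
    nodes = math.ceil(pod_count / pods_per_node)
    return nodes, nodes * sku_cpus - pod_count * pod_cpus


def right_sized_waste(pod_cpus: int, pod_count: int, sizes: list[int]) -> tuple[list[int], int]:
    """Fill nodes of the largest size, then pick the smallest size that
    fits the leftover pods. Returns the chosen sizes and spare CPUs."""
    largest = max(sizes)
    pods_per_node = largest // pod_cpus
    if pods_per_node == 0:
        raise ValueError("pod does not fit on any size")
    full_nodes, leftover_pods = divmod(pod_count, pods_per_node)
    chosen = [largest] * full_nodes
    if leftover_pods:
        needed = leftover_pods * pod_cpus
        chosen.append(min(s for s in sizes if s >= needed))
    return chosen, sum(chosen) - pod_count * pod_cpus


# Five 4-CPU pods, as in the diagram above.
print(fixed_sku_waste(4, 5, 16))
print(right_sized_waste(4, 5, [4, 8, 16]))
```

For five 4-CPU pods, the fixed 16-CPU SKU needs two nodes and leaves 12 CPUs idle, while the right-sized choice (one 16-CPU node plus one 4-CPU node) leaves none.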

cc: @palma21 / @brendandburns

pzelensky commented 1 year ago

+1

bourbonfgiles commented 1 year ago

+1

Flawke commented 1 year ago

I can't describe enough how badly this is needed.

hyperevo commented 1 year ago

+1

rkydx commented 1 year ago

+1

MartyB-007 commented 1 year ago

+1

nokhiz commented 1 year ago

+1

gunnars04 commented 1 year ago

+1

Jamess-Lucass commented 1 year ago

+1

wfclark5 commented 1 year ago

+1

davidgarc commented 1 year ago

+1

dnitsch commented 1 year ago

👍

mtrin commented 1 year ago

If AKS could use https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/spot-priority-mix, which I understand is not possible at the moment, would it not resolve many of the issues related to this?