aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0
6.14k stars, 846 forks

Update NodePool.Status with node drift information. #6166

Open shreyas-badiger opened 1 month ago

shreyas-badiger commented 1 month ago

Description

What problem are you trying to solve? Currently, there is no clear way to identify how many nodes have drifted from the current hash (NodePool hash and EC2NodeClass hash). To determine node rotation progress, we have to inspect individual Node objects, NodeClaims, the NodePool, and the EC2NodeClass.

Since the Karpenter controller identifies and rotates drifted nodes, I assume it already maintains a list of drifted nodes (or, if not, identifies them in every reconciliation). It would be helpful to surface this information in the NodePool status.

For example:

status:
  resources:
    cpu: "64"
    ephemeral-storage: 134205420Ki
    memory: 258565188Ki
    pods: "640"
    nodes:
      totalNodes: 10
      driftedNodes: 2

How important is this feature to you? This feature would be very useful for tracking the progress of node rotation whenever we change AMIs or trigger any other kind of upgrade by updating the NodePool or EC2NodeClass.

jonathan-innis commented 1 month ago

Does a metric work for you here? Or would you like a constant update of rolled-up information directly in the status? This is definitely something we are considering as we think about how to improve Karpenter's observability for v1.

jonathan-innis commented 1 month ago

Does this request belong in the kubernetes-sigs/karpenter repo, since it's about the neutral concept of drift?

vgunapati commented 1 month ago

It would be advantageous to have this data accessible in both metrics and CR status. Adding it to the CR status would greatly benefit other watchers in the cluster. We should also consider adding the number of nodes that are restricted because of PDB violations.

jonathan-innis commented 1 month ago

> It would be advantageous to have this data accessible in both metrics and CR status

We do have to be a little thoughtful about the number of updates that this would generate. I'm not saying that it's out of the question, but metrics are a bit easier to swallow, if only because they're pull-based rather than push-based.

> there is no clear way to identify how many nodes have drifted from the current hash

You could take a look at the NodeClaim status conditions to see if a NodeClaim has a "Drifted" condition. Counting these up across the cluster (or by label) should give you the info you want. Yeah, you have to construct it, but given that this doesn't currently exist in Karpenter, this is a possible workaround.
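
The counting workaround described above could look roughly like the following. This is a minimal sketch, not something Karpenter ships: it assumes the input is the parsed JSON from `kubectl get nodeclaims -o json`, that a drifted NodeClaim carries a status condition with type `Drifted` and status `"True"`, and that each NodeClaim is labeled with `karpenter.sh/nodepool`. The function names and the `totalNodes` / `driftedNodes` keys are illustrative, borrowed from the shape proposed in the issue.

```python
from collections import Counter

# Assumption: NodeClaims carry this label naming their owning NodePool.
NODEPOOL_LABEL = "karpenter.sh/nodepool"


def is_drifted(nodeclaim: dict) -> bool:
    """Return True if the NodeClaim has a Drifted condition set to "True"."""
    conditions = (nodeclaim.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Drifted" and c.get("status") == "True"
        for c in conditions
    )


def drift_summary(nodeclaim_list: dict) -> dict:
    """Roll up total and drifted NodeClaim counts per NodePool.

    `nodeclaim_list` is the dict parsed from `kubectl get nodeclaims -o json`.
    """
    total = Counter()
    drifted = Counter()
    for nc in nodeclaim_list.get("items") or []:
        labels = (nc.get("metadata") or {}).get("labels") or {}
        pool = labels.get(NODEPOOL_LABEL, "<unlabeled>")
        total[pool] += 1
        if is_drifted(nc):
            drifted[pool] += 1
    # Key names mirror the hypothetical status fields proposed in the issue.
    return {
        pool: {"totalNodes": total[pool], "driftedNodes": drifted[pool]}
        for pool in total
    }
```

To try it, pipe the `kubectl` output into a script that calls `drift_summary(json.load(sys.stdin))`. Note that this counts NodeClaims rather than Node objects, so the totals may differ briefly from the node count while a NodeClaim is still launching.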