[question]: Node Under Utilization

uptownhr commented 2 months ago

Prior Search

[X] I have already searched this project's issues to determine if a similar question has already been asked.

What is your question?

After upgrading to edge.2024-09-04 and edge.2024-09-10, node utilization has been sitting around 50%. I watched for over 4 hours as pods stabilized and nodes were spun up and down. The nodes have now been stable for over 2 hours and are no longer being consolidated.

Here are all the event logs from nodes that I believe could have been consolidated but were blocked:

Disruption Blocked: `pdb "valut/vault" prevents pod eviction`
Disruption Blocked: `pdb "authentk/pvc-annotator-..." prevents pod eviction`
Disruption Blocked: `pdb "alb-controller/alb-controller" prevents pod eviction`
Disruption Blocked: `pdb "cert-manager/cert-manager-webhook" prevents pod eviction`
Unconsolidatable: can't remove without creating 2 candidates

I also noticed that not much was being scheduled onto the controller nodes. Both of my controller nodes have only 4 pods running. I don't know if this is expected, but it seems different from what I remember.

What can be done to resolve the PDBs, and why aren't pods being scheduled on the controller nodes?

What primary components of the stack does this relate to?

No response

Code of Conduct

[X] I agree to follow this project's Code of Conduct

fullykubed commented 2 months ago

> why aren't pods being scheduled on the controller nodes?

Just verified that on the reference cluster running edge.2024-09-10, there is nothing preventing pods from most of our modules from running on controller nodes.

Keep in mind that the bin-packing scheduler will try to pack the pods onto as few nodes as possible. It is not unusual that some nodes will have very low utilization for that reason. That allows Karpenter to spin them down. However, Karpenter will never spin down a controller node. I believe that is likely what you are seeing here.
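To illustrate why bin-packing tends to leave a nearly empty node, here is a toy sketch. It is not the actual scheduler: it assumes a single CPU resource, identical nodes, and a simple "prefer the fullest node that still fits" rule. The function name `pack_most_allocated` and the numbers are hypothetical.

```python
from typing import List


def pack_most_allocated(pod_cpus: List[int], node_capacity: int, node_count: int) -> List[float]:
    """Place each pod on the fullest node that can still fit it.

    Toy model of a bin-packing placement rule (assumption: one resource,
    identical nodes). Returns the per-node utilization as fractions.
    """
    used = [0] * node_count
    for cpu in pod_cpus:
        # Nodes that still have room for this pod.
        candidates = [i for i in range(node_count) if used[i] + cpu <= node_capacity]
        if not candidates:
            raise ValueError(f"no node can fit a pod requesting {cpu} CPU")
        # Prefer the most-allocated node; break ties by lowest index.
        best = max(candidates, key=lambda i: (used[i], -i))
        used[best] += cpu
    return [u / node_capacity for u in used]


# Three 4-CPU nodes, five pods: two nodes fill up, the last one stays mostly idle.
utilization = pack_most_allocated([2, 2, 2, 2, 1], node_capacity=4, node_count=3)
print(utilization)  # [1.0, 1.0, 0.25]
```

In this model the mostly idle third node is the one an autoscaler would want to remove. If that node happens to be one that is never spun down, its low utilization persists and drags down the cluster average.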

We can look into a way to optimize the behavior here.

> What can be done to resolve the PDBs?

A PDB blocks disruption when too few pods in its set are running and healthy to allow any further disruption. You need to provide more information for each PDB, specifically why pods in its set are already unhealthy. Typically this is because they have already been evicted for one reason or another. You can look at the Kubernetes events to find the reasons for a pod's eviction.
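As a rough model of the arithmetic behind this, the sketch below computes how many voluntary disruptions a PDB would still allow. It is a simplified illustration, not the controller's code: the helper names are hypothetical, and it assumes percentages are resolved against the expected pod count and rounded up.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, expected_pods: int) -> int:
    """Turn an int or a percentage string such as "50%" into a pod count (rounded up)."""
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError(f"expected a percentage like '50%', got {value!r}")
        percent = int(value[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        return -(-expected_pods * percent // 100)
    return value


def disruptions_allowed(
    expected_pods: int,
    healthy_pods: int,
    min_available: Optional[IntOrPercent] = None,
    max_unavailable: Optional[IntOrPercent] = None,
) -> int:
    """Return how many more pods could be evicted without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _resolve(min_available, expected_pods)
    else:
        desired_healthy = max(0, expected_pods - _resolve(max_unavailable, expected_pods))
    return max(0, healthy_pods - desired_healthy)


# Two replicas with maxUnavailable=1: one eviction is allowed while both are healthy...
print(disruptions_allowed(expected_pods=2, healthy_pods=2, max_unavailable=1))  # 1
# ...but none once one replica is already down, so the node cannot be drained.
print(disruptions_allowed(expected_pods=2, healthy_pods=1, max_unavailable=1))  # 0
```

On a live cluster, the corresponding numbers appear in each PDB's status (`currentHealthy`, `desiredHealthy`, `disruptionsAllowed`), for example via `kubectl get pdb -A`.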

Looking at the reference cluster running edge.2024-09-10, I am seeing >90% utilization. As a result, you should ensure you have everything upgraded and then look into why your cluster is unstable with respect to your PDBs.

fullykubed commented 2 months ago

An optimization for the controller node bin-packing has been included in the next release.

Panfactum / stack