aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter Log Readability for Unschedulable Pods #7199

Open 1it opened 1 month ago

1it commented 1 month ago

Description

What problem are you trying to solve?

Currently, when Karpenter can't schedule a pod, the logs become flooded with detailed information about every incompatible node pool in the cluster. This makes it extremely difficult to pinpoint the actual cause of the scheduling failure, especially when dealing with a large number of node pools. I waste a lot of time sifting through irrelevant information to find the relevant details.

Proposed behaviour:

Instead of listing every incompatible node pool in full, the log should summarize the failures: de-duplicate node pools rejected for the same reason, print each distinct reason once, and keep the message short enough to read and grep.

How important is this feature to you?

This is a significant pain point for me as an SRE. Clearer logs would drastically improve my debugging efficiency and reduce the time it takes to resolve scheduling issues. This would be a valuable improvement for anyone operating Karpenter, particularly in larger environments.

njtran commented 1 month ago

Hey @1it, this is definitely a hard problem. At its core, when you see a pod being scheduled, you probably have an understanding of which NodePools you do and don't care about for that pod. When Karpenter simulates the pod scheduling, it tries to schedule the pod to each of the NodePools. How would you suggest we make the log clearer, so that we can separate the information that's unnecessary (in your eyes) from the necessary information? Logs should ideally give the full picture behind a decision, but as you're saying, it's easy to be overwhelmed with too much information when there are many nodepools.
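
Roughly speaking, the shape of the problem looks like this (a heavily simplified sketch, not Karpenter's actual scheduler code; the pool names and message text are made up):

```go
// Simplified sketch only -- not Karpenter's actual scheduler code. It shows
// why the error output grows with the number of NodePools: each pool that
// fails the scheduling simulation contributes its own explanation.
package main

import (
	"fmt"
	"strings"
)

// simulate stands in for the real per-NodePool compatibility check.
func simulate(pool, pod string) error {
	return fmt.Errorf("incompatible with nodepool %q", pool)
}

func main() {
	pools := []string{"general", "gpu", "arm64"} // imagine dozens of these
	pod := "default/my-app"

	var reasons []string
	for _, pool := range pools {
		if err := simulate(pool, pod); err != nil {
			reasons = append(reasons, err.Error())
		}
	}
	// A single log line ends up carrying every per-pool reason, which is
	// the flood being described in this issue.
	fmt.Printf("could not schedule pod %q: %s\n", pod, strings.Join(reasons, "; "))
}
```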

1it commented 1 month ago


Hey @njtran, that's right. In my case, I've got dozens of nodepools and basically no clue what went wrong, given the format of the log messages. It's not only the sheer amount of duplicated information (which also consumes disk space); the format itself is hardly readable, with similar messages stacked up on top of each other. It's just hard to extract meaningful information even with search (quite hard to grep as well).

OverStruck commented 1 month ago

Sounds like, because Karpenter goes through each node pool, it logs an error for each node pool that couldn't be used. So if you have a lot of node pools, you'll see messages for all of them. And what @njtran is saying is that, because of this, you should know which node pool you care about (the one that was supposed to be used for your pod) and check the logs for the specific messages about that node pool.

Karpenter, if I'm understanding @njtran correctly, does not know which node pool you want to use for a given pod when you have more than one node pool. Of course, if your pod has a nodeSelector, and only 1 out of 2 node pools configures that label on its nodes, then perhaps you could argue Karpenter should know that only 1 of those 2 node pools is the correct one, because the requirements match (the pod has a nodeSelector and the node pool creates nodes with that label).

But even in that scenario, Karpenter still needs to look at each node pool to know which one meets the requirements.

The only "fix" I see for this is adding some annotation to pods where you specify your Karpenter nodepool, so when scheduling fails, logs only show the node pool specified in the pod annotation and perhaps it could even help with performance, instead of looping thru all node pools, just use the one in the annotation.

TLDR: Karpenter does not know beforehand which node pool is the correct one when you have many, and will loop through each one until one can be used for the unscheduled pod. Therefore it logs for each node pool in the loop.

1it commented 1 month ago


I understand Karpenter's current logic. I don't expect Karpenter to "know" what I need. What I was suggesting is only about making the log messages less clogged and more concise. It's probably not the easiest task, but as an "end-user" of Karpenter I just want to see a more helpful error message. There's simply no value in printing out all the nodepools that cannot even remotely satisfy the criteria for my pod. For example, if there's no matching nodepool, you could print something like this (very roughly):
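
```
# Hypothetical format, invented purely for illustration -- not an
# existing Karpenter message; pod name, counts, and reasons are made up:
could not schedule pod "default/my-app": no compatible nodepool
  24 nodepools rejected, 3 distinct reasons:
   - 20x did not match pod nodeSelector
   -  3x no instance type satisfied the pod's resource requests
   -  1x pod did not tolerate nodepool taints
```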

njtran commented 1 month ago

The only thing I could really see is de-duplicating nodepools that have the same failure explanation for why they're not compatible. If the reason your nodepools aren't compatible is something like a custom label that's different on every nodepool, this doesn't help. It'd be strictly better, though.
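
As a rough sketch of what that de-duplication could look like (hypothetical types and names, not Karpenter's actual internals):

```go
// Rough sketch of de-duplicating per-NodePool failures by reason.
// Hypothetical types and names -- not Karpenter's actual internals.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type failure struct {
	NodePool string
	Reason   string
}

// summarize groups NodePools that failed for the same reason, so each
// distinct reason is printed once instead of once per NodePool.
func summarize(failures []failure) string {
	byReason := map[string][]string{}
	for _, f := range failures {
		byReason[f.Reason] = append(byReason[f.Reason], f.NodePool)
	}
	reasons := make([]string, 0, len(byReason))
	for r := range byReason {
		reasons = append(reasons, r)
	}
	sort.Strings(reasons) // deterministic output order
	var b strings.Builder
	for _, r := range reasons {
		fmt.Fprintf(&b, "%d nodepool(s) [%s]: %s\n",
			len(byReason[r]), strings.Join(byReason[r], ", "), r)
	}
	return b.String()
}

func main() {
	fmt.Print(summarize([]failure{
		{"gpu-a", "did not tolerate taint nvidia.com/gpu"},
		{"gpu-b", "did not tolerate taint nvidia.com/gpu"},
		{"general", "insufficient cpu for pod requests"},
	}))
}
```

Grouping on the reason string means pools that differ only by name collapse into one line, while pools with genuinely unique reasons (like the per-pool custom label case above) still get their own lines, so the output degrades back to today's behavior in the worst case.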