aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Labels applied to nodes via NodePool generate requirements and subsequent errors #6387

Closed: fullykubed closed this issue 1 day ago

fullykubed commented 1 week ago

Description

Observed Behavior:

Labels added to spec.template.metadata of NodePool resources are added to the requirements array of generated NodeClaim resources.

If the number of labels exceeds 30, the following errors are generated:

NodeClaim.karpenter.sh "spot-arm-tqsgl" is invalid: [spec.requirements: Too many: 53: must have at most 30 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

In practice, the error occurs with far fewer than 30 labels, since the generated NodeClaim will typically carry other requirements as well.

Expected Behavior:

I imagine there is some intended purpose for propagating labels as requirements, but I am not sure what it might be.

I think this should be opt-in behavior rather than the default (or it might be an outright bug). I do not know what benefit it provides in the majority of circumstances, but it hinders the ability to apply labels that should not affect node scheduling (for example, labels used for ownership and billing purposes).
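
For example (label keys and values below are illustrative, not from my actual setup), a NodePool might stamp every node with ownership and cost-allocation labels that no pod ever selects on:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    metadata:
      labels:
        # Labels intended purely for ownership/billing, never referenced
        # by any pod's nodeSelector or nodeAffinity
        team: platform
        cost-center: "1234"
        environment: production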

At the very least, I think this behavior should be documented.

Reproduction Steps (Please include YAML):

This NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    metadata:
      labels:
        foo: bar

would generate this NodeClaim

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
spec:
  requirements:
  - key: foo
    operator: In
    values:
    - bar

Versions:

jigisha620 commented 1 week ago

There is a reason why we are propagating these labels as requirements on nodeClaim. These labels are propagated to nodes, where they are instrumental in node selection via nodeAffinity or nodeSelectors on your Pods.
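
For instance (using the foo: bar label from the reproduction above; the pod name and image are placeholders), a Pod can target those nodes with a nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  nodeSelector:
    foo: bar            # matches the label set in the NodePool's spec.template.metadata.labels
  containers:
  - name: app
    image: nginx        # placeholder image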

fullykubed commented 1 week ago

These labels are propagated to nodes, where they are instrumental in node selection via nodeAffinity or nodeSelectors on your Pods.

This part I understand; my intent is to apply an arbitrary number of labels to the generated nodes.

There is a reason why we are propagating these labels as requirements on nodeClaim.

This part I do not understand. I am sure there is a reason, but it is unclear why every node label must be an individual requirement on every NodeClaim. Additionally, this doesn't appear to be a strict rule as I can see Karpenter add many node labels that are not treated as requirements.

Maybe my mental model is incorrect, but wouldn't it make sense to only add the label as a requirement if it is used in a label selector by a pending pod?

jigisha620 commented 1 week ago

When Karpenter tries to schedule your workload, it has to take into consideration all the requirements from the NodePool (labels as well as requirements) along with the pod's requirements. If a pod has a node affinity for a specific label, Karpenter needs to know that it is a requirement for the launch, and it orchestrates that by adding it to the NodeClaim's requirements.
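
As a rough sketch of that flow (pod name and image are placeholders), a pending Pod with a required node affinity on the foo label:

apiVersion: v1
kind: Pod
metadata:
  name: pinned-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: foo
            operator: In
            values:
            - bar
  containers:
  - name: app
    image: nginx        # placeholder image

must land on a node carrying foo=bar, which corresponds to the foo In [bar] requirement on the generated NodeClaim shown earlier.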

fullykubed commented 1 week ago

If a pod has a node affinity for a specific label, Karpenter needs to know that it is a requirement for the launch, and it orchestrates that by adding it to the NodeClaim's requirements.

I think you are highlighting the point I am trying to make: Even though no pods are using these labels for node selection, all labels from the NodePool's spec.template.metadata.labels are still being added as requirements.

My expectation would be that only those labels that are being used by pods for node selection would be added as requirements.

In particular, I do not believe this statement is correct:

it has to take into consideration all the requirements from nodePool (labels as well as requirements)

At the very least this does not seem to be the documented behavior for NodePools.

Per the documentation for spec.template.spec.requirements:

# Requirements that constrain the parameters of provisioned nodes.
# These requirements are combined with pod.spec.topologySpreadConstraints, pod.spec.affinity.nodeAffinity, pod.spec.affinity.podAffinity, and pod.spec.nodeSelector rules.

There is no mention of spec.template.metadata.labels being implicitly added as additional requirements.

Also, this doesn't seem to make sense conceptually. If someone wants a node label to always act as a requirement, that is exactly what the spec.template.spec.requirements field is for.
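
For reference, an explicit requirement on a NodePool looks like this (a sketch using the well-known capacity-type key):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    spec:
      requirements:
      # Explicitly constrain provisioned nodes to spot capacity
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot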

Adding all of the labels implicitly, even when they are never used by pods for scheduling, seems like a bug and adds an unnecessary limitation to Karpenter (a hard cap on the number of labels you can apply to nodes).

jigisha620 commented 1 week ago

While I understand the point you are trying to make here, if we expect the node to have a label then we consider it a requirement on the NodeClaim. However, for your use case we can go ahead and increase the limit on the number of requirements to something that works for you. Would increasing the limit for spec.requirements to 100 work for you?

fullykubed commented 1 week ago

100 would be more than enough for my use case personally. I appreciate your flexibility in helping to find a workaround.

I also think that if adding all node labels as requirements is the intended behavior and isn't going to change, that should probably be added to the documentation along with a warning that there is a hard cap of 100 total requirements and labels.

jonathan-innis commented 1 day ago

Closing since we've merged the change to allow 100 requirements. This change should be present in the next release. We're going to do a docs update in a separate change that won't be tracked by this original issue.