aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter not honoring/working with topology spread constraints or pod affinity #6694

Open bwmetcalf opened 1 month ago

bwmetcalf commented 1 month ago

Description

Observed Behavior:

This will be rather long as I describe the different scenarios we've tested. We have a deployment that, by default, does not specify topologySpreadConstraints or affinity. Using the k8s default constraints with no node selectors or tolerations, its three replicas get spread across our three AZs in us-west-2 on our untainted node pool. We are attempting to provide a dedicated node pool for this deployment and cannot get Karpenter to honor, or work with, different combinations of topologySpreadConstraints and/or affinity. Below are the node pool, the subnet section of the node class definition, and the pod nodeSelector and tolerations, along with the behavior we are seeing in each scenario. (The scheduler's built-in default constraints are sketched just below for reference.)
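
For reference, by "default constraints" we mean kube-scheduler's built-in default topology spread constraints, which per the Kubernetes docs are roughly:

defaultConstraints:
- maxSkew: 3
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway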

Node pool:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "11712266115069733881"
    karpenter.sh/nodepool-hash-version: v2
  creationTimestamp: "2024-08-08T19:31:29Z"
  generation: 2
  name: deployment-node-pool
  resourceVersion: "995009893"
  uid: df0ab507-2874-4483-832a-f1d26b551bc9
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 30s
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  limits:
    memory: 128Gi
  template:
    metadata:
      labels:
        deployment-dedicated: "true"
        node.blah/include-target-group-deployment-http: "true"
        node.blah/include-target-group-deployment-https: "true"
    spec:
      nodeClassRef:
        name: blah-default-node-class
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - m5
      taints:
      - effect: NoSchedule
        key: deployment-dedicated
status:
  resources:
    cpu: "6"
    ephemeral-storage: 1572089868Ki
    memory: 23945624Ki
    pods: "87"

Node class snippet:

...
  subnetSelectorTerms:
  - id: subnet-0c94a1cdcd52dd53f
  - id: subnet-0f1fc6fa5767fdfa9
  - id: subnet-0a1c2721b7e0ff43a
...
status:
...
  subnets:
  - id: subnet-0a1c2721b7e0ff43a
    zone: us-west-2c
    zoneID: usw2-az3
  - id: subnet-0f1fc6fa5767fdfa9
    zone: us-west-2b
    zoneID: usw2-az2
  - id: subnet-0c94a1cdcd52dd53f
    zone: us-west-2a
    zoneID: usw2-az1

The first thing we tried was a topologySpreadConstraints definition as follows (tested with both ScheduleAnyway and DoNotSchedule):

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          product: blah
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          product: blah

This resulted in Karpenter spinning up the first node and scheduling all three pods on it. We then attempted the following:

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "product"
            operator: In
            values:
            - "blah"
        topologyKey: topology.kubernetes.io/zone

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          product: blah
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          product: blah

which resulted in Karpenter not responding at all; the first pod of the deployment never got scheduled. The following schedules all three pods in the deployment but does not spread them across AZs (the only change from above is the affinity topology key):

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "product"
            operator: In
            values:
            - "blah"
        topologyKey: kubernetes.io/hostname

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          product: blah
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          product: blah

Below are the nodeSelector and tolerations from the pods:

  nodeSelector:
    blah-dedicated: "true"
    node.blah/include-target-group-blah-http: "true"
    node.blah/include-target-group-blah-https: "true"
  tolerations:
  - effect: NoSchedule
    key: deployment-dedicated
    operator: Exists

Expected Behavior: Karpenter would honor the topologySpreadConstraints or affinity settings and spread pods across three nodes in three AZs.

Reproduction Steps (Please include YAML):

Versions:

njtran commented 1 month ago

So if I understand correctly, you're doing:

  1. preferred zonal TSC and preferred hostname TSC for product: blah labeled pods
  2. adding another deployment with pod anti-affinity on zone, with preferred TSCs on zone and hostname for product: blah pods.

The pods that you've shared don't have the product: blah label. Is this expected? If you're trying to make them spread relative to each other, these TSC/anti-affinity constraints are targeting other pods, not the ones you've created.
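
For example, if the intent is for the deployment's own pods to spread against each other, the pod template would need to carry the label that the selectors reference. A minimal sketch (the deployment name, image, and label values here are illustrative, not taken from your manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: blah-deployment            # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      product: blah
  template:
    metadata:
      labels:
        product: blah              # the label your TSC/anti-affinity selectors match on
    spec:
      containers:
      - name: app
        image: example/app:latest  # placeholder image
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            product: blah          # now selects this deployment's own pods
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            product: blah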

In general, I'm wondering why you need preferred zonal and hostname spread, and also zonal anti-affinity. You're trying to create a deployment whose pods aren't scheduled on the same instance and are also spread evenly across each instance and AZ? It seems like you could either remove the required hostname anti-affinity or the preferred zonal topology spread, and that should make it easier to reason about.
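
As a sketch of the first option (dropping the pod anti-affinity and relying on topology spread alone): note the DoNotSchedule tightening on the zone constraint is my addition here, since ScheduleAnyway is only best-effort, and it assumes the pods carry the product: blah label as noted above:

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule     # hard requirement: zone skew may not exceed 1
    labelSelector:
      matchLabels:
        product: blah
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway    # best-effort: prefer one pod per node
    labelSelector:
      matchLabels:
        product: blah

With a required (DoNotSchedule) zonal constraint, kube-scheduler can't pack the pods into one zone, and Karpenter should provision capacity in the other zones to satisfy it.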