I have a similar problem: even using topology spread, all pods are allocated on the same machine:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-t6
  namespace: test
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx
```
Karpenter: v0.6.1
Kubernetes: v1.20.11
> Karpenter should place the nodes in multiple subnets
Thank you Rolind. If I understand this correctly, the subnet selector instructs Karpenter which subnets are available (in this case it detected all three subnets correctly). Based on the available subnets, Karpenter computes which instance types can be used and how to override the launch template. However, the scheduling logic does not automatically spread the pods across available zones. Instead, Karpenter relies on topology spread constraints to achieve that.
> I have a similar problem: even using topology spread, all pods are allocated on the same machine
Hi Fabiano, this is indeed strange. Can you also share the Karpenter logs and your provisioner config?
> I have a similar problem: even using topology spread, all pods are allocated on the same machine
We've seen issues with the default scheduler where this can happen if capacity is already available. If you have a single node in a cluster, and deploy 3 pods with hostname/topology spread, the kube scheduler will "spread" across existing nodes if there's room. If the node has room for all 3 pods, they'll happily all schedule there. Karpenter knows the possible zones and will force spread them during provisioning, but we can't control the kube scheduler.
> All the nodes were placed in the same subnet and same AZ (us-west-1a).
When using spot, Karpenter will choose the cheapest instance type. In this case, it looks like us-west-1a was the cheapest.
> I have a similar problem: even using topology spread, all pods are allocated on the same machine
>
> Hi Fabiano, this is indeed strange. Can you also share the Karpenter logs and your provisioner config?
Sure. Some details: there was a single node on the cluster, provisioned by EKS managed node groups.
And here is how the pods were allocated:
As noted by @ellistarn, my issue is not related to Karpenter, but to the way the Kubernetes scheduler works. The documentation about topology constraints points out this limitation: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#known-limitations.
Will Karpenter be able to circumvent this limitation in the future?
I don't know of a path forward to circumvent the kube scheduler, beyond becoming a custom scheduler.
PodAntiAffinity will help with this, since its spec doesn't allow for multiple pods per topology key, whereas topology spread does.
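As an illustration, a minimal sketch of such a required anti-affinity (the `app: nginx` label is a placeholder for your own pod labels):

```yaml
# Pod spec fragment: a hard anti-affinity that permits at most one
# matching pod per zone, so e.g. 3 replicas force 3 distinct zones.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: nginx  # placeholder; select your own pods
      topologyKey: topology.kubernetes.io/zone
```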
If it were possible to add a "max" to topology, instead of "max skew", this would solve the problem as well, but it would need to happen upstream.
If you change the scheduler name in the pod spec to something other than the default, it will skip the kube-scheduler and Karpenter will keep working. However, Karpenter doesn't reuse existing capacity (it relies on the kube-scheduler for that), so you will always get new nodes for these pods in this configuration. It's definitely a bit of a hack.
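A rough sketch of that hack, assuming any scheduler name other than `default-scheduler` works as described (the name below is made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inflate-0  # hypothetical pod
spec:
  # The kube-scheduler ignores pods with a foreign schedulerName,
  # leaving Karpenter to provision capacity for them.
  schedulerName: not-default-scheduler
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2
```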
Is Karpenter aware of faulty AZs? One of the big challenges with topology constraints and faulty AZs is that both ASGs and the kube-scheduler will want to launch instances/workloads in faulty AZs.
I don't think you would want to temporarily avoid spread due to a faulty AZ. You sacrifice static stability if you try to detect outages and evacuate. If we can't get capacity for a pod that wants to run in an AZ, we need to keep retrying until it succeeds. This is how the kube-scheduler works as well.
Sort of. When there is an outage, the nodes in the faulty AZ go away and the scheduler can carry on. If Karpenter still insists on using a faulty AZ, it'll make the situation worse.
Labeled for closure due to inactivity in 10 days.
Re-opening this because I think this might be unexpected. I have a brand new cluster set up with Karpenter running on Fargate. All nodes are spun up in a single AZ, all on the same instance type (which is not great for spot diversification).
This is my infra:
```ts
import { Stack, IResource, StackProps, aws_eks as eks } from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';
import { Construct } from 'constructs';
import { IVpc } from 'aws-cdk-lib/aws-ec2';

export interface EksLabStackProps extends StackProps {
  vpc: IVpc;
}

export class EksLabStack extends Stack {
  constructor(scope: Construct, id: string, props: EksLabStackProps) {
    super(scope, id, props);

    const clusterProvider = new blueprints.GenericClusterProvider({
      version: eks.KubernetesVersion.V1_21,
      fargateProfiles: {
        karpenter: {
          fargateProfileName: 'karpenter',
          selectors: [{ namespace: 'karpenter' }],
        },
      },
    });

    const addOns: Array<blueprints.ClusterAddOn> = [
      new blueprints.addons.AwsLoadBalancerControllerAddOn(),
      new blueprints.addons.CalicoOperatorAddOn(),
      new blueprints.addons.CoreDnsAddOn(),
      new blueprints.addons.KubeProxyAddOn(),
      new blueprints.addons.MetricsServerAddOn(),
      new blueprints.addons.VpcCniAddOn(),
      new blueprints.addons.KarpenterAddOn({
        amiFamily: 'AL2',
        provisionerSpecs: {
          'karpenter.sh/capacity-type': ['spot'],
          'kubernetes.io/arch': ['amd64', 'arm64'],
          'topology.kubernetes.io/zone': ['eu-west-1a', 'eu-west-1b'],
        },
        subnetTags: {
          'karpenter.sh/discovery': 'eks-lab',
        },
        securityGroupTags: {
          'karpenter.sh/discovery': 'eks-lab',
        },
      }),
    ];

    const resourceProviders = new Map<string, blueprints.ResourceProvider<IResource>>([
      [
        blueprints.GlobalResources.Vpc,
        new blueprints.DirectVpcProvider(props.vpc),
      ],
    ]);

    new blueprints.EksBlueprint(
      this,
      {
        addOns,
        clusterProvider,
        resourceProviders,
        id: 'eks-lab',
      },
      props,
    );
  }
}
```
Result:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
fargate-ip-10-100-36-0.eu-west-1.compute.internal Ready <none> 21h v1.21.9-eks-14c7a48 10.100.36.0 <none> Amazon Linux 2 4.14.281-212.502.amzn2.x86_64 containerd://1.4.13
ip-10-100-17-193.eu-west-1.compute.internal Ready <none> 24h v1.21.12-eks-5308cf7 10.100.17.193 <none> Amazon Linux 2 5.4.204-113.362.amzn2.x86_64 containerd://1.4.13
ip-10-100-23-20.eu-west-1.compute.internal Ready <none> 24h v1.21.12-eks-5308cf7 10.100.23.20 <none> Amazon Linux 2 5.4.204-113.362.amzn2.x86_64 containerd://1.4.13
ip-10-100-23-21.eu-west-1.compute.internal Ready <none> 24h v1.21.12-eks-5308cf7 10.100.23.21 <none> Amazon Linux 2 5.4.204-113.362.amzn2.x86_64 containerd://1.4.13
kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-apiserver calico-apiserver-5d4577557c-66tm7 1/1 Running 0 24h 10.100.18.219 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
calico-apiserver calico-apiserver-5d4577557c-jjxb6 1/1 Running 0 24h 10.100.22.55 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-operator tigera-operator-57b5454687-z7jm7 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
calico-system calico-kube-controllers-57f88bc9fd-g2pjm 1/1 Running 0 24h 10.100.16.132 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-system calico-node-d5kfv 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
calico-system calico-node-vs6bc 1/1 Running 0 24h 10.100.23.20 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
calico-system calico-node-wmdrr 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-system calico-typha-5857f899bd-v94bz 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
calico-system calico-typha-5857f899bd-zfd2p 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
karpenter blueprints-addon-karpenter-7bb874498-n8kmj 2/2 Running 0 21h 10.100.36.0 fargate-ip-10-100-36-0.eu-west-1.compute.internal <none> <none>
kube-system aws-load-balancer-controller-7cb845b549-t79qw 1/1 Running 0 22h 10.100.30.97 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system aws-load-balancer-controller-7cb845b549-xcm7b 1/1 Running 0 22h 10.100.29.92 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system aws-node-5d2bw 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system aws-node-gjftc 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
kube-system aws-node-ngtzj 1/1 Running 0 24h 10.100.23.20 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system blueprints-addon-metrics-server-c758cc974-dwr58 1/1 Running 0 22h 10.100.29.4 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system coredns-7cc879f8db-2hjl5 1/1 Running 0 21h 10.100.23.66 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
kube-system coredns-7cc879f8db-fx2wh 1/1 Running 0 21h 10.100.25.232 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-fz2gg 1/1 Running 0 24h 10.100.17.193 ip-10-100-17-193.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-mmtzq 1/1 Running 0 24h 10.100.23.21 ip-10-100-23-21.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-twgjr 1/1 Running 0 24h 10.100.23.20 ip-10-100-23-20.eu-west-1.compute.internal <none> <none>
All instances are of type m5.large and landed in the same AZ, although 2 AZs were discovered:
controller.node-state Discovered subnets: [subnet-0058d68fa0e6c93fa (eu-west-1a) subnet-0ed7d9dc5dac8e058 (eu-west-1b)]
```
aws ec2 describe-instances | jq '.Reservations[].Instances[] | {Type: .InstanceType, DNS: .PrivateDnsName, AZ: .Placement.AvailabilityZone}'
{
  "Type": "m5.large",
  "DNS": "ip-10-100-23-21.eu-west-1.compute.internal",
  "AZ": "eu-west-1a"
}
{
  "Type": "m5.large",
  "DNS": "ip-10-100-17-193.eu-west-1.compute.internal",
  "AZ": "eu-west-1a"
}
{
  "Type": "m5.large",
  "DNS": "ip-10-100-23-20.eu-west-1.compute.internal",
  "AZ": "eu-west-1a"
}
```
This is similar to https://github.com/aws/karpenter/issues/1810 . Kubernetes provides native methods to indicate which workloads you want to spread across AZs. You'll need to add topology spread constraints to your workloads. See https://karpenter.sh/v0.13.2/tasks/scheduling/#topology-spread and https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/ .
Hello guys, I've tried several combinations of PodAffinity/AntiAffinity and TopologySpread as @tzneal mentioned, trying to keep pods away from each other across nodes and AZs. Karpenter discovered my 3 subnets, BTW.
But in every single test round (blasting 20 "inflate" pods at the cluster), I got the same behavior described by @rolindroy: Karpenter provisions enough nodes to cope with the demand, but they all land in the same subnet/zone. In every test round a different subnet was chosen, but then all 4~5 nodes were created in it (in only one subnet/AZ). I think, for the sake of availability, these new nodes should be spread over the subnets/AZs.
Regarding Affinity and Topology Spread, does anyone have a working configuration example to share?
Is this actual behavior of Karpenter the expected one? I think we should be able to choose how it spreads the nodes. The Kubernetes scheduler will use whatever nodes are available, respecting options like Affinity and Topology, so it is Karpenter's job to provision them over different subnets/AZs if we tell it to do so.
Regards!
Guys, new findings. Got help from a colleague digging through the docs (thanks, Alex!).
I've tested the deployment with Topology Spread Constraints, using the topologyKey `topology.kubernetes.io/zone` but with the `whenUnsatisfiable` parameter set to `DoNotSchedule`, and this did the trick for me. Using the `ScheduleAnyway` value there doesn't drive Karpenter to spread nodes. For the `kubernetes.io/hostname` topologyKey I left it as `ScheduleAnyway`.
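Sketching the combination described above, for a hypothetical workload labeled `app: inflate` (labels are placeholders):

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # hard constraint: drives Karpenter to provision across zones
  labelSelector:
    matchLabels:
      app: inflate
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway  # soft preference for spreading across individual nodes
  labelSelector:
    matchLabels:
      app: inflate
```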
In the Karpenter docs, this section (https://karpenter.sh/v0.16.3/tasks/scheduling/#topology-spread) should be more specific on this topic. The definitions in the Kubernetes docs do not make this clear either (https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#spread-constraint-definition).
> I've tested the deployment with Topology Spread Constraints, using the topologyKey `topology.kubernetes.io/zone` but with the `whenUnsatisfiable` parameter set to `DoNotSchedule`, and this did the trick for me. Using the `ScheduleAnyway` value there doesn't drive Karpenter to spread nodes. For the `kubernetes.io/hostname` topologyKey I left it as `ScheduleAnyway`.

I expect Karpenter to balance the number of nodes across AZs without changing pod specs, though, both during node provisioning and consolidation. Can this be included in the roadmap?
Please reopen this issue?
> All the nodes were placed in the same subnet and same AZ (us-west-1a).
>
> When using spot, Karpenter will choose the cheapest instance type. In this case, it looks like us-west-1a was the cheapest.
Is it possible to use different provisioning logic, just like expanders with Cluster Autoscaler? CA even allows users to list the expanders to use in descending order of importance.
At least, this should be stated clearly in the doc: that Karpenter chooses the cheapest instance type when using spot instances, and that Karpenter does not consider AZ balancing; users have to configure pod topology spread with the `whenUnsatisfiable: DoNotSchedule` property to ensure AZ spread.
> At least, this should be stated clearly in the doc: that Karpenter chooses the cheapest instance type when using spot instances

It isn't referenced in many places, but this bit of the FAQ covers it: https://karpenter.sh/v0.19.2/faq/#how-does-karpenter-dynamically-select-instance-types
Karpenter recently changed from `capacity-optimized-prioritized` to `price-capacity-optimized`.
Re: topology, I agree a better job could be done here. I too was surprised, when I started using Karpenter, that it wasn't scheduling pods in different AZs. But Karpenter doesn't know your workloads; it doesn't know if they are sensitive to AZ costs, for example. So it's up to the operator to spec the jobs to be topology aware or not.
And once that is coded, the kube-scheduler will place them accordingly.
/reopen please - I get the argument on pricing; however, the spot price will often be identical across zones, so in that circumstance I'd expect Karpenter to bias towards a zone spread as a default.
Thanks for continuing to explore this issue. I could see an argument for providing a configuration knob at the provisioner level. e.g.
```yaml
kind: Provisioner
spec:
  requirements: ...
  topologySpreadConstraints: # spread nodes
```
We'd need to think through the implications of it, though.
We definitely can't enable spread for spot by default, as it goes against Karpenter's cost optimization goals.
I would agree; I still want to be able to have the provisioner spread my nodes across AZs even when using spot instances. I still want the cost benefits of spot instances, but also to have them evenly provisioned across AZs.
I agree with @nparfait.
@rolindroy @junowong0114 @nparfait Have you tried looking at https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints to define cluster-wide topologySpreadConstraints that could spread the pods across the nodes?
The concern with doing implied `topologySpreadConstraints` without the workload requirement is that there is no hard requirement for the kube-scheduler to schedule those pods across topologies, which means that we could launch nodes across topologies (say 3 nodes across 3 domains) and then the kube-scheduler binds all pods to two nodes, meaning that one is empty and we would deprovision one of the nodes in one of those domains.
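For anyone considering the cluster-level defaults linked above, a rough sketch of the kube-scheduler configuration they require (assumption: you can pass a config file to the kube-scheduler, which managed control planes such as EKS do not expose):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          # Applied only to pods that define no topologySpreadConstraints themselves.
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
```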
> @rolindroy @junowong0114 @nparfait Have you tried looking at https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints to define cluster-wide topologySpreadConstraints that could spread the pods across the nodes?
>
> The concern with doing implied `topologySpreadConstraints` without the workload requirement is that there is no hard requirement for the kube-scheduler to schedule those pods across topologies, which means that we could launch nodes across topologies (say 3 nodes across 3 domains) and then the kube-scheduler binds all pods to two nodes, meaning that one is empty and we would deprovision one of the nodes in one of those domains.
This isn't to do with pod scheduling. This is to do with nodes being placed in different AZs when multiple spot instances are provisioned.
> in different AZs when multiple spot instances are provisioned
What's the reason you want to spread them this way? Is it to reduce the blast radius of an AZ outage? To reduce the chance of being reclaimed? Why do you want this at the node level and not at the application level (which is generally where I assume the resiliency requirement needs to lie)?
> in different AZs when multiple spot instances are provisioned
>
> What's the reason you want to spread them this way? Is it to reduce the blast radius of an AZ outage? To reduce the chance of being reclaimed? Why do you want this at the node level and not at the application level (which is generally where I assume the resiliency requirement needs to lie)?
I do have my pods spread across different nodes in different AZs. The issue here was I had Karpenter provision 2 spot nodes in the same AZ instead of across 2 AZs.
> I do have my pods spread across different nodes in different AZs
I'm confused then. Do your workloads have `topologySpreadConstraints` with the `topologyKey` set to `topology.kubernetes.io/zone` and a `maxSkew` of 1? If so, Karpenter should spread these pods across zones and start provisioning nodes evenly across zones.
> @rolindroy @junowong0114 @nparfait Have you tried looking at https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#cluster-level-default-constraints to define cluster-wide topologySpreadConstraints that could spread the pods across the nodes?
>
> The concern with doing implied `topologySpreadConstraints` without the workload requirement is that there is no hard requirement for the kube-scheduler to schedule those pods across topologies, which means that we could launch nodes across topologies (say 3 nodes across 3 domains) and then the kube-scheduler binds all pods to two nodes, meaning that one is empty and we would deprovision one of the nodes in one of those domains.
I have a set of test-environment workloads in different namespaces that I'd like to have spread across multiple AZs so that I can avoid resource exhaustion. See #2921.
The kube topologySpreadConstraints API was enhanced with minDomains so that it can spread pods across multiple AZs even when no nodes exist in an AZ yet. However, it only does this within a specific namespace, and I don't have enough pods within each namespace to get an effective spread.
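For illustration, a sketch of such a constraint; note that `minDomains` only takes effect with `whenUnsatisfiable: DoNotSchedule`, and the `app: test-env` label is a placeholder:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 3  # fewer than 3 zones with matching pods counts as unsatisfiable, even if some zones have no nodes yet
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # required for minDomains to apply
  labelSelector:
    matchLabels:
      app: test-env
```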
Since this is a test environment, I'm not concerned about high availability of any one test env. At most, I am hoping for quick failover in the event of a zone failure. Having some existing capacity in other zones would help. I'm also hoping that some AZ spread would help mitigate the IP exhaustion that can occur when all nodes are in a single zone/subnet.
Improvements towards IP address exhaustion in #2921 would reduce my desire for this feature.
You can apply the same label to multiple different deployments and use a topology spread across that instead:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        spread: myspread
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-deployment
  labels:
    app: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
        spread: myspread
```

with each pod spec carrying the shared constraint:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      spread: myspread
```
I don't expect we'll start intentionally putting nodes in different AZs unless requested via scheduling constraints on the workloads themselves.
@jonathan-innis It is a known limitation of topology spread constraints that the set of available values of a target topology, e.g. `topology.kubernetes.io/zone`, is not known to the scheduler; cf. https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#known-limitations.
Thus, if nodes are provisioned in one availability zone only, the scheduler will happily schedule a new pod there: zones with no nodes are not counted as domains, so the skew (the number of pods in each topology domain minus the minimum number of pods over all topology domains) is zero in this case.
To overcome the problem, you must somehow control how nodes are provisioned across AZs before the scheduler starts its job of scheduling particular pods. More precisely, you need a strategy that provides all required values for the label `topology.kubernetes.io/zone` before requesting pod scheduling from the scheduler.
Putting it all together, I'm not sure Karpenter can solve this more general problem, though I agree that Karpenter should allow provisioning across different AZs if requested, even when requesting spot instances.
Thanks @midu-git. What do you use then, Cluster Autoscaler? Anyway, why is this issue closed then? Shouldn't it at least be documented as a limitation?
Version
Karpenter: v0.5.6
Kubernetes: v1.21.4
Expected Behavior
Karpenter should place the nodes in multiple subnets
Actual Behavior
Karpenter creates all the nodes in the same subnet even though it was able to discover all the available subnets using the subnet selector.
Resource Specs and Logs
All the nodes were placed in the same subnet and same AZ (us-west-1a).