aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter allocatable memory calculation #5104

Closed ricardomiguel-os closed 11 months ago

ricardomiguel-os commented 11 months ago

Description

Observed Behavior: A pending pod requesting 768Mi memory cannot be scheduled because Karpenter provisions a node that doesn't have enough memory.

It seems that Karpenter is miscalculating the node's allocatable memory.

Expected Behavior: Karpenter should provision a node with the necessary resources to schedule the pod.

Reproduction Steps (Please include YAML):

Use Case 1 - Without reserved capacity configured

Bottlerocket User Data

  userData: |
    [settings.kernel]
    lockdown = "integrity"
    [settings.host-containers.control]
    enabled = true
    [settings.host-containers.admin]
    enabled = true
    [settings.kubernetes.eviction-hard]
    "memory.available" = "5%"

Karpenter logs

2023-11-17T09:46:47.999Z    INFO    controller.provisioner    found provisionable pod(s)    {"commit": "61b3e1e-dirty", "pods": "pod-ns/my-pod", "duration": "97.756968ms"}
2023-11-17T09:46:47.999Z    INFO    controller.provisioner    computed new machine(s) to fit pod(s)    {"commit": "61b3e1e-dirty", "machines": 1, "pods": 1}
2023-11-17T09:46:48.082Z    INFO    controller.provisioner    created machine    {"commit": "61b3e1e-dirty", "provisioner": "my_prov", "machine": "my_prov-rqvzj", "requests": {"cpu":"1366m","memory":"3048Mi","pods":"14"}, "instance-types": "c3.xlarge, c4.xlarge, c5.2xlarge, c5.large, c5.xlarge and 95 other(s)"}
2023-11-17T09:46:50.604Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "61b3e1e-dirty", "machine": "my_prov-rqvzj", "provisioner": "my_prov", "provider-id": "aws:///us-east-1c/i-051d5bf72c4f8zzz", "instance-type": "c6a.large", "zone": "us-east-1c", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"89Gi","memory":"3114Mi","pods":"29"}}
2023-11-17T09:46:57.226Z    INFO    controller.deprovisioning    deprovisioning via consolidation delete, terminating 1 machines ip-10-64-129-11.ec2.internal/c6a.large/on-demand    {"commit": "61b3e1e-dirty"}
2023-11-17T09:46:57.300Z    INFO    controller.termination    cordoned node    {"commit": "61b3e1e-dirty", "node": "ip-10-64-129-11.ec2.internal"}
2023-11-17T09:46:57.705Z    INFO    controller.termination    deleted node    {"commit": "61b3e1e-dirty", "node": "ip-10-64-129-11.ec2.internal"}
2023-11-17T09:46:58.069Z    INFO    controller.machine.termination    deleted machine    {"commit": "61b3e1e-dirty", "machine": "my_prov-hcznj", "provisioner": "my_prov", "node": "ip-10-64-129-11.ec2.internal", "provider-id": "aws:///us-east-1c/i-051d5bf72c4f8zzz"}

Provisioned node

Instance type: c6a.large

Capacity:
  cpu:                         2
  ephemeral-storage:           103189828Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3880460Ki
  pods:                        29
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         1930m
  ephemeral-storage:           102141252Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3173028862
  pods:                        29
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         966m (50%)    3212m (166%)
  memory                      2280Mi (75%)  6388Mi (211%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0

Memory capacity - Memory allocatable = 3880460Ki - 3173028862 bytes = 3880460Ki - 3098661Ki = 781799Ki ≈ 763Mi → 20.1% of the initial capacity is lost.

Without any reserved capacity configured on the Bottlerocket AWSNodeTemplate, we lose around 20% of the memory from the initial capacity.

Since the pod needs 768Mi, adding the DaemonSet requests (2280Mi + 768Mi) gives the memory needed to schedule the pod: 3048Mi.

So we need 3048Mi, but the allocatable memory is only 3026Mi; 3048Mi - 3026Mi = 22Mi short, so the pod cannot be scheduled.
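
This matches the standard kubelet formula, allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold. Assuming Bottlerocket's default kube-reserved memory of roughly 255Mi + 11Mi per max pod (an assumption, not shown in the logs; 574Mi at 29 pods), the numbers line up:

  Capacity                                      ≈ 3789.5Mi (3880460Ki)
  - eviction-hard (5% of capacity)              ≈  189.5Mi
  - kube-reserved (assumed 255Mi + 11Mi × 29)   ≈  574.0Mi
  = Allocatable                                 ≈ 3026.0Mi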

Use Case 2 - With reserved capacity configured

Bottlerocket User Data

  userData: |
    [settings.kernel]
    lockdown = "integrity"
    [settings.host-containers.control]
    enabled = true
    [settings.host-containers.admin]
    enabled = true
    [settings.kubernetes.eviction-hard]
    "memory.available" = "5%"
    [settings.kubernetes.system-reserved]
    memory = "256Mi"
    [settings.kubernetes.kube-reserved]
    memory = "256Mi"

This configures 512Mi of reserved memory in total (256Mi system-reserved + 256Mi kube-reserved).

Karpenter logs

2023-11-17T10:12:41.338Z    INFO    controller.provisioner  found provisionable pod(s)  {"commit": "61b3e1e-dirty", "pods": "pod-ns/my-pod", "duration": "42.674421ms"}
2023-11-17T10:12:41.338Z    INFO    controller.provisioner  computed new machine(s) to fit pod(s)   {"commit": "61b3e1e-dirty", "machines": 1, "pods": 1}
2023-11-17T10:12:41.417Z    INFO    controller.provisioner  created machine {"commit": "61b3e1e-dirty", "provisioner": "my_prov", "machine": "my_prov-7c42d", "requests": {"cpu":"1366m","memory":"3048Mi","pods":"14"}, "instance-types": "c3.xlarge, c4.xlarge, c5.2xlarge, c5.large, c5.xlarge and 95 other(s)"}
2023-11-17T10:22:52.816Z    DEBUG    controller.machine.lifecycle    discovered instance types    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "count": 758}
2023-11-17T10:23:54.050Z    DEBUG    controller.machine.lifecycle    created launch template    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "launch-template-name": "karpenter.k8s.aws/17625316294275365xxx", "id": "lt-0911a40a5bb5fxxxx"}
2023-11-17T10:23:54.187Z    DEBUG    controller.machine.lifecycle    created launch template    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "launch-template-name": "karpenter.k8s.aws/3179703506823472xxx", "id": "lt-01d30cf66f3xxxx"}
2023-11-17T10:23:54.344Z    DEBUG    controller.machine.lifecycle    created launch template    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "launch-template-name": "karpenter.k8s.aws/622317665377390xxxx", "id": "lt-0a18402cd33fxxxx"}
2023-11-17T10:23:56.340Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "provider-id": "aws:///us-east-1c/i-0f32454d9exxxx", "instance-type": "c6a.large", "zone": "us-east-1c", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"89Gi","memory":"3114Mi","pods":"29"}}
2023-11-17T10:24:18.791Z    DEBUG    controller.machine.lifecycle    registered machine    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "provider-id": "aws:///us-east-1c/i-0f32454d9ef8xxxx", "node": "ip-10-64-144-11.ec2.internal"}
2023-11-17T10:24:37.486Z    DEBUG    controller.machine.lifecycle    initialized machine    {"commit": "61b3e1e-dirty", "machine": "my_prov-7c42d", "provisioner": "my_prov", "provider-id": "aws:///us-east-1c/i-0f32454d9ef867xxx", "node": "ip-10-64-144-11.ec2.internal"}

Provisioned node

Instance type: c6a.large

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         2
  ephemeral-storage:           103189828Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3880460Ki
  pods:                        29
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         1930m
  ephemeral-storage:           102141252Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3238040574
  pods:                        29
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1366m (70%)   5712m (295%)
  memory                      3048Mi (98%)  8436Mi (273%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0

After setting the reserved capacity (512Mi) on the Bottlerocket AWSNodeTemplate, we were able to schedule the pod onto the machine.

Memory capacity - Memory allocatable = 3880460Ki - 3238040574 bytes = 3880460Ki - 3162149Ki = 718311Ki ≈ 701Mi → 18.5% of the initial capacity is lost, which is less than before, even with the reserved memory configured.

Again we need 3048Mi, but now the allocatable memory is 3088Mi, which is why the pod now fits.
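
The same formula explains the new value, assuming the explicit reservations replace the defaults:

  Capacity                                      ≈ 3789.5Mi (3880460Ki)
  - eviction-hard (5% of capacity)              ≈  189.5Mi
  - system-reserved + kube-reserved             =  512.0Mi
  = Allocatable                                 ≈ 3088.0Mi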

Note: Our Provisioners don't have spec.kubeletConfiguration configured.

Versions:

ricardomiguel-os commented 11 months ago

After removing the eviction-hard setting from the AWSNodeTemplate, the scheduler could schedule the pod. Does this mean that Karpenter doesn't take the AWSNodeTemplate user data configuration into account, while the scheduler does?

If we want to change these values, can we only change them on the Provisioners?

The documentation mentions that we can configure these fields in user data.

tzneal commented 11 months ago

You should set it on the provisioner/nodepool (see https://karpenter.sh/docs/concepts/nodepools/#eviction-thresholds). Those are then copied down to the Bottlerocket userdata. Karpenter needs these values for scheduling, which is why they're at the provisioner/nodepool level.
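
For reference, a minimal sketch of that configuration on the v1alpha5 Provisioner used in this thread (on the v1beta1 NodePool API the same block lives under spec.template.spec.kubelet); the values simply mirror the userData from the reproduction:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: my_prov   # name taken from the logs above
spec:
  kubeletConfiguration:
    # Karpenter subtracts these from instance capacity when computing
    # allocatable for scheduling, and also copies them down into the
    # generated Bottlerocket userdata.
    systemReserved:
      memory: 256Mi
    kubeReserved:
      memory: 256Mi
    evictionHard:
      memory.available: "5%"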

ricardomiguel-os commented 11 months ago

OK, maybe that documentation showing the eviction-hard configuration in user data misled us, since this should be configured at the provisioner/nodepool level.

Just one more question: if we don't set any values in the provisioner/nodepool kubeletConfiguration, Karpenter will assume default values, right? Are those defaults also copied down to the userdata, overriding it?