aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Ephemeral volume claims not taken into account when scheduling? #2605

Open spmason opened 2 years ago

spmason commented 2 years ago

Version

Karpenter Version: v0.16.3

Kubernetes Version: v1.21.14

Expected Behavior

Karpenter should take into account ephemeral volumeClaimTemplates when scheduling pods to nodes

Actual Behavior

Karpenter does not appear to take ephemeral storage into account when deciding whether it can schedule a pod to a node, and thus does not spin up new nodes to meet the additional capacity.

Steps to Reproduce the Problem

I have a number of pods with the following spec:

  resources:
    limits:
      cpu: 6
      memory: 45G
    requests:
      cpu: 5
      memory: 45G
...
  volumes:
   - ephemeral:
        volumeClaimTemplate:
...
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: localnvme
            resources:
              requests:
                storage: 1850G

I've got TopoLVM provisioning the localnvme storage class on nodes, based on the local NVMe drives in the machines.
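
For context, the localnvme class would be something like the following TopoLVM-backed StorageClass. This is only a minimal sketch, assuming the topolvm.io CSI provisioner with delayed binding; the exact provisioner name and parameters depend on the TopoLVM version and are not taken from this issue.

# Sketch only: a TopoLVM-backed StorageClass similar to "localnvme".
# Assumes the topolvm.io CSI provisioner; older releases use a different name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: localnvme
provisioner: topolvm.io
parameters:
  csi.storage.k8s.io/fstype: xfs   # illustrative filesystem choice
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true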

My situation is that I run 4 of these pods and specify the i4i family in my provisioner spec. Karpenter spins up an i4i.8xlarge to run them on, which the four pods fill up.

Now if I scale my deployment up to 5 pods, Karpenter decides that the new pod can schedule on the existing node: the CPU and memory "fit", but the storage does not, so Kubernetes refuses to schedule the pod there and it sits in "Pending" status with the following events:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Normal   Nominate          5m13s (x58 over 120m)  karpenter          Pod should schedule on <node with 4 pods already running>
  Warning  FailedScheduling  12s (x125 over 116m)   default-scheduler  0/20 nodes are available: 1 node(s) did not have enough free storage, 1 node(s) had taint {tainted: }, that the pod didn't tolerate, 10 node(s) didn't match Pod's node affinity/selector, 4 Insufficient memory, 8 Insufficient cpu.

Note that in this case I can just increase the cpu/memory requests of my pods to force Karpenter into the correct scheduling decision (see the sketch below), but this falls apart as you move up to bigger node sizes and different instance families with different specs.
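
For example, the workaround amounts to padding the requests until CPU/memory bin-packing alone rules out a fifth pod per node. A sketch with purely illustrative numbers (an i4i.8xlarge has 32 vCPUs, so requesting 7 CPUs caps a node at 4 of these pods):

  resources:
    limits:
      cpu: 7            # illustrative padding, not a real workload requirement
      memory: 45G
    requests:
      cpu: 7            # 5 x 7 vCPU > 32 vCPU, so a 5th pod forces a new node
      memory: 45G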

Resource Specs and Logs

See steps to reproduce

ellistarn commented 2 years ago

We don't currently support this, but it would be very interesting to explore. Future readers, remember to +1 if this is interesting to you.

Roberdvs commented 3 months ago

We sometimes encounter another issue that I believe should also be solved by supporting this.

When pods have ephemeral volumes backed by EBS, they sometimes get stuck in the ContainerCreating state. Karpenter seemingly does not take the ephemeral volume into account when picking the pod's zone, as it does with its Persistent Volume Topology capability, and tries to schedule it in a zone different from where the EBS volume was created.

Some logs:

Error:

  Warning  FailedAttachVolume  3m36s (x31 over 109m)  attachdetach-controller  (combined from similar events): AttachVolume.Attach failed for volume "pvc-*************" : rpc error: code = Internal desc = Could not attach volume "vol-*************" to node "i-*************": could not attach volume "vol-*************" to node "i-*************": operation error EC2: AttachVolume, https response error StatusCode: 400, RequestID: *************, api error InvalidVolume.ZoneMismatch: The volume 'vol-*************' is not in the same availability zone as instance 'i-*************'
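
One general Kubernetes/EBS CSI point worth checking here (not something confirmed in this issue): zone-aware scheduling of EBS-backed claims relies on the StorageClass using WaitForFirstConsumer binding, so the volume's zone is only chosen once the pod is scheduled; with Immediate binding a zone mismatch like the one above is much easier to hit. A minimal sketch, with the class name and gp3 type being illustrative:

# Sketch only: an EBS CSI StorageClass with delayed volume binding.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer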

zemliany commented 2 months ago

Could be similar to https://github.com/aws/karpenter-provider-aws/issues/2394

angelbarrera92 commented 3 weeks ago

This issue is hitting us too.

From the Karpenter logs, whenever there is a new instance it reports that its allocatable ephemeral-storage is 17Gi (I guess this is configurable somehow; see the sketch after the log line below).

karpenter-5d77c9f5-6vwbx controller {"level":"INFO","time":"2024-10-09T07:27:22.941Z","logger":"controller.machine.lifecycle","message":"launched machine","commit":"34d50bf-dirty","machine":"spot-provisioner-west-1c-karp1-c-jpq7k","provisioner":"spot-provisioner-west-1c-karp1-c","provider-id":"aws:///eu-west-1c/i-0741fd2982c837e92","instance-type":"m6a.4xlarge","zone":"eu-west-1c","capacity-type":"spot","allocatable":{"cpu":"15890m","ephemeral-storage":"17Gi","memory":"57691Mi","pods":"110"}}
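
The 17Gi of allocatable ephemeral-storage reported above comes from the node's root volume. In the provisioner-era Karpenter shown in these logs that is typically sized via blockDeviceMappings on the AWSNodeTemplate (EC2NodeClass in newer releases). A rough sketch; the template name, device name, and sizes are assumptions, not values from this issue:

# Sketch only: enlarging the root volume via an AWSNodeTemplate.
# The device name depends on the AMI family.
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: karp1
spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        deleteOnTermination: true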

Then, if we create a deployment using an ephemeral.volumeClaimTemplate (EBS-backed instead of using the root volume):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ephemeral-volumes
  namespace: co
spec:
  replicas: 0
  selector:
    matchLabels:
      app: test-ephemeral-volumes
  template:
    metadata:
      labels:
        app: test-ephemeral-volumes
    spec:
      tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: karp1
      nodeSelector:
        managed-by: karpenter
        nodepool: karp1
      containers:
      - name: shell
        image: alpine:3.20
        command: ["/bin/sh"]
        args: ["-c", "while true; do date > /tmp/emptydir/date; sleep 1; done"]
        volumeMounts:
          - name: my-emptydir
            mountPath: /tmp/emptydir
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          runAsUser: 1000
          runAsGroup: 1000
          runAsNonRoot: true
          readOnlyRootFilesystem: true
          seccompProfile:
            type: RuntimeDefault
        resources:
          requests:
            memory: "16Mi"
            cpu: "10m"
          limits:
            memory: "32Mi"
            cpu: "20m"
      volumes:
      # Use an ephemeral volume for the emptyDir
        - name: my-emptydir
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                resources:
                  requests:
                    storage: 1Gi

And when we scale it to 16 replicas, the node can handle all 16 of them (16Gi of "ephemeral volumes" in use). But whenever we schedule another 2 replicas, Karpenter spins up another node. This is unexpected because:

  1. m6a.4xlarge nodes support 32 EBS volume attachments (source)
  2. ephemeral-storage usage on the node is zero, as reported by kubectl describe node:
Capacity:
  cpu:                16
  ephemeral-storage:  183411200Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             129121204Ki
  pods:               110
Allocatable:
  cpu:                15400m
  ephemeral-storage:  182362624Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             127212468Ki
  pods:               110

Any chance of making this behavior work as expected?