spmason opened this issue 2 years ago
We don't currently support this, but it would be very interesting to explore. Future readers, remember to +1 if this is interesting to you.
We sometimes encounter a related issue that I believe would also be solved by supporting this.
When pods have ephemeral volumes backed by EBS, they sometimes get stuck in the ContainerCreating state. Karpenter does not seem to take the ephemeral volume into account when picking the pod's zone (the way it does with its persistent volume topology support), so it ends up scheduling the pod in a zone different from the one where the EBS volume was created.
Some logs:
Error:
Warning FailedAttachVolume 3m36s (x31 over 109m) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-*************" : rpc error: code = Internal desc = Could not attach volume "vol-*************" to node "i-*************": could not attach volume "vol-*************" to node "i-*************": operation error EC2: AttachVolume, https response error StatusCode: 400, RequestID: *************, api error InvalidVolume.ZoneMismatch: The volume 'vol-*************' is not in the same availability zone as instance 'i-*************'
Could be similar to https://github.com/aws/karpenter-provider-aws/issues/2394
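Not a fix for the scheduling gap itself, but one thing worth checking when this error appears: with volumeBindingMode: Immediate on the StorageClass, the EBS volume's availability zone is fixed as soon as the PVC is created, before any node exists, while WaitForFirstConsumer defers provisioning until the pod has been scheduled to a node. A minimal sketch of the latter, assuming the EBS CSI driver (the class name and gp3 parameter are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait-for-consumer      # illustrative name
provisioner: ebs.csi.aws.com       # AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer

Even with this binding mode the feature request stands; it only changes when the volume's zone gets pinned.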
This issue is hitting us too.
In the Karpenter logs, whenever a new instance is launched it reports an ephemeral-storage allocatable of 17Gi (I guess this is configurable somehow):
karpenter-5d77c9f5-6vwbx controller {"level":"INFO","time":"2024-10-09T07:27:22.941Z","logger":"controller.machine.lifecycle","message":"launched machine","commit":"34d50bf-dirty","machine":"spot-provisioner-west-1c-karp1-c-jpq7k","provisioner":"spot-provisioner-west-1c-karp1-c","provider-id":"aws:///eu-west-1c/i-0741fd2982c837e92","instance-type":"m6a.4xlarge","zone":"eu-west-1c","capacity-type":"spot","allocatable":{"cpu":"15890m","ephemeral-storage":"17Gi","memory":"57691Mi","pods":"110"}}
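For context on where that 17Gi comes from: a node's ephemeral-storage is normally just its root EBS volume, and assuming a Karpenter release that still uses the v1alpha1 AWSNodeTemplate API (as the machine/provisioner fields in the log above suggest), it can be sized via blockDeviceMappings. A sketch with illustrative names and values:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: karp1                              # illustrative; referenced by the provisioner's providerRef
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster     # illustrative discovery tags
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda                # root device name depends on the AMI family
      ebs:
        volumeSize: 100Gi
        volumeType: gp3

That only changes the node's own ephemeral-storage, though; the behaviour described below with EBS-backed ephemeral volumes is a separate question.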
Then we create a Deployment that uses an ephemeral volumeClaimTemplate (EBS-backed, instead of using the root volume):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ephemeral-volumes
  namespace: co
spec:
  replicas: 0
  selector:
    matchLabels:
      app: test-ephemeral-volumes
  template:
    metadata:
      labels:
        app: test-ephemeral-volumes
    spec:
      tolerations:
        - effect: NoSchedule
          key: dedicated
          operator: Equal
          value: karp1
      nodeSelector:
        managed-by: karpenter
        nodepool: karp1
      containers:
        - name: shell
          image: alpine:3.20
          command: ["/bin/sh"]
          args: ["-c", "while true; do date > /tmp/emptydir/date; sleep 1; done"]
          volumeMounts:
            - name: my-emptydir
              mountPath: /tmp/emptydir
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsUser: 1000
            runAsGroup: 1000
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            seccompProfile:
              type: RuntimeDefault
          resources:
            requests:
              memory: "16Mi"
              cpu: "10m"
            limits:
              memory: "32Mi"
              cpu: "20m"
      volumes:
        # Use an ephemeral volume for the emptyDir
        - name: my-emptydir
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                resources:
                  requests:
                    storage: 1Gi
If we then scale it to 16 replicas, the node can handle all 16 (16Gi of "ephemeral volumes" in use). As soon as we schedule another 2 replicas, Karpenter spins up another node. This is unexpected, because the node itself reports roughly 174Gi of allocatable ephemeral-storage (output of the kubectl describe node command):

Capacity:
  cpu: 16
  ephemeral-storage: 183411200Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 129121204Ki
  pods: 110
Allocatable:
  cpu: 15400m
  ephemeral-storage: 182362624Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 127212468Ki
  pods: 110
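If the 17Gi allocatable figure from the launch log above is what Karpenter schedules against, the numbers line up: 16 replicas × 1Gi = 16Gi fits under 17Gi, while 18 × 1Gi = 18Gi does not, even though the node actually has ~174Gi of ephemeral-storage to offer.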
Any chance of making this behavior work?
Version
Karpenter Version: v0.16.3
Kubernetes Version: v1.21.14
Expected Behavior
Karpenter should take into account ephemeral volumeClaimTemplates when scheduling pods to nodes
Actual Behavior
Karpenter does not appear to take ephemeral storage into account when deciding whether it can schedule a pod to a node, and thus does not spin up new nodes to provide the additional capacity needed
Steps to Reproduce the Problem
I have a number of pods with the following spec:
I've got topolvm provisioning the localnvme storage class on nodes, backed by the local NVMe drives in the machines.

My situation is that I run 4 of these pods and specify the i4i family in my provisioner spec. Karpenter spins up an i4i.8xlarge to run them on, which fills up as so:

Now, if I scale my deployment up to 5 pods, Karpenter decides that the new pod can schedule to the existing node (the CPU and memory "fit", but the storage does not), so Kubernetes refuses to schedule the pod there and it sits in "Pending" status with the following Events:
Note that in this case I can just increase the CPU/memory requests of my pods to force Karpenter to make the correct scheduling decision, but this falls apart as you move to bigger node sizes and different instance families with different specs.
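For readers unfamiliar with the setup being described, a generic ephemeral volume against a topolvm-provisioned class looks roughly like this. This is an illustrative sketch, not the actual pod spec from above: the localnvme class name comes from the description, while the pod name and storage size are made up:

apiVersion: v1
kind: Pod
metadata:
  name: nvme-consumer                     # hypothetical name
spec:
  containers:
    - name: app
      image: alpine:3.20
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: scratch
          mountPath: /data
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: localnvme   # topolvm-backed class mentioned above
            resources:
              requests:
                storage: 1500Gi           # made-up size; large enough that storage, not CPU/memory, is the binding constraint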
Resource Specs and Logs
See steps to reproduce