aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter doesn't provision a new node if no PVs are found on the existing one #2244

Closed. kamialie closed this issue 1 year ago

kamialie commented 1 year ago

Version

Karpenter: v0.13.1

Kubernetes: v1.22.0

Expected Behavior

I have a StatefulSet with a PVC. The PVC requests a PV of the local type, which at the moment does not support dynamic provisioning. I am therefore using the local-static-provisioner project to create PVs on a new node. For background: a script mounts devices to a custom directory, which the provisioner then exposes as PVs in Kubernetes.

Ideally I would want Karpenter to react when no PV is available on the node where it expects a pending pod to be scheduled, and to launch a new node instead, but that is probably out of Karpenter's scope, so I'm curious about your opinion/advice here.
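
For reference, the PVs created by the static provisioner look roughly like the sketch below; the name, capacity, disk path, and node name are illustrative placeholders rather than values from my cluster:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-nvme0n1                 # illustrative name
spec:
  capacity:
    storage: 100Gi                       # illustrative size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: nvme-ssd             # the local StorageClass (shown later in this thread)
  local:
    path: /mnt/disks/nvme0n1             # illustrative mount path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - <node-name>            # placeholder for the node that owns the disk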

Actual Behavior

Since Karpenter provisioned an instance big enough (in terms of CPU/memory) for two pods, its logs indicate that the second pod should be scheduled on the existing node, but the pod stays pending because no PV is available there (I configured the static provisioner to create a single local PV per node, which was consumed by the first pod).

Steps to Reproduce the Problem

Statically provision a single PV on a node, create a StatefulSet with 2 replicas, and specify resource requests small enough that both pods can run on the same node.
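
For example, a pod-template fragment along these lines (the container name and request values are illustrative) keeps the requests small enough that two replicas fit on the node Karpenter launches:

      containers:
        - name: app                      # illustrative container name
          resources:
            requests:
              cpu: "1"                   # small enough for two replicas per node
              memory: 1Gi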

Resource Specs and Logs

Warning FailedScheduling 10s (x4 over 3m29s) default-scheduler 0/13 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 12 node(s) didn't match Pod's node affinity/selector.

DEBUG controller.events Normal {"commit": "1f7a67b", "object": {"kind":"Pod","namespace":"stateful","name":"hello-storage-1","uid":"7c4f75ee-4b32-44ca-8f45-e91013b76191","apiVersion":"v1","resourceVersion":"26158088"}, "reason": "NominatePod", "message": "Pod should schedule on ip-10-100-153-214.eu-central-1.compute.internal"}

tzneal commented 1 year ago

Can you provide a sample pod, PVC and PV spec that reproduces this issue?

kamialie commented 1 year ago

I'm currently working with a StatefulSet (see the example below), but I assume the behavior is the same for a Deployment, or any other controller that can run multiple Pods on the same node.

The PV is provisioned automatically by the project I referenced above.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hello-app
spec:
  serviceName: hello-app
  replicas: 2
  selector:
    matchLabels:
      name: hello-app
  template:
    metadata:
      labels:
        name: hello-app
    spec:
      nodeSelector:
        role: test
      containers:
        - name: hello-app
          image: <>
          volumeMounts:
            - name: data
              mountPath: /etc/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: nvme-ssd
        resources:
          requests:
            storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: hello-app
  labels:
    name: hello-storage
spec:
  clusterIP: None
  selector:
    name: hello-app

tzneal commented 1 year ago

Sorry, can you provide your storage class as well, and the output of:

kubectl get csinode node-name -o yaml

for a node where this is running?

kamialie commented 1 year ago

StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    meta.helm.sh/release-name: local-storage-provisioner
    meta.helm.sh/release-namespace: storage
  creationTimestamp: "2022-08-01T10:50:00Z"
  labels:
    app.kubernetes.io/instance: local-storage-provisioner
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: provisioner
    helm.sh/chart: provisioner-2.6.0-alpha.1
  name: nvme-ssd
  resourceVersion: "25633126"
  uid: 3cc791b3-d5df-4605-9665-1fd43ad278a5
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Node info:

apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2022-08-02T09:01:23Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g4dn.12xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1b
    karpenter.k8s.aws/instance-cpu: "48"
    karpenter.k8s.aws/instance-family: g4dn
    karpenter.k8s.aws/instance-gpu-count: "4"
    karpenter.k8s.aws/instance-gpu-manufacturer: nvidia
    karpenter.k8s.aws/instance-gpu-memory: "16384"
    karpenter.k8s.aws/instance-gpu-name: t4
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "196608"
    karpenter.k8s.aws/instance-pods: "234"
    karpenter.k8s.aws/instance-size: 12xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/initialized: "true"
    karpenter.sh/provisioner-name: test
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-100-153-214.eu-central-1.compute.internal
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: g4dn.12xlarge
    nvme: "true"
    role: test
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1b
  name: ip-10-100-153-214.eu-central-1.compute.internal
  ownerReferences:
  - apiVersion: karpenter.sh/v1alpha5
    blockOwnerDeletion: true
    kind: Provisioner
    name: test
    uid: db1b6c68-1717-4912-8e70-31336f33aa2b
  resourceVersion: "26239766"
  uid: 9ea5fa4e-9e50-4b69-95d9-1bc3aadeec6a
spec:
  providerID: aws:///eu-central-1b/i-05ed1ab0fa33a447a
status:
  addresses:
  - address: 10.100.153.214
    type: InternalIP
  - address: ip-10-100-153-214.eu-central-1.compute.internal
    type: Hostname
  - address: ip-10-100-153-214.eu-central-1.compute.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: 47810m
    ephemeral-storage: "103282620244"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 192687064Ki
    nvidia.com/gpu: "4"
    pods: "234"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "48"
    ephemeral-storage: 113233900Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 195686360Ki
    nvidia.com/gpu: "4"
    pods: "234"
  conditions:
  - lastHeartbeatTime: "2022-08-02T12:28:38Z"
    lastTransitionTime: "2022-08-02T09:02:50Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  - lastHeartbeatTime: "2022-08-02T12:28:38Z"
    lastTransitionTime: "2022-08-02T09:02:30Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2022-08-02T12:28:38Z"
    lastTransitionTime: "2022-08-02T09:02:30Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2022-08-02T12:28:38Z"
    lastTransitionTime: "2022-08-02T09:02:30Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - nvcr.io/nvidia/k8s-device-plugin@sha256:4918fdb36600589793b6a4b96be874a673c407e85c2cf707277e532e2d8a2231
    - nvcr.io/nvidia/k8s-device-plugin:v0.12.2
    sizeBytes: 109488523
  - names:
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni@sha256:3b6db8b6fb23424366ef91d7e9e818e42291316fa81c00c2c75dcafa614340c5
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1-eksbuild.1
    sizeBytes: 107971097
  - names:
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni-init@sha256:6c70af7bf257712105a89a896b2afb86c86ace865d32eb73765bf29163a08c56
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni-init:v1.10.1-eksbuild.1
    sizeBytes: 106951309
  - names:
    - docker.io/ethersphere/eks-local-disk-provisioner@sha256:bec6b3d15ea3501b5e8c03e9d2c39f2117753dfefa530fb70cfaa2a88ad1df19
    - docker.io/ethersphere/eks-local-disk-provisioner:latest
    sizeBytes: 98423131
  - names:
    - k8s.gcr.io/sig-storage/local-volume-provisioner@sha256:63859b69f9dfc0858e5d8746218e435c36e205c041fb6d8baf71ad132e24737f
    - k8s.gcr.io/sig-storage/local-volume-provisioner:v2.4.0
    sizeBytes: 40509761
  - names:
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/kube-proxy@sha256:c8abb4b8efc94090458f34e5f456791d9f7f57b5c99517b6b4e197305c1f10f6
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/kube-proxy:v1.22.6-eksbuild.1
    sizeBytes: 35948825
  - names:
    - quay.io/brancz/kube-rbac-proxy@sha256:6237b9f78f17fb0beafc99ff38602add6f51a0fdfa5395785f8d31a8f833e363
    - quay.io/brancz/kube-rbac-proxy:v0.13.0
    sizeBytes: 25405919
  - names:
    - quay.io/prometheus/node-exporter@sha256:f2269e73124dd0f60a7d19a2ce1264d33d08a985aed0ee6b0b89d0be470592cd
    - quay.io/prometheus/node-exporter:v1.3.1
    sizeBytes: 10347719
  - names:
    - 385808790715.dkr.ecr.eu-central-1.amazonaws.com/hello-app@sha256:88b205d7995332e10e836514fbfd59ecaf8976fc15060cd66e85cdcebe7fb356
    - 385808790715.dkr.ecr.eu-central-1.amazonaws.com/hello-app:1.0
    sizeBytes: 4892466
  - names:
    - 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5
    sizeBytes: 298689
  nodeInfo:
    architecture: amd64
    bootID: ed8b943d-b2b6-4d6d-8950-69c755afb559
    containerRuntimeVersion: containerd://1.4.13
    kernelVersion: 5.4.204-113.362.amzn2.x86_64
    kubeProxyVersion: v1.22.9-eks-810597c
    kubeletVersion: v1.22.9-eks-810597c
    machineID: ec20c717d81a8e759d5ba1a42cfa863c
    operatingSystem: linux
    osImage: Amazon Linux 2
    systemUUID: ec20c717-d81a-8e75-9d5b-a1a42cfa863c

tzneal commented 1 year ago

We use the csinode object to determine the volume limits per CSI driver. As far as I can tell, there is no CSI driver for these local volumes, so there is nothing to tell us that the volume won't mount.

The storage class provisioner is kubernetes.io/no-provisioner, i.e. there is no actual provisioner behind it, and that value doesn't appear to be unique to these local volumes.
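
For illustration, a csinode object for a node running a CSI driver such as the EBS CSI driver looks roughly like this (values are illustrative, not from this cluster); the per-driver allocatable count is what provides a volume limit, and these local volumes don't appear here at all:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: <node-name>                      # placeholder
spec:
  drivers:
    - name: ebs.csi.aws.com              # example CSI driver
      nodeID: <instance-id>              # placeholder
      allocatable:
        count: 25                        # per-driver attachable volume limit
      topologyKeys:
        - topology.ebs.csi.aws.com/zone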

The only solution I'm seeing is to add a pod anti-affinity rule to your StatefulSet so that you get no more than one pod per node:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: "name"
                  operator: In
                  values:
                  - hello-app
            topologyKey: "kubernetes.io/hostname"

kamialie commented 1 year ago

Yep, that's what I'm currently doing, but I was looking for a way to run multiple pods on a single node. If that's not possible, hostPath seems like an easier approach for now, since dynamic local storage provisioning is not coming to Kubernetes any time soon, and to EKS even later.
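
For example, a minimal hostPath sketch (the host path is illustrative) that would replace the volumeClaimTemplates in the StatefulSet above:

      containers:
        - name: hello-app
          image: <>
          volumeMounts:
            - name: data
              mountPath: /etc/data
      volumes:
        - name: data
          hostPath:
            path: /mnt/disks/nvme0n1     # illustrative host directory
            type: Directory

With hostPath every pod on the node shares the same host directory, so per-pod separation (for example a subPathExpr based on the pod name) would have to be handled in the pod spec.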

github-actions[bot] commented 1 year ago

Labeled for closure due to inactivity in 10 days.